Unfolding the Mystery of Zipf’s Law

The world is a very complex place, full of very complex events. Yet no matter how complex those events are, an interesting phenomenon called Zipf’s law sometimes shows up in your data. Zipf’s law states that in many situations, the size of an object or unit is inversely proportional to a power s (s > 0) of its rank. It is named after George Kingsley Zipf (1902–1950), an American scientist who worked in linguistics, mathematics, statistics, and philology. We do not yet fully understand the various processes that produce a Zipf distribution, yet it has significant implications for predictive modeling. In this article, I will explore what Zipf’s law is and why it is such an interesting phenomenon.

 

What is Zipf’s law?

One of the easiest ways to understand the principle behind Zipf’s law is to apply it to language. In English, the word “the” makes up about 6% of everything we write and say. Interestingly, the second most common word, “of”, makes up about 3%, half as much as “the”. The next word, “and”, appears about 1/3 as often as “the”, the one after that, “a”, about 1/4 as often, and so on.
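The pattern can be written as share(rank) = share(1) / rank. Here is a minimal R sketch of that rule, assuming the approximate 6% share quoted for “the” and an exponent of exactly 1 (the helper name zipf_share is made up for this illustration):

```r
# Expected share of the word at a given rank under an idealized Zipf's law
# with exponent 1: share(rank) = top_share / rank.
# `zipf_share` is a hypothetical helper; 0.06 is the approximate share of "the".
zipf_share <- function(rank, top_share = 0.06) {
  top_share / rank
}

# Expected shares for the four most common words ("the", "of", "and", "a")
ranks <- 1:4
data.frame(rank = ranks, expected_share = zipf_share(ranks))
```

Real corpora only follow this rule approximately, so the numbers are a rough guide rather than exact predictions.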

Using that logic, the word “data” (the 531st most common word) should appear about 1/531 as often as “the”, “science” (the 957th most common word) about 1/957 as often, and so forth. If we test the same thing on this essay so far, we find that the same property roughly applies (without great accuracy, due to the shortness of the text). You can try it yourself on your own text. The ratios for rare words only hold if your sample is large enough, so do not expect these exact ratios for rarer words in a 400-word essay.
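If you want to try this on your own text, a sketch along the following lines could work. The helper name word_frequencies and the sample sentence are made up for this illustration, and the tokenizer is deliberately naive (it only keeps the letters a to z and apostrophes):

```r
# Count word frequencies in a piece of text and compare them with the
# 1/rank pattern. `word_frequencies` is a hypothetical helper name.
word_frequencies <- function(text) {
  # Lower-case, then split on anything that is not a letter or an apostrophe
  tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))
  tokens <- tokens[nchar(tokens) > 0]
  counts <- sort(table(tokens), decreasing = TRUE)
  data.frame(
    word = names(counts),
    count = as.integer(counts),
    rank = seq_along(counts),
    # Ratio of each count to the top count; Zipf's law predicts about 1/rank
    ratio_to_top = as.numeric(counts) / as.numeric(counts[1]),
    stringsAsFactors = FALSE
  )
}

sample_text <- "the cat and the dog and the bird saw the end of the road"
head(word_frequencies(sample_text))
```

On a text this short the ratios will be far from 1/rank; replace sample_text with a few thousand words to see the pattern emerge.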

In this text, the five most used words (“the”, “of”, “and”, “a”, “is”) almost follow Zipf’s law despite its shortness. Interestingly enough, word frequencies across the whole English corpus approximately follow this type of distribution. It is quite mystifying to see something as complex as language follow such a simple rule. No matter how complex your text is, it will tend to fall into this pattern.

With the current grammar structure, it appears that this law is unavoidable: even the most well-written, beautiful, sound, or controversial writing will follow Zipf’s law. The Zipf distribution does not only apply to English. The same principle has been observed in French, Italian, German, Arabic, Swahili, and other languages, which makes the law look universal. Moreover, it seems like nothing escapes it. Even if you mash random keys on your keyboard and treat every press of the space bar as a word boundary, a large enough sample will end up with a Zipf-like distribution.
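The keyboard-mashing claim is easy to simulate. The sketch below is only an illustration under arbitrary assumptions: a "keyboard" of five letters plus a space bar, 200,000 uniformly random key presses, and a fixed seed.

```r
set.seed(42)  # arbitrary seed so the run is reproducible

# "Mash the keyboard": draw characters uniformly from five letters plus a space
# (the alphabet size and text length are arbitrary choices for this sketch)
keys <- c(letters[1:5], " ")
typed <- paste(sample(keys, 200000, replace = TRUE), collapse = "")

# Split on runs of spaces and drop empty strings
random_words <- unlist(strsplit(typed, " +"))
random_words <- random_words[nchar(random_words) > 0]

# Rank the "words" by how often they occur
freq <- sort(table(random_words), decreasing = TRUE)
rank_freq <- data.frame(rank = seq_along(freq), count = as.integer(freq))

# Slope of log(count) against log(rank); a Zipf-like distribution gives a
# roughly straight, downward-sloping line on log-log axes
fit <- lm(log(count) ~ log(rank), data = rank_freq)
slope <- coef(fit)[["log(rank)"]]
slope
```

A log-log regression is a crude way to judge a power law, but a clearly negative slope is enough to show that even meaningless "words" end up with a heavily skewed rank-frequency curve.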

 

How is Zipf’s law applied in the real world?

The real-world applications of the Zipfian principle are astonishing. Understanding how Zipf’s law applies to languages can change the way we learn them.

Plotting the cumulative frequency of the top N words in a few languages leads to an interesting observation. According to the plot, you only need to learn about 500 words to understand about 75% of whatever people are saying. To understand 85% of what people are saying in a given language, you need to learn about 2000 words.

So, theoretically, if we take the English language and memorize the 2000 most common words, we will understand about 85% of what people are saying (check the figure below). This is very significant because the English corpus contains roughly 150 000 words.

 


library(data.table, quietly = TRUE)
library(ggplot2)
library(curl)  # lets fread() download over https

languages <- c("en", "de", "es", "ru", "fr")
language_names <- c("English", "German", "Spanish", "Russian", "French")

# Read the top 30,000 entries of the 50k word-frequency list for one language
read_language <- function(lang) {
  words_temp <- fread(paste0("https://raw.githubusercontent.com/hermitdave/",
                             "FrequencyWords/master/content/2016/",
                             lang, "/", lang, "_50k.txt"))[1:30000]
  setnames(words_temp, c("word", "count"))
  words_temp[, count := as.numeric(count)]  # avoid integer overflow in cumsum()
  setorder(words_temp, -count)              # most frequent word first
  words_temp[, rank := seq_len(.N)]
  # fraction of all occurrences (within these 30,000 words) covered by the top N
  words_temp[, cumulative_fraction := cumsum(count) / sum(count)]
  words_temp[, language := lang]
  words_temp
}
words <- rbindlist(lapply(languages, read_language))

# add a (0, 0) point per language so the graph starts at 0
origin <- data.table(word = "", count = 0, rank = 0,
                     cumulative_fraction = 0, language = languages)
words <- rbind(origin, words)
words[, language := factor(language, levels = languages,
                           labels = language_names)]

ggplot(words, aes(rank, cumulative_fraction, colour = language)) +
  geom_line() +
  scale_color_brewer(palette = "Set1") +
  scale_y_continuous(labels = scales::percent,
                     breaks = seq(0, 1, 0.1)) +
  xlim(0, 2000) +
  xlab("Top N words") +
  ylab("Percentage of all words")
Top N most frequent words by language

 

Zipf’s law is also closely related to the Pareto principle, applied here to languages. The Pareto principle is a rule of thumb stating that many distributions follow a rough 80/20 split. For example, about 80% of a business’s sales come from 20% of its customers, about 80% of what we say uses only 20% of our vocabulary, about 80% of a population lives in roughly 20% of the cities, and so on. This pattern applies not only to languages but also to people, cities, earthquakes, business, chemistry, and many other fields. Of course, it is never going to be 100% accurate, but given all the differences we have in this world, the trend is quite fascinating.
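To see how an idealized Zipf distribution relates to the 80/20 rule, we can compute the share of all word occurrences covered by the most frequent 20% of words. This is a sketch under stated assumptions: frequencies exactly proportional to 1/rank, and a vocabulary of 150 000 words, the rough figure quoted earlier. The helper name zipf_top_share is made up for this illustration.

```r
# Share of all word occurrences covered by the most frequent `top_fraction`
# of the vocabulary, under an idealized Zipf's law with exponent `s`.
# `zipf_top_share` is a hypothetical helper written for this illustration.
zipf_top_share <- function(vocab_size, top_fraction = 0.2, s = 1) {
  weights <- 1 / seq_len(vocab_size)^s   # frequency proportional to 1 / rank^s
  n_top <- ceiling(top_fraction * vocab_size)
  sum(weights[seq_len(n_top)]) / sum(weights)
}

# Assumed vocabulary of 150,000 words
zipf_top_share(150000)
```

Under these assumptions the top 20% of words account for roughly 87% of all occurrences, which is in the same ballpark as the 80/20 rule. Real word counts deviate from the ideal 1/rank curve, so treat the number as indicative only.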

 

Why do we have this Zipf mystery? 

 

Principle of least effort

Zipf himself tried to answer that question with his “principle of least effort”. Zipf hypothesized that all of our decisions are based on whichever path appears easiest, or requires the least effort. Since languages originate from humans, they should not be an exception. For a speaker, the least effort means getting by with a small vocabulary of short, general words that can be reused for many meanings.

For a listener, however, it is easier when the speaker uses a rich vocabulary, with a specific and descriptive word for each object or action. After thousands of years of development under these two opposing pressures, we ended up with languages that follow a Zipfian structure: a small number of short words used very often, and an extremely long list of specific words used rarely. That long tail of rare words gives rise to hapax legomena, which are words used only once in a particular collection.

 

Preferential attachment

Another possible explanation for Zipf’s law is the concept of preferential attachment. Take Internet domains, for example. A limited number of domains like Google, Facebook, Instagram, Twitter, and LinkedIn rule the Internet. Google sits at the top in terms of monthly users or pageviews according to Alexa ranks, while billions of small websites barely get any traffic.

If we think about how this works and why it happens, we realize that websites with higher traffic tend to attract even more traffic because they are more likely to be recommended by search engines and the like. The more inbound links a website has, the easier it is for that website to be found. For instance, if we look at a map of the Internet, we see that the big hubs exist because a lot of websites point to them. The more websites get created, the more links those hubs receive.

 

Map of the Internet

 

To illustrate this phenomenon, let's assume you are a website owner. You will most likely link to Facebook or Twitter rather than to a small site like Freddycoffeeshop. To get traffic from social media, your Zipfian mind prefers the domain with the most links. That is why popular domains get more popular. It is the same reason why rich people get richer.
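The "popular gets more popular" mechanism can be simulated with a toy model. The sketch below is an assumption-laden illustration, not a model of real web traffic: at each step a brand-new site appears with a small probability p_new, otherwise one new link goes to an existing site chosen with probability proportional to the links it already has. The function name and all parameter values are made up.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Toy preferential-attachment model (hypothetical helper and parameters):
# each step either creates a new site holding one link, or adds a link to an
# existing site picked in proportion to its current number of links.
simulate_sites <- function(n_steps = 20000, p_new = 0.05) {
  links <- 1L                        # start with a single site holding one link
  for (i in seq_len(n_steps)) {
    if (runif(1) < p_new) {
      links <- c(links, 1L)          # a new site appears with its first link
    } else {
      target <- sample.int(length(links), 1, prob = links)
      links[target] <- links[target] + 1L
    }
  }
  sort(links, decreasing = TRUE)     # rank sites from most to least linked
}

links <- simulate_sites()
head(links, 5)                       # the biggest hubs

# Share of all links held by the top 20% of sites
n_top <- ceiling(0.2 * length(links))
top_share <- sum(links[seq_len(n_top)]) / sum(links)
top_share
```

Even though every site follows the same rule, the early sites snowball into hubs while most late arrivals keep only a link or two.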

 

Implementation of the Zipf principle with R

We can apply the Zipf principle in our predictive models. Entities that have a Zipfian distribution can be modeled with hierarchical Bayes models, and simulations can be run to fit the model parameters to data. A goodness-of-fit test then checks how well the Zipf distribution describes the data. Finally, we can predict how the system might change over time, at least for some of its parameters, such as its mean. A few R packages allow us to work with Zipfian data.
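As a much simpler stand-in for the parameter-fitting step (this is not a hierarchical Bayes model), the sketch below estimates a single Zipf exponent by maximum likelihood in base R. It assumes the probability of the word at rank r is proportional to r^(-s) over the observed ranks only. The helper name fit_zipf_exponent and the synthetic data are made up for this illustration.

```r
# Maximum-likelihood estimate of the Zipf exponent s for rank-ordered counts,
# assuming P(rank = r) is proportional to r^(-s) for r = 1..length(counts).
# `fit_zipf_exponent` is a hypothetical helper, not part of any package.
fit_zipf_exponent <- function(counts) {
  counts <- sort(counts, decreasing = TRUE)
  ranks <- seq_along(counts)
  neg_log_lik <- function(s) {
    log_norm <- log(sum(ranks^(-s)))          # log of the normalizing constant
    -sum(counts * (-s * log(ranks) - log_norm))
  }
  # One-dimensional search over a plausible range of exponents
  optimize(neg_log_lik, interval = c(0.01, 5))$minimum
}

# Synthetic check: counts built from an exact Zipf law with exponent 1.2
true_s <- 1.2
probs <- (1:1000)^(-true_s) / sum((1:1000)^(-true_s))
synthetic_counts <- round(1e6 * probs)
estimated_s <- fit_zipf_exponent(synthetic_counts)
estimated_s
```

On real data the estimate depends on how many ranks you include, so it is worth comparing it with the fits produced by a dedicated package.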

The first is zipfR, a package that contains statistical models and utilities for the analysis of word frequency distributions. The utilities include functions for loading, manipulating, and visualizing word frequency data and vocabulary growth curves. The package also implements several statistical models for the distribution of word frequencies in a population. poweRlaw is another package for analyzing power-law distributions such as Zipf’s.

In this last example, we will use the zipfR package to fit the Zipf-Mandelbrot model to the frequency spectrum of Dickens’ works (Dickens.spc), which ships with the package.


library(zipfR)

## load the frequency spectrum of Dickens' works
data(Dickens.spc)

## fit a Zipf-Mandelbrot model to the Dickens data and look at the model summary
zm <- lnre("zm", Dickens.spc)
zm

## Zipf-Mandelbrot LNRE model.
## Parameters:
##    Shape:          alpha = 0.387172
##    Upper cutoff:       B = 0.001315937
##  [ Normalization:      C = 35.70691 ]
## Population size: S = Inf
## Sampling method: Poisson, with exact calculations.
##
## Parameters estimated from sample of size N = 2817208:
##                  V       V1      V2      V3      V4      V5
##    Observed: 41116 14220.00 5002.00 2894.00 2047.00 1466.00 ...
##    Expected: 41116 16384.66 5020.49 2699.06 1763.05 1273.92 ...
##
## Goodness-of-fit (multivariate chi-squared test):
##          X2 df p
##    3775.631 14 0

## plot the observed and expected frequency spectrum
zm.spc <- lnre.spc(zm, N(Dickens.spc))
plot(Dickens.spc, zm.spc,
     xlab = "Most common words",
     ylab = "Frequency",
     ylim = c(0, 17500))
legend(27, 16000,
       c("Observed Frequency", "Expected Frequency"),
       col = c("black", "red"), pch = 15,
       box.col = "white", cex = 1)
Model Prediction Results: Observed vs Predicted Word frequencies

 

A Last Word ...

Zipf’s law appears in many aspects of our day-to-day life, so understanding when it occurs and what it means can be a powerful asset for any data scientist. Let me know what you think about the Zipf principle.

 

Feel free to use any information from this page. I'd appreciate it if you can simply link to this article as the source. If you have any additional questions, you can reach out to malick@malicksarr.com.

 


 


 

 

 

 

 

 
