[SOLVED] How to add a zipf curve to a bar plot of word frequency?

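import matplotlib.pyplot as plt

# lst is a list of (count, term) tuples sorted by descending count
plt.figure()
plt.bar([key for val, key in lst], [val for val, key in lst])
plt.xlabel("Terms")
plt.ylabel("Counts")
plt.show()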

I have a list of (count, term) tuples sorted in descending order of count (i.e., the number of times a term appears in the document), and I plot the data as above. Now suppose I want to show that the distribution of terms violates Zipf's law from computational linguistics. Can I add a Zipf curve (f = c / rank) to this plot without altering the x-axis? How?

Solution

Internally, matplotlib numbers the positions on a categorical x-axis 0, 1, 2, .... To draw a curve at the same positions as the bars, use range(len(lst)) for the x-values. Zipf's law ranks items starting from 1, so the corresponding y-values are zipf.pmf(p, alpha), where p runs over 1, 2, 3, ... and alpha is the Zipf exponent (scipy.stats.zipf requires alpha > 1). The pmf sums to 1 over all ranks, while the bars show raw counts, so these values need to be multiplied by the total count to put the curve on the same scale as the bars. This post can be used to find the best-fitting alpha.

import matplotlib.pyplot as plt
from scipy.stats import zipf

lst = [(60462, 'Italy'), (46755, 'Spain'), (10423, 'Greece'), (10197, 'Portugal'), (8737, 'Serbia'), (4105, 'Croatia'),
       (3281, 'Bosnia and\nHerzegovina'), (2878, 'Albania'), (2083, 'North\nMacedonia'), (2079, 'Slovenia'),
       (628, 'Montenegro'), (442, 'Malta'), (77, 'Andorra'), (34, 'San Marino'), (34, 'Gibraltar'), (1, 'Holy See')]

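plt.bar([key for val, key in lst], [val for val, key in lst], color='limegreen')
alpha = 1.37065874  # Zipf exponent (scipy.stats.zipf requires alpha > 1)
total = sum(val for val, key in lst)  # scales the pmf, which sums to 1, up to raw counts
# the categorical bars sit at x = 0, 1, 2, ..., while Zipf ranks start at 1
plt.plot(range(len(lst)), [zipf.pmf(rank, alpha) * total for rank in range(1, len(lst) + 1)], color='crimson', lw=3)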
plt.ylabel("Population")
plt.xticks(rotation='vertical')
plt.tight_layout()
plt.show()
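
If you would rather estimate the exponent yourself, or draw the plain f = c / rank curve from the question (which zipf.pmf cannot do, since it needs alpha > 1), one option is to fit c / rank**s by ordinary least squares on the log-log values. This is only a rough sketch and not the method used for the alpha above: fit_power_law and power_law_curve are made-up helper names, and the fit assumes all counts are positive and already sorted in descending order.

```python
import math


def fit_power_law(counts):
    """Least-squares fit of log(count) = log(c) - s * log(rank).

    Assumes `counts` are positive and sorted in descending order, so the
    value at index i has rank i + 1. Returns the pair (c, s).
    """
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # slope is -s and intercept is log(c) in the log-log line
    return math.exp(intercept), -slope


def power_law_curve(c, s, n):
    """Return c / rank**s for rank = 1, ..., n."""
    return [c / rank ** s for rank in range(1, n + 1)]


# counts taken from the lst in the answer above
counts = [60462, 46755, 10423, 10197, 8737, 4105, 3281, 2878,
          2083, 2079, 628, 442, 77, 34, 34, 1]
c, s = fit_power_law(counts)
curve = power_law_curve(c, s, len(counts))
```

To overlay the result, plt.plot(range(len(counts)), curve, color='crimson') should line up with the bars, because they sit at x = 0, 1, 2, .... For the classic f = c / rank curve, power_law_curve(counts[0], 1, len(counts)) pins the curve to the top-ranked count; that choice of c is itself an assumption. Note that a log-log fit weights every rank equally in log space, so the tiny counts in the tail pull on the exponent much more than they would in a fit on the raw counts.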