python python-2.7 matplotlib zipf

Constructing a Zipf distribution plot with matplotlib and adding a fitted line


I have a list of paragraphs, and I want to draw a Zipf (rank-frequency) plot of the tokens in their combined text.

My code is below:

from collections import Counter
from pylab import *

# targeted_paragraphs is the list of paragraph strings
paragraphs = " ".join(targeted_paragraphs)
# Count tokens once over the combined text; looping over the joined
# string would iterate over single characters
frequency = Counter(paragraphs.split())
tokens = list(frequency.keys())
counts = array(list(frequency.values()))

ranks = arange(1, len(counts)+1)
indices = argsort(-counts)
frequencies = counts[indices]
loglog(ranks, frequencies, marker=".")
title("Zipf plot for Combined Article Paragraphs")
xlabel("Frequency Rank of Token")
ylabel("Absolute Frequency of Token")
grid(True)
for n in list(logspace(-0.5, log10(len(counts)-1), 20).astype(int)):
    dummy = text(ranks[n], frequencies[n], " " + tokens[indices[n]],
                 verticalalignment="bottom",
                 horizontalalignment="left")

PURPOSE: I am trying to draw a fitted line on this graph and assign its value to a variable, but I do not know how to do either. Any help with both of these issues would be much appreciated.


Solution

  • I know it's been a while since this question was asked. However, I came across a possible solution to this problem on the SciPy site.
    I thought I would post it here in case anyone else needs it.

    I didn't have the paragraph data, so here is a quickly made dict called frequency whose values are random counts standing in for the token counts.

    We then take its values and convert them to a numpy array, and define the Zipf distribution parameter a, which has to be > 1.

    Finally, we display the histogram of the samples along with the probability density function.

    Working Code:

    import random
    import matplotlib.pyplot as plt
    from scipy import special
    import numpy as np

    # Generate a sample dict with random values to simulate the paragraph data
    frequency = {}
    for i in range(50):
        frequency[i] = random.randint(1, 50)

    # list() keeps this working where dict.values() returns a view
    counts = list(frequency.values())
    tokens = list(frequency.keys())

    # Convert the counts to a numpy array
    s = np.array(counts)

    # Zipf distribution parameter; has to be > 1
    a = 2.

    # Display the histogram of the samples,
    # along with the probability density function.
    # density=True normalizes the histogram (older matplotlib called this normed=True)
    count, bins, ignored = plt.hist(s, 50, density=True)
    x = np.arange(1., 50.)
    # Zipf pmf: x**(-a) / zeta(a); note that zetac(a) is zeta(a) - 1
    y = x**(-a) / special.zeta(a, 1)
    plt.plot(x, y/max(y), linewidth=2, color='r')
    plt.title("Zipf plot for Combined Article Paragraphs")
    plt.xlabel("Frequency Rank of Token")
    plt.ylabel("Absolute Frequency of Token")
    plt.show()
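
The code above overlays a Zipf curve with a fixed a = 2; it does not fit anything to the data. If what you want is a fitted line whose value ends up in a variable, one possible approach is an ordinary least-squares fit in log-log space with np.polyfit. This is only a rough sketch: it assumes all counts are positive and that a straight line on log-log axes is an adequate model. The helper names fit_zipf_line and plot_zipf_fit are made up for this illustration and are not part of any library.

```python
import numpy as np


def fit_zipf_line(counts):
    """Least-squares line through (log10 rank, log10 frequency).

    Returns (slope, intercept, ranks, frequencies). For Zipf-like data the
    slope is negative, and -slope is a rough estimate of the exponent.
    Assumes every count is > 0, since the logarithm is taken.
    """
    frequencies = np.sort(np.asarray(counts, dtype=float))[::-1]  # descending
    ranks = np.arange(1, len(frequencies) + 1)
    slope, intercept = np.polyfit(np.log10(ranks), np.log10(frequencies), 1)
    return slope, intercept, ranks, frequencies


def plot_zipf_fit(counts):
    """Draw the rank-frequency points and the fitted line on log-log axes."""
    # Imported here so that the fit itself only needs NumPy
    import matplotlib.pyplot as plt

    slope, intercept, ranks, frequencies = fit_zipf_line(counts)
    # Map the fitted line back from log space to the original scale
    fitted = 10 ** (intercept + slope * np.log10(ranks))
    plt.loglog(ranks, frequencies, marker=".", linestyle="none")
    plt.loglog(ranks, fitted, color="r", linewidth=2,
               label="slope = %.2f" % slope)
    plt.legend()
    plt.show()
    return slope, intercept
```

With the question's data this could be called as slope, intercept = plot_zipf_fit(list(frequency.values())), which leaves the fitted line's parameters in two variables. A least-squares fit on log-log axes gives the tail of the distribution as much weight as the head, so treat the resulting slope as an approximation.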
    
