
Zipf Distribution: How do I measure Zipf Distribution using Python / Numpy


I have a file (let's say corpus.txt) of around 700 lines, each line containing numbers separated by -. For example:

86-55-267-99-121-72-336-89-211
59-127-245-343-75-245-245

First I need to read the data from the file, find the frequency of each number, measure the Zipf distribution of these numbers, and then plot the distribution. I have done the first two parts of the task. I am stuck on drawing the Zipf distribution.

I know that numpy.random.zipf(a, size=None) should be used for this. But I am finding it extremely hard to use it. Any pointers or code snippet would be extremely helpful.

Code:

# Counts how often each number occurs across all lines of the file
def calculateFrequency(fileDir):
  frequency = {}
  for line in fileDir:
    line = line.strip().split('-')
    for i in line:
      frequency.setdefault(i, 0)
      frequency[i] += 1
  return frequency

fileDir = open("corpus.txt")
frequency = calculateFrequency(fileDir)
fileDir.close()
print(frequency)

## TODO: Measure and draw zipf distribution

Solution

  • As stated, numpy.random.zipf(a, size=None) draws samples from a Zipf distribution with the specified parameter a > 1. It returns those samples as an array; it does not measure or plot anything by itself.
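To make that concrete, here is a minimal sketch of what the call returns. The seed and the sample size are arbitrary choices for illustration and are not taken from the question.

```python
import numpy as np

# Fix the seed so repeated runs draw the same samples (the value 0 is arbitrary).
np.random.seed(0)

a = 2.0  # distribution parameter, must be > 1
samples = np.random.zipf(a, size=1000)

# The result is an array of positive integers, not a plot.
print(samples[:10])
print(samples.min(), samples.dtype)
```

Most of the drawn values are small, with occasional very large ones, which is the heavy tail a Zipf distribution is known for.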

    However, since your difficulty was in using the numpy.random.zipf method, here is a naive attempt that follows the example on the SciPy zipf documentation site.

    Below is a simulated corpus.txt that has 10 lines with 10 random numbers per line. Numbers may repeat within a line and across lines to simulate recurrence.

    16-45-3-21-16-34-30-45-5-28
    11-40-22-10-40-48-22-23-22-6
    40-5-33-31-46-42-47-5-27-14
    5-38-12-22-19-1-11-35-40-24
    20-11-24-10-9-24-20-50-21-4
    1-25-22-13-32-14-1-21-19-2
    25-36-18-4-28-13-29-14-13-13
    37-6-36-50-21-17-3-32-47-28
    31-20-8-1-13-24-24-16-33-47
    26-17-39-16-2-6-15-6-40-46
    

    Working Code

    import csv
    import matplotlib.pyplot as plt
    from scipy import special
    import numpy as np

    # Read the '-' separated corpus data and count each number's frequency in a dict
    frequency = {}
    # Open in text mode; the csv module in Python 3 does not accept a binary ('rb') file
    with open('corpus.txt', newline='') as csvfile:
        reader = csv.reader(csvfile, delimiter='-', quotechar='|')
        for line in reader:
            for word in line:
                count = frequency.get(word, 0)
                frequency[word] = count + 1

    # Define the Zipf distribution parameter
    a = 2.

    # Get the list of counts from frequency and convert it to a NumPy array
    # (list() is needed because dict.values() is a view in Python 3)
    s = np.array(list(frequency.values()))

    # Display the histogram of the counts, along with the Zipf probability function.
    # density=True replaces the normed=True argument removed in newer Matplotlib versions.
    count, bins, ignored = plt.hist(s, 50, density=True)
    x = np.arange(1., 50.)
    # zeta(a) is the normalizing constant; zetac(a) would give zeta(a) - 1
    y = x**(-a) / special.zeta(a)
    plt.plot(x, y/max(y), linewidth=2, color='r')
    plt.show()
    

    Plot of the histogram of the samples, along with the probability density function.
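
Since the question also asks how to measure the distribution, one common approach is to rank the counts from most to least frequent and fit a straight line to log(count) against log(rank). The sketch below assumes a frequency dict like the one built above. The helper name estimate_zipf_exponent and the toy counts are hypothetical, and a least-squares fit on log-log values is only a rough estimate of the exponent, not a rigorous test.

```python
import numpy as np

def estimate_zipf_exponent(frequency):
    """Fit log(count) = intercept - s * log(rank) by least squares (hypothetical helper)."""
    # Sort counts from most to least frequent; rank 1 is the most frequent item.
    counts = np.array(sorted(frequency.values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(counts) + 1, dtype=float)
    # A degree-1 polynomial fit returns the slope first, then the intercept.
    slope, intercept = np.polyfit(np.log(ranks), np.log(counts), 1)
    return ranks, counts, -slope, intercept

# Toy counts that follow count = 120 / rank exactly, so the fit should give an exponent near 1.
toy_frequency = {"a": 120, "b": 60, "c": 40, "d": 30, "e": 24}
ranks, counts, s, intercept = estimate_zipf_exponent(toy_frequency)
print("estimated exponent:", round(s, 3))
```

To look at the fit, plotting ranks against counts with plt.loglog(ranks, counts, 'o') should give points that fall roughly on a straight line if the data is Zipf-like. With real data such as corpus.txt, expect the points to scatter around the line, especially at the low-frequency end.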