gnuplot zipf

Plotting a "perfect" Zipf distribution from data in gnuplot


My goal is to take a simple .dat file and plot from it both the actual data and the theoretical points of a perfect Zipf distribution, that is, a distribution where every item's value is proportional to 1/(rank), so the item at rank r equals the top-ranked value divided by r.

For instance, my data for most followed Instagram accounts is:

# List of most followed users on instagram
# By rank and millions of followers
# From Wikipedia
# https://en.wikipedia.org/wiki/List_of_most_followed_users_on_Instagram
# rank, millions of followers

1 222
2 120
3 105
4 101
5 101
6 100
7 99 
8 93 
9 86 
10 85
11 80
12 79
13 76
14 73
15 71
16 69
17 67
18 65
19 63
20 63

From another thread I learned that I can append a third column with the ideal Zipf values per rank (in this case 222, 111, 74, 55.5, etc.) and then add a second plot with ,'' using 1:3. However, this requires doing the calculation by hand and appending the results to the original file, and that is the step I'm trying to avoid. Is this possible? How could I extend it to other distributions or calculations on the data?


Solution

  • Use stats to find the maximum value of the second column:

    stats 'file.dat' u 2 nooutput
    max = STATS_max
    

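    Then compute the ideal Zipf value for each row directly in the using specification with (max/$1), where $1 is the rank read from the first column, so no extra column is needed in the file: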

    plot 'file.dat' u 1:2 pt 7 t 'data',\
         '' u 1:(max/$1) w l t 'ideal Zipf'
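
To extend the same idea to other distributions or calculations, one option is to wrap the formula in a user-defined function and call it inside the using specification. The following is only a sketch under assumptions: the function name zipf, the exponent s, and its value 1.0 are illustrative and not part of the original answer, and 'file.dat' is the same placeholder file name as above.

```gnuplot
# Sketch: generalize the ideal curve with a user-defined function.
# 'file.dat', zipf() and s are illustrative names, not from the original answer.
stats 'file.dat' using 2 nooutput
max = STATS_max

s = 1.0                  # assumed exponent; s = 1 reproduces the max/rank curve
zipf(r) = max / r**s     # generalized Zipf (power-law) value at rank r

plot 'file.dat' using 1:2 with points pt 7 title 'data', \
     '' using 1:(zipf($1)) with lines title sprintf('ideal Zipf, s = %.1f', s)
```

Any other expression of the column values can be swapped in the same way, for example (max*exp(-$1)) for an exponential decay. If the exponent should come from the data rather than be fixed by hand, gnuplot's fit command (fit zipf(x) 'file.dat' using 1:2 via s) could estimate it before plotting.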