python matplotlib cumulative-frequency

Plotting a cumulative histogram with exported data in Python


I am trying to plot a cumulative histogram similar to the one shown below. It shows the number of occurrences (y-axis) of the French pronoun “vous” in a text corpus (x-axis) running from word 0 to word 92,633. It was created with a corpus analysis application named TXM. TXM’s plots, however, do not meet the specific requirements of my publisher, so I would like to produce my own plots by exporting the data to Python. The problem is that the data exported by TXM is a bit puzzling, and I am wondering how it can be used to make plots: it’s a one-column txt file with integers.

Each integer indicates the position of one “vous” in the text corpus: word 2620 is one “vous,” word 3367 is another, and so on. One of my attempts with Matplotlib:

from matplotlib import pyplot as plt

pos = [2620,3367,3756,4522,4546,9914,9972,9979,9987,10013,10047,10087,10114,13635,13645,13646,13758,13771,13783,13796,23410,23420,28179,28265,28274,28297,28344,34579,34590,34612,40280,40449,40570,40932,40938,40969,40983,41006,41040,41069,41096,41120,41214,41474,41478,42524,42533,42534,45569,45587,45598,56450,57574,57587]
plt.bar(pos, 1)
plt.show()

But this doesn't come close. What steps should I follow to complete the plot?

Desired plot:

desired plot


Solution

  • With matplotlib, you could create the step plot as follows. where='post' means that the value changes at each x-position and holds until the next x-position. The x-values are the positions in the text; a zero is prepended so that the graph starts at zero occurrences, and the text length is appended at the end. The y-values are the numbers 0, 1, 2, ..., with the last value repeated so that the final step is drawn in full.

    from matplotlib import pyplot as plt
    from matplotlib.ticker import MultipleLocator, StrMethodFormatter
    import numpy as np
    
    pos = [2620,3367,3756,4522,4546,9914,9972,9979,9987,10013,10047,10087,10114,13635,13645,13646,13758,13771,13783,13796,23410,23420,28179,28265,28274,28297,28344,34579,34590,34612,40280,40449,40570,40932,40938,40969,40983,41006,41040,41069,41096,41120,41214,41474,41478,42524,42533,42534,45569,45587,45598,56450,57574,57587]
    text_len = 92633
    cum = np.arange(0, len(pos) + 1)
    fig, ax = plt.subplots(figsize=(12, 3))
    ax.step([0] + pos + [text_len], np.pad(cum, (0, 1), 'edge'), where='post', label=f'vous {len(pos)}')
    ax.xaxis.set_major_locator(MultipleLocator(5000)) # x-ticks every 5000
    ax.xaxis.set_major_formatter(StrMethodFormatter('{x:,.0f}')) # use the thousands separator
    ax.yaxis.set_major_locator(MultipleLocator(5)) # have a y-tick every 5
    ax.grid(True, ls=':') # show a grid with dotted lines
    ax.autoscale(enable=True, axis='x', tight=True) # disable padding x-direction
    ax.set_xlabel(f'T={text_len:,d}')
    ax.set_ylabel('Occurrences')
    ax.set_title("Progression of 'vous' in TCN")
    plt.legend() # add a legend (uses the label of ax.step)
    plt.tight_layout()
    plt.show()
    

    example plot
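
Since the question starts from a one-column text file rather than a hard-coded list, here is a minimal sketch of how the positions could be read and turned into the x and y arrays used above. The helper names `load_positions` and `step_coordinates` are made up for illustration, and the file layout (one integer per line, no header) is assumed from the description in the question.

```python
import numpy as np


def load_positions(source):
    """Read one integer per line (hypothetical helper; layout assumed from the question)."""
    # ndmin=1 keeps a single-line file from collapsing to a 0-d array
    positions = np.loadtxt(source, dtype=int, ndmin=1)
    # sort in case the export is not already in text order
    return np.sort(positions)


def step_coordinates(positions, text_len):
    """Return x and y arrays suitable for ax.step(x, y, where='post')."""
    positions = np.asarray(positions)
    n = len(positions)
    # x: start of the text, every occurrence, end of the text
    x = np.concatenate(([0], positions, [text_len]))
    # y: 0, 1, ..., n, with the final count repeated so the last step reaches text_len
    y = np.append(np.arange(n + 1), n)
    return x, y
```

With a file named, say, `vous_positions.txt` (an assumed name), `x, y = step_coordinates(load_positions('vous_positions.txt'), 92633)` gives arrays that can be passed directly to `ax.step(x, y, where='post')` in place of the list arithmetic in the code above.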