I am trying to plot a cumulative histogram similar to the one shown below. It shows the number of occurrences (y-axis) of the French pronoun “vous” in a text corpus (x-axis), running from word 0 to word 92,633. It was created with a corpus analysis application named TXM. TXM’s plots, however, do not meet the specific requirements of my publisher, so I would like to produce my own plots by exporting the data to Python. The problem is that the data exported by TXM is a bit puzzling, and I am wondering how it can be used to make plots: it’s a one-column txt file with integers.
Each integer gives the position of one “vous” in the text corpus: word 2620 is a “vous”, word 3367 is another, and so on. One of my attempts with Matplotlib:
from matplotlib import pyplot as plt
pos = [2620,3367,3756,4522,4546,9914,9972,9979,9987,10013,10047,10087,10114,13635,13645,13646,13758,13771,13783,13796,23410,23420,28179,28265,28274,28297,28344,34579,34590,34612,40280,40449,40570,40932,40938,40969,40983,41006,41040,41069,41096,41120,41214,41474,41478,42524,42533,42534,45569,45587,45598,56450,57574,57587]
plt.bar(pos, 1)
plt.show()
But this doesn't come close. What steps should I follow to complete the plot?
Desired plot:
With Matplotlib, you could create the step plot with ax.step as follows. The option where='post' means that the y-value changes at each x-position and stays constant until the next x-position.
The x-values are the positions in the text; a zero is prepended so that the graph starts with zero occurrences, and the text length is appended at the end. The y-values are the numbers 0, 1, 2, ..., where the last value is repeated to draw the last step in full.
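To make the construction of the two arrays concrete, here is a minimal sketch on toy data. The helper name step_coordinates and the toy values (two occurrences in a 10-word text) are made up for illustration and are not part of the original data.

```python
import numpy as np

def step_coordinates(positions, text_len):
    """Return (x, y) arrays for a cumulative step plot drawn with where='post'.

    Hypothetical helper for illustration only.
    """
    # x: start at word 0, then each occurrence, then the end of the text
    x = np.concatenate(([0], positions, [text_len]))
    # y: 0, 1, ..., n, with the final count repeated to draw the last step
    y = np.pad(np.arange(len(positions) + 1), (0, 1), mode='edge')
    return x, y

# Toy data: occurrences at words 3 and 7 in a 10-word text
x, y = step_coordinates([3, 7], text_len=10)
print(x.tolist())  # [0, 3, 7, 10]
print(y.tolist())  # [0, 1, 2, 2]
```

The count is 0 from word 0 to word 3, 1 from word 3 to word 7, and 2 from word 7 to the end of the text. The complete plotting code for the real data follows.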
from matplotlib import pyplot as plt
from matplotlib.ticker import MultipleLocator, StrMethodFormatter
import numpy as np
pos = [2620,3367,3756,4522,4546,9914,9972,9979,9987,10013,10047,10087,10114,13635,13645,13646,13758,13771,13783,13796,23410,23420,28179,28265,28274,28297,28344,34579,34590,34612,40280,40449,40570,40932,40938,40969,40983,41006,41040,41069,41096,41120,41214,41474,41478,42524,42533,42534,45569,45587,45598,56450,57574,57587]
text_len = 92633
cum = np.arange(0, len(pos) + 1)
fig, ax = plt.subplots(figsize=(12, 3))
ax.step([0] + pos + [text_len], np.pad(cum, (0, 1), 'edge'), where='post', label=f'vous {len(pos)}')
ax.xaxis.set_major_locator(MultipleLocator(5000)) # x-ticks every 5000
ax.xaxis.set_major_formatter(StrMethodFormatter('{x:,.0f}')) # use the thousands separator
ax.yaxis.set_major_locator(MultipleLocator(5)) # have a y-tick every 5
ax.grid(True, ls=':') # show a grid with dotted lines (the 'b' keyword was removed in newer Matplotlib versions)
ax.autoscale(enable=True, axis='x', tight=True) # disable padding x-direction
ax.set_xlabel(f'T={text_len:,d}')
ax.set_ylabel('Occurrences')
ax.set_title("Progression of 'vous' in TCN")
ax.legend() # add a legend (uses the label of ax.step)
plt.tight_layout()
plt.show()
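Since TXM exports the positions as a one-column text file, the hard-coded pos list could be replaced by reading that file. The following is a minimal sketch, assuming the file contains one integer per line and nothing else. The helper name load_positions is hypothetical, and the demo reads from an in-memory buffer so that it runs without a real export file.

```python
import io
import numpy as np

def load_positions(source):
    """Read one integer per line and return a sorted list of positions.

    'source' may be a file path or an open file-like object.
    Hypothetical helper; assumes the file holds nothing but integers.
    """
    # atleast_1d keeps a one-line file from collapsing to a 0-d array
    values = np.atleast_1d(np.loadtxt(source, dtype=int))
    return sorted(values.tolist())

# Demo with an in-memory stand-in for the exported file
sample = io.StringIO("2620\n3367\n3756\n")
pos = load_positions(sample)
print(pos)  # [2620, 3367, 3756]
```

With a real export, passing the path of the txt file instead of the buffer should give the list that the plotting code expects. Sorting is only a safeguard in case the positions are not already in ascending order.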