I have a document with the following structure:
CUSTOMERID1
conversation-id-123
conversation-id-123
conversation-id-123
CUSTOMERID2
conversation-id-456
conversation-id-789
I'd like to parse the document and build a frequency distribution plot with the number of conversations on the X axis and the number of customers on the Y axis. Does anyone know the easiest way to do this with Python?
I'm familiar with the plotting part, but I'm struggling to parse the data into the right data structure to build the plot from. Thank you in advance for any help!
You can try the following:
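>>> import pandas as pd
>>> dict_ = {}
>>> with open('file.csv') as f:
...     for line in f:
...         if line.startswith('CUSTOMERID'):
...             # start a new (empty) conversation list for this customer
...             dict_[line.strip('\n')] = list_ = []
...         else:
...             # store the conversation line split into its '-' parts
...             list_.append(line.strip().split('-'))
...
>>> df = pd.DataFrame.from_dict(dict_, orient='index').stack()
>>> # take the last '-' part of each conversation, then count rows per customer
>>> df.transform(lambda x: x[-1]).groupby(level=0).count().plot(kind='bar')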
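Output: a bar chart with one bar per customer ID, showing the number of conversation lines under that customer.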
If you want only 1 and 2 on the X axis, change the line dict_[line.strip('\n')] = list_ = [] to dict_[line.strip('CUSTOMERID/\n')] = list_ = [].
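If what you are after is the distribution described in the question (number of conversations on the X axis, number of customers with that many conversations on the Y axis), a minimal standard-library sketch could look like the one below. It assumes that every line starting with CUSTOMERID begins a new customer and that each following line is one conversation, so repeated conversation IDs are counted separately. The function names conversations_per_customer and frequency_distribution are made up for illustration.

```python
from collections import Counter


def conversations_per_customer(lines):
    """Map each customer ID to the number of conversation lines under it."""
    counts = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith('CUSTOMERID'):
            # assumption: a CUSTOMERID line starts a new customer block
            current = line
            counts[current] = 0
        elif current is not None:
            # assumption: every other line is one conversation
            counts[current] += 1
    return counts


def frequency_distribution(counts):
    """Map a conversation count to the number of customers with that count."""
    return Counter(counts.values())


# Sample data copied from the question
sample = """CUSTOMERID1
conversation-id-123
conversation-id-123
conversation-id-123
CUSTOMERID2
conversation-id-456
conversation-id-789
""".splitlines()

per_customer = conversations_per_customer(sample)
dist = frequency_distribution(per_customer)
```

To read from a file instead, pass the open file object in place of sample. If repeated IDs should count as a single conversation, collect the lines of each customer into a set and use its length. The resulting mapping can then be plotted with, for example, pd.Series(dist).sort_index().plot(kind='bar').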