python parsing frequency-distribution

Parsing Nested Row Text Document for Frequency Distribution Plot with Python


I have a document with the following structure:

CUSTOMERID1
    conversation-id-123
    conversation-id-123
    conversation-id-123
CUSTOMERID2
    conversation-id-456
    conversation-id-789

I'd like to parse the document to get a frequency distribution plot with the number of conversations on the X axis and the number of customers on the Y axis. Does anyone know the easiest way to do this with Python?

I'm familiar with the frequency distribution plot piece but am struggling with how to parse the data into the right data structure to build the plot. Thank you in advance for any help you can provide!


Solution

  • You can try the following:

    
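    >>> import pandas as pd
    >>> dict_ = {}
    >>> with open('file.csv') as f:
    ...     for line in f:
    ...         if line.startswith('CUSTOMERID'):
    ...             # new customer: start an empty list of conversations
    ...             dict_[line.strip('\n')] = list_ = []
    ...         else:
    ...             # indented line: a conversation of the current customer
    ...             list_.append(line.strip().split('-'))
    ...
    >>> # one row per (customer, conversation), then count rows per customer
    >>> df = pd.DataFrame.from_dict(dict_, orient='index').stack()
    >>> df.transform(lambda x:x[-1]).groupby(level=0).count().plot(kind='bar')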
    

    Output: [image not preserved: bar chart with one bar per customer, showing that customer's conversation count]

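    If you want only 1 and 2 on the X axis, change the line dict_[line.strip('\n')] = list_ = [] to dict_[line.strip('CUSTOMERID/\n')] = list_ = []. Note that str.strip treats its argument as a set of characters to remove from both ends, not as a prefix, so this only works when the rest of the ID contains none of those characters at its edges, as with the numeric IDs here.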
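
The code above plots one bar per customer. For the distribution the question asks for (number of conversations on the X axis, number of customers on the Y axis), one more counting step is needed. Here is a minimal sketch using only the standard library. The function names and the prefix parameter are made up for illustration, and it assumes that every indented line counts as one conversation, so repeated IDs are counted separately.

```python
from collections import Counter


def conversations_per_customer(lines, prefix="CUSTOMERID"):
    """Map each customer header line to the number of lines listed under it."""
    counts = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(prefix):
            # header line: start counting for a new customer
            current = line
            counts[current] = 0
        elif current is not None:
            # assumption: each indented line is one conversation,
            # even if the same ID appears more than once
            counts[current] += 1
    return counts


def frequency_distribution(counts):
    """Map a conversation count to the number of customers with that count."""
    return dict(Counter(counts.values()))


# sample data copied from the question
sample = """\
CUSTOMERID1
    conversation-id-123
    conversation-id-123
    conversation-id-123
CUSTOMERID2
    conversation-id-456
    conversation-id-789
""".splitlines()

per_customer = conversations_per_customer(sample)
distribution = frequency_distribution(per_customer)
print(per_customer)   # {'CUSTOMERID1': 3, 'CUSTOMERID2': 2}
print(distribution)   # {3: 1, 2: 1}
```

To read from a file instead, pass the open file object as lines. With pandas and matplotlib installed, pd.Series(distribution).sort_index().plot(kind='bar') should draw the distribution with conversation counts on the X axis. If repeated IDs should count only once per customer, collecting the IDs into a set per customer and taking its length would be one way to adapt this.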