python-3.xpandastuples

how to convert tuples in a column rows of a pandas dataframe into repeating rows and columns?


I have a dataframe which contains the following data (only 3 samples are provided here):

data = {'Department' : ['D1', 'D2', 'D3'],
'TopWords' : [[('cat', 6), ('project', 6), ('dog', 6), ('develop', 4), ('smooth', 4), ('efficient', 4), ('administrative', 4), ('procedure', 4), ('establishment', 3), ('matter', 3)],
[('management', 21), ('satisfaction', 12), ('within', 9), ('budget', 9), ('township', 9), ('site', 9), ('periodic', 9), ('admin', 9), ('maintain', 9), ('guest', 6)],
[('manage', 2), ('ir', 2), ('mines', 2), ('implementation', 2), ('clrg', 2), ('act', 2), ('implementations', 2), ('office', 2), ('maintenance', 2), ('administration', 2)]]}


# Create DataFrame
df = pd.DataFrame(data)

Basically, each row contains a tuple of top 10 words along with their frequencies in each of Department.

I wanted to create a dataframe where (let department name get repeated and) each row contains the word from tuple in one column and frequency count in the other columns so that it should look like something this:

Department  Word            Counts
D1          cat             6
D1          project         6
D1          dog             6
D1          develop         4
D1          smooth          4
D1          efficient       4
D1          administrative  4
D1          procedure       4
D1          establishment   3
D1          matter          3
D2          management      21
D2          satisfaction    12
D2          within          9
D2          budget          9
D2          township        9

Is there any work around this type of conversion?


Solution

  • First, use DataFrame.explode to separate the list elements into different rows. Then split the tuples into different columns, e.g. using DataFrame.assign + Series.str

    res = (
        df.explode('TopWords', ignore_index=True)
          .assign(Word=lambda df: df['TopWords'].str[0], 
                  Counts=lambda df: df['TopWords'].str[1])
          .drop(columns='TopWords')
    )  
    

    Output:

    >>> res 
    
       Department             Word  Counts
    0          D1              cat       6
    1          D1          project       6
    2          D1              dog       6
    3          D1          develop       4
    4          D1           smooth       4
    5          D1        efficient       4
    6          D1   administrative       4
    7          D1        procedure       4
    8          D1    establishment       3
    9          D1           matter       3
    10         D2       management      21
    11         D2     satisfaction      12
    12         D2           within       9
    13         D2           budget       9
    14         D2         township       9
    15         D2             site       9
    16         D2         periodic       9
    17         D2            admin       9
    18         D2         maintain       9
    19         D2            guest       6
    20         D3           manage       2
    21         D3               ir       2
    22         D3            mines       2
    23         D3   implementation       2
    24         D3             clrg       2
    25         D3              act       2
    26         D3  implementations       2
    27         D3           office       2
    28         D3      maintenance       2
    29         D3   administration       2
    

    As @sammywemmy suggested, if you are dealing with a considerable amount of data, it will be faster if you wrangle it before loading it into a DataFrame.

    Another way of doing it using a nested loop

    data = {'Department' : ['D1', 'D2', 'D3'],
    'TopWords' : [[('cat', 6), ('project', 6), ('dog', 6), ('develop', 4), ('smooth', 4), ('efficient', 4), ('administrative', 4), ('procedure', 4), ('establishment', 3), ('matter', 3)],
    [('management', 21), ('satisfaction', 12), ('within', 9), ('budget', 9), ('township', 9), ('site', 9), ('periodic', 9), ('admin', 9), ('maintain', 9), ('guest', 6)],
    [('manage', 2), ('ir', 2), ('mines', 2), ('implementation', 2), ('clrg', 2), ('act', 2), ('implementations', 2), ('office', 2), ('maintenance', 2), ('administration', 2)]]}
    
    records = []
    for idx, top_words_list in enumerate(data['TopWords']):
        for word, count in top_words_list:
            rec = {
                'Department': data['Department'][idx],
                'Word': word,
                'Count': count
            }
            records.append(rec)
            
    res = pd.DataFrame(records)