python-3.x pandas

Collapse values from multiple rows of a column into a list when all other column values are the same


I have a table with 7 columns where, for every few rows, 6 columns remain the same and only the 7th changes. I would like to merge each such run of rows into one row and combine the values of the 7th column into a list.

So if I have this dataframe:

   A  B  C
0  a  1  2
1  b  3  4
2  c  5  6
3  c  7  6

I would like to convert it to this:

   A       B  C
0  a       1  2
1  b       3  4
2  c  [5, 7]  6

Since the values of columns A and C are the same in rows 2 and 3, those rows get collapsed into a single row and the values of B are combined into a list.

Melt, explode, and pivot don't seem to have such functionality. How can I achieve this using pandas?
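
For reproducibility, the example above can be built with a snippet along these lines. This is only a sketch: the names `df` and `expected` are illustrative, and `expected` is written out by hand rather than computed.

```python
import pandas as pd

# Example input from the question: rows 2 and 3 agree on A and C.
df = pd.DataFrame({
    "A": ["a", "b", "c", "c"],
    "B": [1, 3, 5, 7],
    "C": [2, 4, 6, 6],
})

# Desired result, typed out manually: B holds a list only where rows were merged.
expected = pd.DataFrame({
    "A": ["a", "b", "c"],
    "B": [1, 3, [5, 7]],
    "C": [2, 4, 6],
})
```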


Solution

  • Use GroupBy.agg with a custom lambda function, then call DataFrame.reindex to restore the original column order:

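    # Collect a group into a list only when it has more than one row;
    # otherwise return the single value explicitly as a scalar.
    f = lambda x: x.tolist() if len(x) > 1 else x.iloc[0]
    df = df.groupby(['A','C'])['B'].agg(f).reset_index().reindex(df.columns, axis=1)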
    

    You can also build the list of grouping columns dynamically, from the columns that are allowed to change:

    changes = ['B']
    cols = df.columns.difference(changes).tolist()
    
    f = lambda x: x.tolist() if len(x) > 1 else x.iloc[0]
    df = df.groupby(cols)[changes].agg(f).reset_index().reindex(df.columns, axis=1)
    print(df)
       A       B  C
    0  a       1  2
    1  b       3  4
    2  c  [5, 7]  6
    

    If every value in the column should be a list, including single values, the solution is simpler:

    changes = ['B']
    cols = df.columns.difference(changes).tolist()
    
    df = df.groupby(cols)[changes].agg(list).reset_index().reindex(df.columns, axis=1)
    print(df)
       A       B  C
    0  a     [1]  2
    1  b     [3]  4
    2  c  [5, 7]  6
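
The two variants above could be wrapped in one small helper, sketched below. The function name `collapse_rows` and its `always_list` flag are hypothetical and not part of pandas. The sketch also assumes you want groups kept in order of first appearance (`sort=False`) and rows with NaN in the key columns retained (`dropna=False`, which needs pandas 1.1 or later). By default `groupby` sorts the group keys and drops NaN keys.

```python
import pandas as pd


def collapse_rows(df, changes, always_list=True):
    """Merge rows that agree on every column except those in `changes`.

    Hypothetical helper for illustration; not a pandas API.
    """
    # Group by every column that is not being collected, in original order.
    keys = [c for c in df.columns if c not in changes]
    if always_list:
        agg = list
    else:
        # Keep a bare scalar for groups that contain a single row.
        agg = lambda s: s.tolist() if len(s) > 1 else s.iloc[0]
    # sort=False keeps groups in order of first appearance;
    # dropna=False (pandas >= 1.1) keeps rows whose key columns contain NaN.
    out = df.groupby(keys, sort=False, dropna=False)[changes].agg(agg)
    return out.reset_index().reindex(df.columns, axis=1)


# Same data as the question, shuffled to show that row order is preserved.
df = pd.DataFrame({
    "A": ["c", "a", "c", "b"],
    "B": [5, 1, 7, 3],
    "C": [6, 2, 6, 4],
})

print(collapse_rows(df, ["B"]))
print(collapse_rows(df, ["B"], always_list=False))
```

The all-lists form is usually easier to work with downstream, because every cell in the column has the same type; the mixed scalar/list form matches the output asked for in the question.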