pythonpandasdataframe

Pandas dataframe - combine cell values as strings


I have a dataframe:

Email                | Col1      | Col2 | Col3      | Name
--------------------------------------------------------------------
john.cena@gmail.com  | CellStr11 | 1.4  | CellStr13 | John Cena
damian.doe@gmail.com | CellStr11 | 1.2  | CellStr13 | Matt Smith
john.smith@gmail.com | CellStr21 | 1.2  | CellStr23 | John Cena

I need to aggregate the values of cells of Col1, Col2 and Col3 by Name.

If row[Name] == row[Name] then:

Email                | Col1                  | Col2      | Col3                  | Name
------------------------------------------------------------------------------------------
john.cena@gmail.com  | CellStr11 / CellStr21 | 1.4 / 1.2 | CellStr13 / CellStr23 | John Cena
damian.doe@gmail.com | CellStr11             | 1.2       | CellStr13             | Matt Smith

I don't have to care about any other columns, their data can be lost. First I though to about this apporach:

  1. Split the dataframe in such a way that I get the other Name matching row (learnt how to do that here
  2. Append the columns to the end of the original dataframe

The issue here was that not all rows had the corresponding row with matching Name, which would result in some rows having the additional columns, some not (correct me if I'm wrong, if this apporach is ok that would also be fine, even better).

Is there a simple way to do it with df.apply and lambda function? Or any other more complex way? I could use: g2[['B', 'C']].apply(lambda x: x / x.sum()) but instead of summing just join as strings, but how can I differentiate between x.B and x.C?


Solution

  • If your dataframe is df:

    df['Col2'] = df['Col2'].astype('str')  # all the columns must be strings
    gp_col = df.groupby(["Name"])[['Col1', 'Col2', 'Col3']] \
               .agg(lambda x: " / ".join(x)) \
               .reset_index()
    

    display(gp_col) gives:

    result