I have a dataframe:
Email | Col1 | Col2 | Col3 | Name
--------------------------------------------------------------------
john.cena@gmail.com | CellStr11 | 1.4 | CellStr13 | John Cena
damian.doe@gmail.com | CellStr11 | 1.2 | CellStr13 | Matt Smith
john.smith@gmail.com | CellStr21 | 1.2 | CellStr23 | John Cena
I need to aggregate the values of cells of Col1, Col2 and Col3 by Name.
If row[Name] == row[Name]
then:
Email | Col1 | Col2 | Col3 | Name
------------------------------------------------------------------------------------------
john.cena@gmail.com | CellStr11 / CellStr21 | 1.4 / 1.2 | CellStr13 / CellStr23 | John Cena
damian.doe@gmail.com | CellStr11 | 1.2 | CellStr13 | Matt Smith
I don't have to care about any other columns, their data can be lost. First I though to about this apporach:
The issue here was that not all rows had the corresponding row with matching Name, which would result in some rows having the additional columns, some not (correct me if I'm wrong, if this apporach is ok that would also be fine, even better).
Is there a simple way to do it with df.apply
and lambda function? Or any other more complex way? I could use:
g2[['B', 'C']].apply(lambda x: x / x.sum())
but instead of summing just join as strings, but how can I differentiate between x.B and x.C?
If your dataframe is df:
df['Col2'] = df['Col2'].astype('str') # all the columns must be strings
gp_col = df.groupby(["Name"])[['Col1', 'Col2', 'Col3']] \
.agg(lambda x: " / ".join(x)) \
.reset_index()
display(gp_col) gives: