I have seven dataframes with hundreds of rows each that I need to combine on a column. There are instances where these seven dataframes have columns with the same names. In those instances, I would like to combine the data therein and delimit with a semicolon.
For example, if Row 1 in DF1 through DF7 have the same identifier, I would like Col1 in each dataframe (given they have the same name) to be combined to read:
dfdata1; dfdata2; ...; dfdata7
In cases where a column name is unique, I'd like it to appear in the final combined dataframe.
I've included a simple example:
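import pandas as pd

data1 = pd.DataFrame(
    [['Banana', 'Sally', 'CA'], ['Apple', 'Gretta', 'MN'], ['Orange', 'Samantha', 'NV']],
    columns=['Product', 'Cashier', 'State']
)
# Only the last row has a fourth value; the shorter rows are padded with None.
data2 = pd.DataFrame(
    [['Shirt', '', 'CA', None], ['Shoe', 'Trish', 'MN', None], ['Socks', 'Paula', 'NM', 'Hourly']],
    columns=['Product', 'Cashier', 'State', 'Type']
)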
My expected output:
How do I do this?
Instead of merging, concatenate. Combining the strings can then be done using groupby.agg.
# concatenate and groupby to join the strings
df = (
    pd.concat([data1, data2])
    .groupby('State', as_index=False)
    .agg(lambda x: '; '.join(el for el in x if pd.notna(el)))
)
print(df)
   State        Product        Cashier    Type
0     CA  Banana; Shirt        Sally;
1     MN    Apple; Shoe  Gretta; Trish
2     NM          Socks          Paula  Hourly
3     NV         Orange       Samantha
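The same concatenate-then-aggregate idea should extend to all seven dataframes by passing them in one list. Below is a minimal sketch under a few assumptions: the helper names join_non_empty and combine_frames are hypothetical (not pandas functions), the 'Type' column name is inferred from the printed output above, and empty strings are treated as missing so that a blank cell does not leave a dangling separator.

```python
import pandas as pd


def join_non_empty(values, sep="; "):
    # Hypothetical helper: skip NaN/None and empty strings before joining.
    return sep.join(str(v) for v in values if pd.notna(v) and str(v) != "")


def combine_frames(frames, key, sep="; "):
    # Hypothetical helper: stack every frame, then collapse to one row per key.
    # Columns that a frame lacks are filled with NaN by pd.concat.
    stacked = pd.concat(frames, ignore_index=True)
    return stacked.groupby(key, as_index=False).agg(
        lambda col: join_non_empty(col, sep)
    )


# Sample data from the question; 'Type' is an assumed column name.
data1 = pd.DataFrame(
    [["Banana", "Sally", "CA"], ["Apple", "Gretta", "MN"], ["Orange", "Samantha", "NV"]],
    columns=["Product", "Cashier", "State"],
)
data2 = pd.DataFrame(
    [["Shirt", "", "CA", None], ["Shoe", "Trish", "MN", None], ["Socks", "Paula", "NM", "Hourly"]],
    columns=["Product", "Cashier", "State", "Type"],
)

# With seven frames, the list would simply hold all seven.
combined = combine_frames([data1, data2], key="State")
print(combined)
```

Because blanks are skipped here, the Cashier value for CA comes out as Sally with no trailing semicolon. If the empty strings should be kept, drop the `str(v) != ""` check.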