Probably already asked before, buy I cannot find it even after 30 mins of searching.
I have two pandas dataframes with the same columns. The values match except for one column and I want to perform a full outer join, where I get both values if both are there and only one value if one of them is present. There are many matching columns, so I would prefer a solution where I do not have to apply something for each matching column.
Example All columns are the same if the value is in both df, only the frequency varies:
Gene GeneID Frequency
0 AA 1 10
1 BB 2 15
2 CC 3 12
Gene GeneID Frequency
0 AA 1 20
1 DD 4 29
Code:
import pandas as pd
t1 = [{"Gene": "AA", "GeneID": "1" , "Frequency": 10},
{"Gene": "BB", "GeneID": "2" , "Frequency": 15},
{"Gene": "CC", "GeneID": "3" , "Frequency": 12}]
t2 = [{"Gene": "AA", "GeneID": "1" , "Frequency": 20},
{"Gene": "DD", "GeneID": "4" , "Frequency": 29}]
f1 = pd.DataFrame(t1)
f2 = pd.DataFrame(t2)
m = pd.merge(f1,f2,on=['Gene','Gene'],how='outer')
Results in:
Gene GeneID_x Frequency_x GeneID_y Frequency_y
0 AA 1 10.0 1 20.0
1 BB 2 15.0 NaN NaN
2 CC 3 12.0 NaN NaN
3 DD NaN NaN 4 29.0
Now the ID is either in GeneID_x or or GeneID_y. I would like the following:
Gene GeneID Frequency_x Frequency_y
0 AA 1 10.0 20.0
1 BB 2 15.0 NaN
2 CC 3 12.0 NaN
3 DD 4 NaN 29.0
Of course I can iterate and fill the GeneID where needed, but there are many more columns that match. There has to be a better solution. I also tried concat with group by and aggregate. This works, however I cannot see if the frequency comes from the first or second df if there is only one value.
Thanks.
You can use:
m = (pd.merge(f1, f2, on='Gene', how='outer')
.rename(columns={'GeneID_x': 'GeneID'})
.assign(GeneID=lambda x: x['GeneID'].fillna(x.pop('GeneID_y'))))
Output:
>>> m
Gene GeneID Frequency_x Frequency_y
0 AA 1 10.0 20.0
1 BB 2 15.0 NaN
2 CC 3 12.0 NaN
3 DD 4 NaN 29.0
If (Gene, GeneID)
always represent a unique combination, you can use:
m = pd.merge(f1, f2, on=['Gene', 'GeneID'], how='outer')