pythonpandasmergeouter-join

Pandas merge dataframe with the same columns and one one varying


Probably already asked before, buy I cannot find it even after 30 mins of searching.

I have two pandas dataframes with the same columns. The values match except for one column and I want to perform a full outer join, where I get both values if both are there and only one value if one of them is present. There are many matching columns, so I would prefer a solution where I do not have to apply something for each matching column.

Example All columns are the same if the value is in both df, only the frequency varies:

  Gene  GeneID  Frequency
0   AA       1         10
1   BB       2         15
2   CC       3         12
  Gene  GeneID  Frequency
0   AA       1         20
1   DD       4         29

Code:

import pandas as pd


t1 = [{"Gene": "AA", "GeneID": "1" , "Frequency": 10},
{"Gene": "BB", "GeneID": "2" , "Frequency": 15},
{"Gene": "CC", "GeneID": "3" , "Frequency": 12}]


t2 = [{"Gene": "AA", "GeneID": "1" , "Frequency": 20},
{"Gene": "DD", "GeneID": "4" , "Frequency": 29}]

f1 = pd.DataFrame(t1)
f2 = pd.DataFrame(t2)

m = pd.merge(f1,f2,on=['Gene','Gene'],how='outer')

Results in:

  Gene GeneID_x  Frequency_x GeneID_y  Frequency_y
0   AA        1         10.0        1         20.0
1   BB        2         15.0      NaN          NaN
2   CC        3         12.0      NaN          NaN
3   DD      NaN          NaN        4         29.0

Now the ID is either in GeneID_x or or GeneID_y. I would like the following:

  Gene GeneID  Frequency_x   Frequency_y
0   AA        1         10.0         20.0
1   BB        2         15.0          NaN
2   CC        3         12.0          NaN
3   DD        4         NaN          29.0

Of course I can iterate and fill the GeneID where needed, but there are many more columns that match. There has to be a better solution. I also tried concat with group by and aggregate. This works, however I cannot see if the frequency comes from the first or second df if there is only one value.

Thanks.


Solution

  • You can use:

    m = (pd.merge(f1, f2, on='Gene', how='outer')
           .rename(columns={'GeneID_x': 'GeneID'})
           .assign(GeneID=lambda x: x['GeneID'].fillna(x.pop('GeneID_y'))))
    

    Output:

    >>> m
      Gene GeneID  Frequency_x  Frequency_y
    0   AA      1         10.0         20.0
    1   BB      2         15.0          NaN
    2   CC      3         12.0          NaN
    3   DD      4          NaN         29.0
    

    If (Gene, GeneID) always represent a unique combination, you can use:

    m = pd.merge(f1, f2, on=['Gene', 'GeneID'], how='outer')