Tags: python, pandas

Pandas Merge - Elegant way to deal with same columns


Assume we have two data frames with columns as follows:

df1[['name', 'year', 'col1', 'col2', 'col3']]
df2[['name', 'year', 'col2', 'col3', 'col4']]

I want to merge df1 and df2 on name and year, keeping the values of col2 and col3 from df1; wherever a value is None, the value from df2 should be used instead.

I know how to do this the traditional way: merge df1 and df2, then use ffill().

Since my data-cleaning process involves many steps of merging different DataFrames with the same columns, the code gets cluttered when I repeatedly have to call ffill() and drop columns. Does pd.merge have a built-in option for this?
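For reference, a minimal sketch of the "traditional way" mentioned above, under the assumption that the shared columns are hardcoded for the sample data and that ffill(axis=1) across the suffixed column pair is one reading of the ffill() step:

```python
import pandas as pd

# the sample frames from the question
df1 = pd.DataFrame({'name': ['a', 'a', 'b', 'b', 'c', 'c'],
                    'year': [2000, 2001, 2002, 2003, 2004, 2005],
                    'col1': [1, 2, 3, 4, 5, 6],
                    'col2': [0, 2, 4, 6, 8, None],
                    'col3': [1, 3, 5, 7, None, 9]})
df2 = pd.DataFrame({'name': ['b', 'b', 'c', 'c', 'd', 'd'],
                    'year': [2003, 2004, 2004, 2005, 2006, 2007],
                    'col2': [10, 20, 30, None, 50, 60],
                    'col3': [100, 300, 500, 700, None, 900],
                    'col4': [5, 6, 7, 8, 9, 10]})

# outer merge keeps both copies of the shared columns, suffixed
merged = df1.merge(df2, on=['name', 'year'], how='outer', suffixes=('', '_y'))

# for each shared column, keep the df1 value and fall back to df2:
# ffill across [col_y, col] so the df1 value (last) wins when present
for col in ['col2', 'col3']:  # hardcoded for this example
    merged[col] = merged[[f'{col}_y', col]].ffill(axis=1).iloc[:, -1]

# drop the temporary suffixed columns
merged = merged.drop(columns=['col2_y', 'col3_y'])
```

This is exactly the boilerplate (fill, then drop) that piles up when repeated across many merges.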

Sample code:

df1 = pd.DataFrame({'name': ['a', 'a', 'b', 'b', 'c', 'c'],
                    'year': [2000, 2001, 2002, 2003, 2004, 2005],
                    'col1': [1,2,3,4,5,6],
                    'col2': [0,2,4,6,8,None],
                    'col3': [1,3,5,7,None,9]})

df2 = pd.DataFrame({'name': ['b', 'b', 'c', 'c', 'd', 'd'],
                    'year': [2003, 2004, 2004, 2005, 2006, 2007],
                    'col2': [10,20,30,None,50,60],
                    'col3': [100,300,500,700,None,900],
                    'col4': [5,6,7,8,9,10]})

Input:

df1

  name  year  col1  col2  col3
0    a  2000     1  0.00  1.00
1    a  2001     2  2.00  3.00
2    b  2002     3  4.00  5.00
3    b  2003     4  6.00  7.00
4    c  2004     5  8.00   NaN
5    c  2005     6   NaN  9.00

df2

  name  year  col2   col3  col4
0    b  2003 10.00 100.00     5
1    b  2004 20.00 300.00     6
2    c  2004 30.00 500.00     7
3    c  2005   NaN 700.00     8
4    d  2006 50.00    NaN     9
5    d  2007 60.00 900.00    10

Desired output:

  name  year  col1  col2   col3  col4
0    a  2000  1.00  0.00   1.00   NaN
1    a  2001  2.00  2.00   3.00   NaN
2    b  2002  3.00  4.00   5.00   NaN
3    b  2003  4.00  6.00   7.00  5.00
4    b  2004   NaN 20.00 300.00  6.00
5    c  2004  5.00  8.00 500.00  7.00
6    c  2005  6.00   NaN   9.00  8.00
7    d  2006   NaN 50.00    NaN  9.00
8    d  2007   NaN 60.00 900.00 10.00

Solution

  • Assuming unique combinations of name/year, you could concat and groupby.first:

    out = pd.concat([df1, df2]).groupby(['name', 'year'], as_index=False).first()
    

    For a more generic merge, you could perform two merges, each excluding the common non-key columns from one side, then combine_first:

    cols = ['name', 'year']
    common = df1.columns.intersection(df2.columns).difference(cols)
    
    out = (df1.merge(df2.drop(columns=common), on=cols, how='outer')
           .combine_first(df1.drop(columns=common).merge(df2, on=cols, how='outer'))
          )
    

    Another option with a single merge:

    cols = ['name', 'year']
    common = df1.columns.intersection(df2.columns).difference(cols)
    
    out = df1.merge(df2, on=cols, suffixes=(None, '_right'), how='outer')
    tmp = out.filter(regex='_right$')
    
    out[common] = out[common].fillna(tmp.set_axis(common, axis=1))
    
    out.drop(columns=tmp.columns, inplace=True)
    

    And finally with a groupby.first:

    out = (df1.merge(df2, on=cols, suffixes=(None, '_right'), how='outer')
              .rename(columns=lambda x: x.removesuffix('_right'))
              .groupby(level=0, axis=1, sort=False).first()
          )
    
    # groupby(..., axis=1) is deprecated; variant for recent pandas versions
    # using a transpose instead:
    out = (df1.merge(df2, on=cols, suffixes=(None, '_right'), how='outer')
              .rename(columns=lambda x: x.removesuffix('_right'))
              .T.groupby(level=0, sort=False).first().T
          )
    

    Output:

      name  year  col1  col2   col3  col4
    0    a  2000   1.0   0.0    1.0   NaN
    1    a  2001   2.0   2.0    3.0   NaN
    2    b  2002   3.0   4.0    5.0   NaN
    3    b  2003   4.0   6.0    7.0   5.0
    4    b  2004   NaN  20.0  300.0   6.0
    5    c  2004   5.0   8.0  500.0   7.0
    6    c  2005   6.0   NaN    9.0   8.0
    7    d  2006   NaN  50.0    NaN   9.0
    8    d  2007   NaN  60.0  900.0  10.0