Assume we have two data frames with columns as follows:
df1[['name', 'year', 'col1', 'col2', 'col3']]
df2[['name', 'year', 'col2', 'col3', 'col4']]
I want to merge df1 and df2 on name and year, keeping all values of col2 and col3 from df1; wherever a value is None, I want to use the value from df2 instead.
I know how to do this the traditional way: merge df1 and df2, then use ffill(). But since my data-cleaning process involves many steps of merging different DataFrames with the same columns, the code becomes messy when I keep having to use ffill() and drop columns. Does pd.merge have a built-in option for this?
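For reference, the repetitive workaround looks roughly like this (just a sketch of the merge-then-ffill idea; the '_right' suffix and the hard-coded column list are only for illustration):
# outer merge keeps df1's col2/col3 names and suffixes df2's copies
out = df1.merge(df2, on=['name', 'year'], suffixes=('', '_right'), how='outer')
for col in ['col2', 'col3']:
    # ffill(axis=1) carries the df2 value into the df1 column where it is NaN
    out[col] = out[[f'{col}_right', col]].ffill(axis=1)[col]
out = out.drop(columns=['col2_right', 'col3_right'])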
Sample code:
import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'a', 'b', 'b', 'c', 'c'],
                    'year': [2000, 2001, 2002, 2003, 2004, 2005],
                    'col1': [1, 2, 3, 4, 5, 6],
                    'col2': [0, 2, 4, 6, 8, None],
                    'col3': [1, 3, 5, 7, None, 9]})
df2 = pd.DataFrame({'name': ['b', 'b', 'c', 'c', 'd', 'd'],
                    'year': [2003, 2004, 2004, 2005, 2006, 2007],
                    'col2': [10, 20, 30, None, 50, 60],
                    'col3': [100, 300, 500, 700, None, 900],
                    'col4': [5, 6, 7, 8, 9, 10]})
Input:
df1
name year col1 col2 col3
0 a 2000 1 0.00 1.00
1 a 2001 2 2.00 3.00
2 b 2002 3 4.00 5.00
3 b 2003 4 6.00 7.00
4 c 2004 5 8.00 NaN
5 c 2005 6 NaN 9.00
df2
name year col2 col3 col4
0 b 2003 10.00 100.00 5
1 b 2004 20.00 300.00 6
2 c 2004 30.00 500.00 7
3 c 2005 NaN 700.00 8
4 d 2006 50.00 NaN 9
5 d 2007 60.00 900.00 10
Desired output:
name year col1 col2 col3 col4
0 a 2000 1.00 0.00 1.00 NaN
1 a 2001 2.00 2.00 3.00 NaN
2 b 2002 3.00 4.00 5.00 NaN
3 b 2003 4.00 6.00 7.00 5.00
4 b 2004 NaN 20.00 300.00 6.00
5 c 2004 5.00 8.00 500.00 7.00
6 c 2005 6.00 NaN 9.00 8.00
7 d 2006 NaN 50.00 NaN 9.00
8 d 2007 NaN 60.00 900.00 10.00
Assuming unique combinations of name/year, you could concat and groupby.first:
out = pd.concat([df1, df2]).groupby(['name', 'year'], as_index=False).first()
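Because the rows of df1 come first in the concatenation, first() returns df1's non-NaN value for each column within a name/year group and falls back to df2's value otherwise. This relies on each name/year pair appearing at most once per frame.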
For a more generic merge, you could perform two merges, excluding the common non-key columns, then combine_first:
cols = ['name', 'year']
common = df1.columns.intersection(df2.columns).difference(cols)
out = (df1.merge(df2.drop(columns=common), on=cols, how='outer')
          .combine_first(df1.drop(columns=common).merge(df2, on=cols, how='outer'))
       )
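The first merge keeps df1's version of the common columns (col2/col3) and the second keeps df2's; since both outer merges order the keys the same way, combine_first aligns the two results and patches the NaNs of the first with values from the second.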
Another option with a single merge:
cols = ['name', 'year']
common = df1.columns.intersection(df2.columns).difference(cols)
out = df1.merge(df2, on=cols, suffixes=(None, '_right'), how='outer')
tmp = out.filter(regex='_right$')
out[common] = out[common].fillna(tmp.set_axis(common, axis=1))
out.drop(columns=tmp.columns, inplace=True)
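Here set_axis(common, axis=1) relabels the *_right columns with their original names, so fillna can align them column by column with out[common].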
And finally with a groupby.first:
out = (df1.merge(df2, on=cols, suffixes=(None, '_right'), how='outer')
          .rename(columns=lambda x: x.removesuffix('_right'))
          .groupby(level=0, axis=1, sort=False).first()
       )

# variant for recent pandas versions (groupby with axis=1 is deprecated):
out = (df1.merge(df2, on=cols, suffixes=(None, '_right'), how='outer')
          .rename(columns=lambda x: x.removesuffix('_right'))
          .T.groupby(level=0, sort=False).first().T
       )
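After the rename, the frame has duplicate column labels (two col2 columns, two col3 columns); grouping on those labels and taking first keeps, per row, the first non-NaN value among the duplicates, with sort=False preserving the original column order. The transposed variant does the same thing row-wise, since groupby(..., axis=1) is deprecated as of pandas 2.1.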
Output:
name year col1 col2 col3 col4
0 a 2000 1.0 0.0 1.0 NaN
1 a 2001 2.0 2.0 3.0 NaN
2 b 2002 3.0 4.0 5.0 NaN
3 b 2003 4.0 6.0 7.0 5.0
4 b 2004 NaN 20.0 300.0 6.0
5 c 2004 5.0 8.0 500.0 7.0
6 c 2005 6.0 NaN 9.0 8.0
7 d 2006 NaN 50.0 NaN 9.0
8 d 2007 NaN 60.0 900.0 10.0