I performed a merge with pandas using the suffix option like below:
df3 = df1.merge(df2, how='inner', on='key', suffixes=('_first', '_second'))
I now need to:

- verify that x_first == x_second for the approx. 60 pairs of suffixed columns
- rename x_first to just x
- drop x_second

How can this be done for a moderately large pandas DataFrame (~6M rows by 200 columns)?
Example Code
Since the question doesn't include sample data, here is a reproducible example to work with:
import pandas as pd

data1 = {'key': [1, 2, 3, 4, 5],
         'col1': [10, 20, 30, 40, 50],
         'col2': [10, None, 30, 40, 50],
         'col3': ['a', 'b', 'c', 'd', 'e'],
         'col4': [1, 2, 3, 4, 5]}
data2 = {'key': [5, 2, 3, 4, 1],
         'col1': [50, 20, 30, 40, 10],
         'col2': [50, None, 30, 40, 10],
         'col3': ['a', 'b', 'c', 'd', 'x'],
         'col4': [4, 5, 6, 7, 8],
         'col5': [4, 5, 6, 7, 8]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
Code
You can do this by finding the columns whose values are identical in both frames (after aligning on the key) and passing all of them to on, so no suffixes are created for them in the first place. Since some columns may contain NaNs, I used DataFrame.compare to find the columns with differing values and excluded them from the common columns; compare treats a NaN in the same position in both frames as equal, so NaN-only differences won't be flagged.
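As a quick, minimal sketch of that NaN behavior (hypothetical data, not from the question): compare only reports positions where the values genuinely differ, so a NaN facing another NaN is skipped.

```python
import pandas as pd

a = pd.DataFrame({'x': [1.0, None, 3.0]})
b = pd.DataFrame({'x': [1.0, None, 9.0]})

# Row 1 has NaN in both frames -> not reported as a difference.
# Only row 2 (3.0 vs 9.0) appears in the result.
diff = a.compare(b)
```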
# find the columns common to df1 and df2
cols = df1.columns.intersection(df2.columns)

# use 'key' as the index, align df2 to df1's key order,
# and find the columns whose values differ
cols_diff = (
    df1[cols]
    .set_index('key')
    .compare(df2[cols].set_index('key').reindex(df1['key']))
    .columns
    .get_level_values(0)
)

# exclude the differing columns to get the columns with identical values
cols_on = cols.difference(cols_diff).tolist()

# merge df1 and df2 on 'key' plus the columns with identical values
out = df1.merge(df2, how='inner', on=cols_on, suffixes=('_first', '_second'))
out:
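If you would rather keep your original merge and reconcile the suffixed pairs afterwards (verify, rename, drop), here is a sketch of that approach. It uses small made-up frames for illustration; the NaN-tolerant equality check (a == b, or both NaN) mirrors what compare does.

```python
import pandas as pd

# hypothetical sample frames, not the question's data
df1 = pd.DataFrame({'key': [1, 2, 3],
                    'col1': [10, 20, 30],
                    'col2': [10.0, None, 30.0]})
df2 = pd.DataFrame({'key': [3, 2, 1],
                    'col1': [30, 20, 10],
                    'col2': [30.0, None, 10.0],
                    'col5': [6, 5, 4]})

df3 = df1.merge(df2, how='inner', on='key', suffixes=('_first', '_second'))

for c_first in [c for c in df3.columns if c.endswith('_first')]:
    base = c_first[:-len('_first')]
    c_second = base + '_second'
    a, b = df3[c_first], df3[c_second]
    # equal where the values match, or where both are NaN
    if ((a == b) | (a.isna() & b.isna())).all():
        df3 = df3.rename(columns={c_first: base}).drop(columns=c_second)
```

This is O(rows) per column pair, so it should stay manageable even at ~6M rows by ~60 pairs; pairs that fail the check keep their suffixes so you can inspect them.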