determine which columns are equal after pandas merge betwen two dataframes

I performed a merge with pandas using the suffix option like below:

df3 = df1.merge(df2, how='inner',on='key',suffixes=('_first', '_second'))

I now need to:

pairwise check i.e. (x_first == x_second ) for the approx 60 pairs of columns
If the columns are equal rename x_first to just x and drop x_second
If they are not equal keep both columns

How can this be done for a moderately large pandas dataframe (~6M rows by 200 columns )?

Solution

Example Code

You need to provide examples, not just describe what you want.

import pandas as pd

data1 = {'key': [1, 2, 3, 4, 5],
         'col1': [10, 20, 30, 40, 50],
         'col2': [10, None, 30, 40, 50],
         'col3': ['a', 'b', 'c', 'd', 'e'], 
         'col4': [1, 2, 3, 4, 5]}
data2 = {'key': [5, 2, 3, 4, 1],
         'col1': [50, 20, 30, 40, 10],
         'col2': [50, None, 30, 40, 10],
         'col3': ['a', 'b', 'c', 'd', 'x'], 
         'col4': [4, 5, 6, 7, 8], 
         'col5': [4, 5, 6, 7, 8]}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

Code

You can do this by finding the columns whose values are common to the key and specifying on with the key. Since you may also be dealing with columns with NaNs, I used the function compare to find columns with different values and exclude them from the common columns.

# find common columns between df1 and df2
cols = df1.columns.intersection(df2.columns)

# Use 'key' as the index, compare values, and find columns with differences
cols_diff = (
    df1[cols]
    .set_index('key')
    .compare(df2[cols].set_index('key').reindex(df1['key']))
    .columns
    .get_level_values(0)
)

# exclude columns with differences to find columns with identical values
cols_on = cols.difference(cols_diff).tolist()

# df1 and df2 using 'key' and columns with identical values
out = df1.merge(df2, how='inner', on=cols_on, suffixes=('_first', '_second'))

out: