pythonpandas

determine which columns are equal after pandas merge betwen two dataframes


I performed a merge with pandas using the suffix option like below:

df3 = df1.merge(df2, how='inner',on='key',suffixes=('_first', '_second'))

I now need to:

  1. pairwise check i.e. (x_first == x_second ) for the approx 60 pairs of columns
  2. If the columns are equal rename x_first to just x and drop x_second
  3. If they are not equal keep both columns

How can this be done for a moderately large pandas dataframe (~6M rows by 200 columns )?


Solution

  • Example Code

    You need to provide examples, not just describe what you want.

    import pandas as pd
    
    data1 = {'key': [1, 2, 3, 4, 5],
             'col1': [10, 20, 30, 40, 50],
             'col2': [10, None, 30, 40, 50],
             'col3': ['a', 'b', 'c', 'd', 'e'], 
             'col4': [1, 2, 3, 4, 5]}
    data2 = {'key': [5, 2, 3, 4, 1],
             'col1': [50, 20, 30, 40, 10],
             'col2': [50, None, 30, 40, 10],
             'col3': ['a', 'b', 'c', 'd', 'x'], 
             'col4': [4, 5, 6, 7, 8], 
             'col5': [4, 5, 6, 7, 8]}
    
    df1 = pd.DataFrame(data1)
    df2 = pd.DataFrame(data2)
    

    example

    Code

    You can do this by finding the columns whose values are common to the key and specifying on with the key. Since you may also be dealing with columns with NaNs, I used the function compare to find columns with different values and exclude them from the common columns.

    # find common columns between df1 and df2
    cols = df1.columns.intersection(df2.columns)
    
    # Use 'key' as the index, compare values, and find columns with differences
    cols_diff = (
        df1[cols]
        .set_index('key')
        .compare(df2[cols].set_index('key').reindex(df1['key']))
        .columns
        .get_level_values(0)
    )
    
    # exclude columns with differences to find columns with identical values
    cols_on = cols.difference(cols_diff).tolist()
    
    # df1 and df2 using 'key' and columns with identical values
    out = df1.merge(df2, how='inner', on=cols_on, suffixes=('_first', '_second'))
    

    out:

    enter image description here