Tags: python, pandas, dataframe, pandas-groupby, pandas-merge

Identify columns which cause non match between two dataframes


I am comparing two DataFrames (df1 and df2), each of which should have unique rows for a given combination of keys. That is, there are no duplicates for the key columns Col1 and Col3.

import pandas as pd
                        #    OK   NOK   NOK
df1 = pd.DataFrame({'Col1': ['A', 'A', 'B'],
                    'Col2': ['J', 'K', 'M'],
                    'Col3': ['1', '2', '3'],
                    'Col4': ['UA', 'RU', 'EU']
                   })
                        #    OK   NOK  NOK single
df2 = pd.DataFrame({'Col1': ['A', 'A', 'B', 'C'],
                    'Col2': ['J', 'L', 'N', 'O'],
                    'Col3': ['1', '2', '3', '4'],
                    'Col4': ['UA', 'RU', 'non-EU', 'CN']
                   })

matching_key = ['Col1', 'Col3']
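For completeness, the uniqueness assumption can be verified up front with `duplicated` (a sanity check I added, not part of the original question):

```python
import pandas as pd

df1 = pd.DataFrame({'Col1': ['A', 'A', 'B'],
                    'Col2': ['J', 'K', 'M'],
                    'Col3': ['1', '2', '3'],
                    'Col4': ['UA', 'RU', 'EU']})
df2 = pd.DataFrame({'Col1': ['A', 'A', 'B', 'C'],
                    'Col2': ['J', 'L', 'N', 'O'],
                    'Col3': ['1', '2', '3', '4'],
                    'Col4': ['UA', 'RU', 'non-EU', 'CN']})
matching_key = ['Col1', 'Col3']

# duplicated(subset=...) flags repeated key combinations;
# both checks pass for the sample data above.
assert not df1.duplicated(subset=matching_key).any()
assert not df2.duplicated(subset=matching_key).any()
```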

First, I naively identify which rows do not match:

non = df1.merge(df2, indicator=True, how='outer').query('_merge != "both"')
non['_merge'] = (non['_merge'].str.replace('left_only', 'df1').str.replace('right_only', 'df2'))

# `non` is now equivalent to:
non = pd.DataFrame({'Col1': ['A', 'A', 'B', 'B', 'C'],
                    'Col2': ['K', 'L', 'M', 'N', 'O'],
                    'Col3': ['2', '2', '3', '3', '4'],
                    'Col4': ['RU', 'RU', 'EU', 'non-EU', 'CN'],
                  '_merge': ['df1', 'df2', 'df1', 'df2', 'df2']
                          })
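As a side note, since the `_merge` indicator column is a Categorical, the relabeling can also be done in one pass with `rename_categories` instead of two chained `str.replace` calls (an alternative sketch, not required for the question):

```python
import pandas as pd

df1 = pd.DataFrame({'Col1': ['A', 'A', 'B'],
                    'Col2': ['J', 'K', 'M'],
                    'Col3': ['1', '2', '3'],
                    'Col4': ['UA', 'RU', 'EU']})
df2 = pd.DataFrame({'Col1': ['A', 'A', 'B', 'C'],
                    'Col2': ['J', 'L', 'N', 'O'],
                    'Col3': ['1', '2', '3', '4'],
                    'Col4': ['UA', 'RU', 'non-EU', 'CN']})

non = df1.merge(df2, indicator=True, how='outer').query('_merge != "both"')
# Rename the categorical labels in a single call; 'both' is left unchanged.
non['_merge'] = non['_merge'].cat.rename_categories({'left_only': 'df1',
                                                     'right_only': 'df2'})
```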

Then I want to find out which column(s) cause this mismatch, taking the uniqueness keys into account.

So, for instance, to know that Col2 causes the mismatch for keys Col1 = 'A' and Col3 = '2', and that [Col2, Col4] cause the mismatch for keys Col1 = 'B' and Col3 = '3'. Something like

# note: dict keys must be hashable, hence tuples rather than lists
result = {('Col2',): pd.DataFrame({'Col1': ['A', 'A'],
                                   'Col3': ['2', '2'],
                                   'Col2': ['K', 'L']
                                   }),
  ('Col2', 'Col4'): pd.DataFrame({'Col1': ['B', 'B'],
                                  'Col3': ['3', '3'],
                                  'Col2': ['M', 'N'],
                                  'Col4': ['EU', 'non-EU']
                                  })
          }

I am open to any way that helps to quickly get an overview of columns causing mismatch.

One can then tackle the issues from the least complex (only one column) to the more complex (several columns).


Solution

  • Building on the merged DataFrame that you've built, we can use groupby + transform('size') to create a boolean mask that keeps only the rows whose matching_key appears more than once (this filters out keys that exist in only one DataFrame).

    Then build the desired dictionary with a groupby + dict comprehension, where the keys are the names of the columns that don't share a value across the two DataFrames (found using nunique).

    merged = df1.merge(df2, indicator=True, how='outer').query("_merge!='both'")
    tmp = merged[merged.groupby(matching_key)['_merge'].transform('size') > 1].drop(columns='_merge')
    out = {tuple(df0.columns[df0.nunique() > 1]): df0 for _, df0 in tmp.groupby(matching_key)}
    

    Output:

    {
     ('Col2',):
     Col1 Col2 Col3 Col4
     1    A    K    2   RU
     3    A    L    2   RU,
    
     ('Col2', 'Col4'):
     Col1 Col2 Col3    Col4
     2    B    M    3      EU
     4    B    N    3  non-EU
    }
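To then work through the issues from least to most complex, as the question suggests, the dictionary keys can be sorted by length. A self-contained sketch of the full pipeline plus that ordering:

```python
import pandas as pd

df1 = pd.DataFrame({'Col1': ['A', 'A', 'B'],
                    'Col2': ['J', 'K', 'M'],
                    'Col3': ['1', '2', '3'],
                    'Col4': ['UA', 'RU', 'EU']})
df2 = pd.DataFrame({'Col1': ['A', 'A', 'B', 'C'],
                    'Col2': ['J', 'L', 'N', 'O'],
                    'Col3': ['1', '2', '3', '4'],
                    'Col4': ['UA', 'RU', 'non-EU', 'CN']})
matching_key = ['Col1', 'Col3']

# Rows present in only one frame, keyed to their origin.
merged = df1.merge(df2, indicator=True, how='outer').query("_merge!='both'")
# Keep only keys that appear on both sides (size > 1), i.e. true mismatches.
tmp = merged[merged.groupby(matching_key)['_merge'].transform('size') > 1].drop(columns='_merge')
# Map each tuple of offending columns to its group of rows.
out = {tuple(df0.columns[df0.nunique() > 1]): df0
       for _, df0 in tmp.groupby(matching_key)}

# Process mismatches from the fewest offending columns to the most.
for cols in sorted(out, key=len):
    print(f"{len(cols)} mismatching column(s): {cols}")
    print(out[cols], end='\n\n')
```

Note that if two different key groups produce the same tuple of offending columns, the later group overwrites the earlier one in the dict; collecting the groups into a list per tuple would avoid that.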