python pandas xor

Get XOR between 2 dataframes


How can I get the symmetric difference (XOR) between two pandas DataFrames, i.e. the rows that appear in only one of them, labelled with the frame they came from?

import pandas as pd
# Two input frames to compare
a = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
b = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'z', '']})
# Expected result: rows found in only one frame, tagged with their source
result = pd.DataFrame({'a': [2, 2, 3], 'b': ['y', 'z', ''], 'source': ['a', 'b', 'b']})

Visual (a, b, and the expected result, in that order):

   a  b
0  1  x
1  2  y

   a  b
0  1  x
1  2  z
2  3   

   a  b source
0  2  y      a
1  2  z      b
2  3         b

My attempt below gives the expected result but seems too complicated:

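# Rows only in a: b is added twice, so every row of b becomes a duplicate and is dropped
diff_a = pd.concat([a, b, b]).drop_duplicates(keep=False)
diff_a['source'] = 'a'

# Rows only in b: same trick with the roles swapped
diff_b = pd.concat([b, a, a]).drop_duplicates(keep=False)
diff_b['source'] = 'b'

# Stack both one-sided differences and renumber the rows
out = pd.concat([diff_a, diff_b]).reset_index(drop=True)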

Solution

  • If neither input contains duplicate rows, a single concat/drop_duplicates step is enough:

    out = (pd.concat([a, b], keys=['a', 'b'], names=['source'])
             .drop_duplicates(keep=False)
             .reset_index(0)
           )
    

    Output:

      source  a  b
    1      a  2  y
    1      b  2  z
    2      b  3
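
If an input may contain duplicate rows, drop_duplicates(keep=False) would also discard rows that are repeated within a single frame. One possible alternative is an outer merge with the indicator option, keeping only the rows that are not found in both frames. This is a sketch under that assumption, not part of the answer above. The names merged and labels are illustrative only.

```python
import pandas as pd

a = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
b = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'z', '']})

# Outer merge on all shared columns; the indicator column (named 'source'
# here) records 'left_only', 'right_only' or 'both' for each row.
merged = a.merge(b, how='outer', indicator='source')

# Keep the rows that exist in only one of the two frames.
out = merged[merged['source'] != 'both'].copy()

# Illustrative relabelling: translate the indicator values to the frame names.
labels = {'left_only': 'a', 'right_only': 'b'}
out['source'] = out['source'].astype(str).map(labels)

# Renumber the rows so the index runs 0..n-1 like the expected result.
out = out.reset_index(drop=True)
print(out)
```

For the sample frames this yields the three rows of the expected result, with source as the last column. Unlike the concat approach, a row repeated inside only one frame is kept (once per repetition) rather than dropped.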