How do I get the symmetric difference between two pandas DataFrames?
import pandas as pd
a = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
b = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'z', '']})
result = pd.DataFrame({'a': [2, 2, 3], 'b': ['y', 'z', ''], 'source': ['a', 'b', 'b']})
Visually, the inputs are
a:
   a  b
0  1  x
1  2  y
b:
   a  b
0  1  x
1  2  z
2  3
and the expected result is:
   a  b source
0  2  y      a
1  2  z      b
2  3         b
My attempted solution works, but it seems too complicated:
diff_a = pd.concat([a, b, b]).drop_duplicates(keep=False)
diff_a['source'] = 'a'
diff_b = pd.concat([b, a, a]).drop_duplicates(keep=False)
diff_b['source'] = 'b'
out = pd.concat([diff_a, diff_b]).reset_index(drop=True)
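For readers wondering why the attempt concatenates the other frame twice: every row of that frame then occurs at least twice, so drop_duplicates(keep=False) removes all of them and leaves only the rows unique to the first frame. A minimal, self-contained sketch of the same idea follows; the helper name only_in is hypothetical and not part of pandas.

```python
import pandas as pd

a = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
b = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'z', '']})

def only_in(left, right, label):
    # Hypothetical helper: rows of `left` that do not occur in `right`.
    # `right` is concatenated twice, so each of its rows occurs at least
    # twice and keep=False drops every one of them.
    diff = pd.concat([left, right, right]).drop_duplicates(keep=False)
    return diff.assign(source=label)

out = pd.concat([only_in(a, b, 'a'), only_in(b, a, 'b')]).reset_index(drop=True)
print(out)
```

Note that a row repeated within the first frame itself is also dropped by keep=False, so this sketch assumes each frame has no duplicate rows of its own.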
If your inputs don't contain duplicate rows, you could use a single concat/drop_duplicates step:
out = (pd.concat([a, b], keys=['a', 'b'], names=['source'])
.drop_duplicates(keep=False)
.reset_index(0)
)
Output:
  source  a  b
1      a  2  y
1      b  2  z
2      b  3
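To illustrate the "no duplicates" condition, here is a small sketch of what happens when one input repeats a row, along with one possible workaround: de-duplicating each input before the concat. The function name sym_diff and the modified sample data are assumptions made for illustration, and the workaround collapses repeated rows to a single copy, which may or may not be what you want.

```python
import pandas as pd

def sym_diff(a, b):
    # Hypothetical wrapper: de-duplicate each input first, so that
    # drop_duplicates(keep=False) only removes rows shared by both frames.
    return (pd.concat([a.drop_duplicates(), b.drop_duplicates()],
                      keys=['a', 'b'], names=['source'])
              .drop_duplicates(keep=False)
              .reset_index(level='source')   # move the 'source' key into a column
              .reset_index(drop=True))

# (2, 'y') is repeated within a and does not occur in b
a = pd.DataFrame({'a': [1, 2, 2], 'b': ['x', 'y', 'y']})
b = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'z', '']})

# Single concat step without de-duplication: both copies of (2, 'y') are lost
naive = (pd.concat([a, b], keys=['a', 'b'], names=['source'])
           .drop_duplicates(keep=False)
           .reset_index(level='source'))

fixed = sym_diff(a, b)
print(naive)
print(fixed)
```

Here naive contains only the two rows from b, while fixed also keeps one copy of (2, 'y') tagged with source a.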