I am having two dataframes which have exact same data structure. I need to compare them to see if they have any difference in records due to any column value being different.
I am using below code to do it and it works perfectly to report if things tie or untie between these two dataframes.
df=pd.concat([df1, df2])
df = df.reset_index(drop=True)
df_gpby = df.groupby(list(df.columns))
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
if df.reindex(idx).empty:
print('everything is good.')
else:
print('things do not tie out')
df.reindex(idx).to_csv('diff.csv', index=False)
Though diff.csv
tells me what all is missing or is different, what it doesn't tell is which record belonged to which dataframe initially and which column values differ between the initial dataframes for a given record. Is there a way to somehow get this information in my final output ?
Sample dataframes.
Name | Age| Gender
0| Naxi | 27 | Male
1| Karan| 25 | Male
2| Tanya| 27 | Female
Name | Age| Gender
0| Naxi | 27 | Male
1| Tanya| 27 | Female
2| Karan| 24 | Male
output I want
Name | Age| Gender | Dataframe
Karan| 24 | Male | df2
Karan| 25 | Male | df1
You can add 1 column to each dataframe and then ignore that column while dropping duplicates (after pd.concat).
df1['Dataframe'] = 'df1'
df2['Dataframe'] = 'df2'
df=pd.concat([df1, df2])
diff_df = df.drop_duplicates(subset=['Name', 'Age', 'Gender'], keep=False)
print(diff_df)
Output -
Name Age Gender Dataframe
2 Karan 24 Male df1
1 Karan 25 Male df2
Index in the output will help you to locate the correct row in the initial dataframe.