pythonpandas

comparing two dataframes and finding what is missing from each dataframe in Python


I am having two dataframes which have exact same data structure. I need to compare them to see if they have any difference in records due to any column value being different.

I am using below code to do it and it works perfectly to report if things tie or untie between these two dataframes.

df=pd.concat([df1, df2])
df = df.reset_index(drop=True)
df_gpby = df.groupby(list(df.columns))
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
if df.reindex(idx).empty:
    print('everything is good.')
else:
    print('things do not tie out')
    df.reindex(idx).to_csv('diff.csv', index=False)
    

Though diff.csv tells me what all is missing or is different, what it doesn't tell is which record belonged to which dataframe initially and which column values differ between the initial dataframes for a given record. Is there a way to somehow get this information in my final output ?

Sample dataframes.

   Name | Age| Gender
0| Naxi | 27 | Male
1| Karan| 25 | Male
2| Tanya| 27 | Female


   Name | Age| Gender
0| Naxi | 27 | Male
1| Tanya| 27 | Female
2| Karan| 24 | Male

output I want

   Name | Age| Gender | Dataframe
   Karan| 24 | Male   | df2
   Karan| 25 | Male   | df1

Solution

  • You can add 1 column to each dataframe and then ignore that column while dropping duplicates (after pd.concat).

    df1['Dataframe'] = 'df1'
    df2['Dataframe'] = 'df2'
    df=pd.concat([df1, df2])
    diff_df =  df.drop_duplicates(subset=['Name', 'Age', 'Gender'], keep=False)
    print(diff_df)
    

    Output -

        Name  Age Gender Dataframe
    2  Karan   24   Male       df1
    1  Karan   25   Male       df2
    

    Index in the output will help you to locate the correct row in the initial dataframe.