pythonpandasdataframediff

Smarter way to create diff between two pandas dataframes?


I have two pandas dataframes which represent a directory structure with file hashes like

import pandas as pd

dir_old = pd.DataFrame([
    {"Filepath": "dir1/file1", "Hash": "hash1"},
    {"Filepath": "dir1/file2", "Hash": "hash2"},
    {"Filepath": "dir2/file3", "Hash": "hash3"},
])

dir_new = pd.DataFrame([
    # {"Filepath": "dir1/file1", "Hash": "hash1"}, # deleted file
    {"Filepath": "dir1/file2", "Hash": "hash2"},
    {"Filepath": "dir2/file3", "Hash": "hash5"},  # changed file
    {"Filepath": "dir1/file4", "Hash": "hash4"},  # new file
])

The dir_new shows the content of the directory structure after some changes. To compare these two dataframes I use

df_merged = pd.merge(dir_new, dir_old, how='outer', indicator=True)
print(df_merged)

This will return

     Filepath   Hash      _merge
0  dir1/file1  hash1  right_only
1  dir1/file2  hash2        both
2  dir1/file4  hash4   left_only
3  dir2/file3  hash3  right_only
4  dir2/file3  hash5   left_only

It is easy to identify the right_only rows as deleted, both as unchanged and left_only as new files. However what to do about the modified file dir/file3 which appears twice as right_only and left_only? I did the following:

# The indicator columns _merge has categorical values. 
# We need to convert it to string to be able to add later a new value `modified`
df_merged["State"] = df_merged["_merge"].astype(str)
df_merged = df_merged.drop(columns=["_merge"])

# Identify the rows with duplicated filepath and only keep the new (left_only) ones
modified = df_merged[df_merged.duplicated(subset=["Filepath"], keep=False)]
keep = modified[modified["State"] == "left_only"]
drop = modified[modified["State"] == "right_only"]

# Rename the state of the new modified files to `changed` and drop the old duplicated row
df_merged.iloc[keep.index, df_merged.columns.get_loc("State")] = "changed"
df_dropped = df_merged.drop(drop.index)

# Finally rename the State for all the remaining rows
df_final = df_dropped.replace(to_replace=["right_only", "left_only", "both"],
                              value=["deleted", "created", "equal"]).reset_index(drop=True)

print(df_final) 

The output is

     Filepath   Hash    State
0  dir1/file1  hash1  deleted
1  dir1/file2  hash2    equal
2  dir1/file4  hash4  created
3  dir2/file3  hash5  changed

So it works. But is strikes me as a very complicated solution. Is there maybe a smarter way create a diff between these two dataframes and especially to identify the modified rows between dir_old and dir_new ?


Solution

  • I'd do it first by merging only by Filepath and then compare Hash_x/Hash_y and indicator accordingly (seems straightforward to me):

    df = dir_new.merge(dir_old, on="Filepath", how="outer", indicator=True)
    
    hash_changed = df["Hash_x"] != df["Hash_y"]
    
    deleted = df["_merge"] == "right_only"
    created = df["_merge"] == "left_only"
    changed = (df["_merge"] == "both") & hash_changed
    unchanged = (df["_merge"] == "both") & ~hash_changed
    
    df.loc[deleted, "Status"] = "deleted"
    df.loc[created, "Status"] = "created"
    df.loc[changed, "Status"] = "changed"
    df.loc[unchanged, "Status"] = "unchanged"
    
    df["Hash"] = df[["Hash_x", "Hash_y"]].bfill(axis=1)["Hash_x"]
    
    print(df[["Filepath", "Hash", "Status"]])
    

    Prints:

         Filepath   Hash     Status
    0  dir1/file1  hash1    deleted
    1  dir1/file2  hash2  unchanged
    2  dir1/file4  hash4    created
    3  dir2/file3  hash5    changed
    

    EDIT: Example of dynamic column names:

    df = dir_new.merge(dir_old, on="Filepath", how="outer", indicator=True)
    column_names = dir_new.columns.difference(["Filepath"])
    
    hash_changed = df["Hash_x"] != df["Hash_y"]
    
    deleted = df["_merge"] == "right_only"
    created = df["_merge"] == "left_only"
    changed = (df["_merge"] == "both") & hash_changed
    unchanged = (df["_merge"] == "both") & ~hash_changed
    
    df.loc[deleted, "Status"] = "deleted"
    df.loc[created, "Status"] = "created"
    df.loc[changed, "Status"] = "changed"
    df.loc[unchanged, "Status"] = "unchanged"
    
    for c in column_names:
        df[c] = df[[f"{c}_x", f"{c}_y"]].bfill(axis=1)[f"{c}_x"]
    
    print(df[["Filepath", *column_names, "Status"]])