
python pandas deepdiff two yaml files and printing mismatch values


I have two YAML files that differ only slightly. I want to print two things: the differences in the first file compared to the second, and the differences in the second file compared to the first. Here is my code:

import yaml
import pandas as pd
from deepdiff import DeepDiff

with open(r'C:\Users\Project\Desktop\DRsystem\stars4.yaml','r') as file:
    df1 = pd.json_normalize(yaml.load(file, Loader=yaml.FullLoader))

with open(r'C:\Users\Project\Desktop\DRsystem\stars5.yaml','r') as file:
    df2 = pd.json_normalize(yaml.load(file, Loader=yaml.FullLoader))

x = df1.to_dict()
print(x)
ddiff1 = DeepDiff(df1,df2)
print(ddiff1)
print("---------")
y = df2.to_dict()
print(y)
ddiff2 = DeepDiff(df2,df1)
print(ddiff2)

Output: The code above prints the differences, but only for items that were added, i.e. anything that is missing entirely from one of the YAML files. That part is fine, but it does not print entries that are present in both files with a slightly different value. This is easier to see in the attached screenshots (both YAML files and my output).

Query 1: Why does it print only {'root3': 'denmark.enabled'}} and not {0: True}}?

Query 2: canada is present in both files, but in one file it is enabled:true and in the other it is enabled:false. Why doesn't the diff show that it is true in one file and false in the other?

Yaml 1

Yaml 2

Output


Solution

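  • Query 1: Why does it print only {'root3': 'denmark.enabled'}} and not {0: True}}?

    {0: True} is part of the output of the pandas method to_dict, which returns a nested dictionary of the form {column: {row index: value}}. Here it means that row 0 of that column has the value True. It comes from print(x) and print(y) and has nothing to do with DeepDiff.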

    Query 2: canada is present in both files, but in one file it is enabled:true and in the other it is enabled:false. Why doesn't the diff show that it is true in one file and false in the other?

    Although deepdiff claims to find "differences of dictionaries, iterables, strings and other objects", it doesn't look inside pandas DataFrames. It simply iterates over the DataFrame, and iterating over a DataFrame yields the column headers. DeepDiff therefore finds differences in the column headers, such as a column that exists in only one of the DataFrames, but it never iterates over the values in the columns, so it won't notice any changes in column values.

    A possible workaround is to compare the dictionary representations of the dataframes:

    print(DeepDiff(df1.to_dict(), df2.to_dict()))