I have 2 dataframes that captured the hierarchy of the same dataset. Df1 is more complete compared to Df2, so I want to use Df1 as the standard to analyze if the hierarchy in Df2 is correct. However, both dataframes show the hierarchy in a bad way so it's hard to know the complete structure row by row.
Eg. Company A may have subsidiary: B, C, D, E and the relationship is A owns B owns C owns D owns E. In Df1, it may show:
| Ultimate Parent | Parent | Child |
| --------------- | ------ |-------|
| A | B | C |
| B | C | D | --> new
| C | D | E |
So if you break down to analyze row by row, the same entity can be shown as "Ultimate Parent" or "Child" at the same time, which makes it complicated.
On the other hand, as Df2 is incomplete, so it won't have all the data (A, B, C, D, E). It will only contain partial data, eg. A, D, E in this case, so the dataframe will look like this
| Ultimate Parent | Parent | Child |
| --------------- | ------ |-------|
| A | D | E |
Now I want to (1) use Df1 to get the correct/complete hierarchy (2) compare and identify the gap between Df1 and Df2. The logic is as following:
If A owns B owns C owns D owns E and Df1 looks like this
| Ultimate Parent | Parent | Child |
| --------------- | ------ |-------|
| A | B | C |
| C | D | E |
I want to add 1 column to put all the related entities together and in order from ultimate parent to child
| Ultimate Parent | Parent | Child | Hierarchy |
| --------------- | ------ |-------|-------------|
| A | B | C |A, B, C, D, E|
| C | D | E |A, B, C, D, E|
And then compare this Df1 with Df2 and add a column to Df2 to identify the gap. The most ideal (but optional) situation is to have another column stating the reason if it's wrong.
| Ultimate Parent | Parent | Child | Right/Wrong| Reason |
| --------------- | ------ |-------|------------|-----------------|
| A | D | E | Right | |
| C | B | A | Wrong | wrong hierarchy |
| C | A | B | Wrong | wrong hierarchy | --> new
| G | A | B | Wrong | wrong entities | --> new
| A | F | G | Wrong | wrong entities |
I have tried multiple string matching methods, but I'm stuck in the step and idea where I think order matters but I don't know how to compare strings in order when they're related but scattered in different rows.
Basically, you'll need to build a network graph of df1 to get a comprejhension map of the hierarchies. Once this is done, you need to compare the hierarchies of df2 with those of df1 and finally validate. To do so, you can define function. You'll create a new column hierarchies
to df1 and Right/Wrong
, Reason
to df2. .
import pandas as pd
import networkx as nx
data1 = {
'Ultimate Parent': ['A', 'C'],
'Parent': ['B', 'D'],
'Child': ['C', 'E']
}
df1 = pd.DataFrame(data1)
data2 = {
'Ultimate Parent': ['A', 'C', 'A'],
'Parent': ['D', 'B', 'F'],
'Child': ['E', 'A', 'G']
}
df2 = pd.DataFrame(data2)
G = nx.DiGraph()
for _, row in df1.iterrows():
G.add_edge(row['Parent'], row['Child'])
if row['Ultimate Parent'] != row['Parent']:
G.add_edge(row['Ultimate Parent'], row['Parent'])
def complete_hierarchy(node, graph):
descendants = nx.descendants(graph, node)
descendants.add(node)
return ', '.join(sorted(descendants))
df1['Hierarchy'] = df1['Ultimate Parent'].apply(lambda x: complete_hierarchy(x, G))
def validate_row(row, hierarchy_df, graph):
filtered_hierarchy = hierarchy_df[hierarchy_df['Ultimate Parent'] == row['Ultimate Parent']]
if filtered_hierarchy.empty:
return pd.Series(["Wrong", "wrong entities"])
full_hierarchy = filtered_hierarchy.iloc[0]['Hierarchy']
hierarchy_elements = set(full_hierarchy.split(', '))
if set([row['Parent'], row['Child']]).issubset(graph.nodes()):
if row['Parent'] not in hierarchy_elements or row['Child'] not in hierarchy_elements:
return pd.Series(["Wrong", "wrong hierarchy"])
elif f"{row['Parent']}, {row['Child']}" not in full_hierarchy:
return pd.Series(["Wrong", "wrong hierarchy"])
else:
return pd.Series(["Right", ""])
else:
return pd.Series(["Wrong", "wrong entities"])
df2[['Right/Wrong', 'Reason']] = df2.apply(lambda row: validate_row(row, df1, G), axis=1)
print("Df1 - Complete Hierarchy:")
print(df1)
print("\nDf2 - Validation Results:")
print(df2)
Which gives you
Df1 - Complete Hierarchy:
Ultimate Parent Parent Child Hierarchy
0 A B C A, B, C, D, E
1 C D E C, D, E
Df2 - Validation Results:
Ultimate Parent Parent Child Right/Wrong Reason
0 A D E Right
1 C B A Wrong wrong hierarchy
2 A F G Wrong wrong entities