I have a dataframe as the following, showing the relationship of different entities in each row.
Child | Parent | Ult_Parent | Full_Family |
---|---|---|---|
A032 | A001 | A039 | A001, A032, A039, A040, A041, A043, A043, A045, A046 |
In the "Full_Family" column, it's showing the correct hierarchy for the whole family tree from small to big, not just entities in that specific row. But to be aware of, not every entity in the "Full_Family" will appear in the "Child/Parent/Ult_Parent", as the "Full_Family" data is coming from the other source.
Here I have 2 problems need to solve:
Below is the example and ideal outcome:
Child | Parent | Ult_Parent | Full_Family | Correct_Order | Correct_Ult_Parent_per_Family |
---|---|---|---|---|---|
A032 | A001 | A039 | A001, A032, A039, A040, A041, A043, A043, A045, A046 | A001, A032, A039 | A043 |
A001 | A043 | A039 | A001, A032, A039, A040, A041, A043, A043, A045, A046 | A001, A039, A043 | A043 |
Although A046 is the ultimate parent in this family, but it does not appear in this dataframe's Child/Parent/Ult_Parent, so the correct ultimate parent per family is A043 in this case.
Appreciate the help.
IIUC, assuming this dataframe as input:
import pandas as pd
data = {
"Child": ["A032", "A001"],
"Parent": ["A001", "A043"],
"Ult_Parent": ["A039", "A039"],
"Full_Family": [
"A001, A032, A039, A040, A041, A043, A043, A045, A046",
"A001, A032, A039, A040, A041, A043, A043, A045, A046",
],
}
df = pd.DataFrame(data)
Child Parent Ult_Parent Full_Family
0 A032 A001 A039 A001, A032, A039, A040, A041, A043, A043, A045...
1 A001 A043 A039 A001, A032, A039, A040, A041, A043, A043, A045...
You can use this approach:
df["Correct_Order"] = df.apply(
lambda row: ", ".join(sorted([row["Parent"], row["Child"], row["Ult_Parent"]])),
axis=1,
)
df["Correct_Ult_Parent_per_Family"] = (
df[["Parent", "Child", "Ult_Parent"]].max(axis=1).max()
)
Child Parent Ult_Parent Full_Family Correct_Order Correct_Ult_Parent_per_Family
0 A032 A001 A039 A001, A032, A039, A040, A041, A043, A043, A045... A001, A032, A039 A043
1 A001 A043 A039 A001, A032, A039, A040, A041, A043, A043, A045... A001, A039, A043 A043
If 'Full_Family'
is not necessarily in ascending order, and you want to respect its order, you can define a custom key to sorted
.
For example, if A039
comes before A032
in 'Full_Family'
in the first row:
data = {
"Child": ["A032", "A001"],
"Parent": ["A001", "A043"],
"Ult_Parent": ["A039", "A039"],
"Full_Family": [
"A001, A039, A032, A040, A041, A043, A043, A045, A046",
"A001, A032, A039, A040, A041, A043, A043, A045, A046",
],
}
df = pd.DataFrame(data)
Using a custom key:
df["Correct_Order"] = df.apply(
lambda row: ", ".join(
sorted(
[row["Parent"], row["Child"], row["Ult_Parent"]],
key=lambda x: {
val: idx for idx, val in enumerate(row["Full_Family"].split(", "))
}[x],
)
),
axis=1,
)
df["Correct_Ult_Parent_per_Family"] = df["Correct_Order"].str.split().str[-1].max()
Child Parent Ult_Parent Full_Family Correct_Order Correct_Ult_Parent_per_Family
0 A032 A001 A039 A001, A039, A032, A040, A041, A043, A043, A045... A001, A039, A032 A043
1 A001 A043 A039 A001, A032, A039, A040, A041, A043, A043, A045... A001, A039, A043 A043