pandas string dataframe hierarchy string-matching

Identifying Correct String Order in Pandas

I have a dataframe as the following, showing the relationship of different entities in each row.

Child	Parent	Ult_Parent	Full_Family
A032	A001	A039	A001, A032, A039, A040, A041, A043, A043, A045, A046

In the "Full_Family" column, it's showing the correct hierarchy for the whole family tree from small to big, not just entities in that specific row. But to be aware of, not every entity in the "Full_Family" will appear in the "Child/Parent/Ult_Parent", as the "Full_Family" data is coming from the other source.

Here I have 2 problems need to solve:

The Child/Parent/Ult_Parent order is not correct, how to based on the "Full_Family" column to identify the correct order?
If possible, Can I also identify the correct Ult_Parent for a family in the whole table instead of each row. This correct Ult_Parent needs to be the entity that has appeared in Child/Parent/Ult_Parent columns.

Below is the example and ideal outcome:

Child	Parent	Ult_Parent	Full_Family	Correct_Order	Correct_Ult_Parent_per_Family
A032	A001	A039	A001, A032, A039, A040, A041, A043, A043, A045, A046	A001, A032, A039	A043
A001	A043	A039	A001, A032, A039, A040, A041, A043, A043, A045, A046	A001, A039, A043	A043

Although A046 is the ultimate parent in this family, but it does not appear in this dataframe's Child/Parent/Ult_Parent, so the correct ultimate parent per family is A043 in this case.

Appreciate the help.

Solution

IIUC, assuming this dataframe as input:

import pandas as pd

data = {
    "Child": ["A032", "A001"],
    "Parent": ["A001", "A043"],
    "Ult_Parent": ["A039", "A039"],
    "Full_Family": [
        "A001, A032, A039, A040, A041, A043, A043, A045, A046",
        "A001, A032, A039, A040, A041, A043, A043, A045, A046",
    ],
}

df = pd.DataFrame(data)

  Child Parent Ult_Parent                                        Full_Family
0  A032   A001       A039  A001, A032, A039, A040, A041, A043, A043, A045...
1  A001   A043       A039  A001, A032, A039, A040, A041, A043, A043, A045...

You can use this approach:

df["Correct_Order"] = df.apply(
    lambda row: ", ".join(sorted([row["Parent"], row["Child"], row["Ult_Parent"]])),
    axis=1,
)

df["Correct_Ult_Parent_per_Family"] = (
    df[["Parent", "Child", "Ult_Parent"]].max(axis=1).max()
)

  Child Parent Ult_Parent                                        Full_Family     Correct_Order Correct_Ult_Parent_per_Family
0  A032   A001       A039  A001, A032, A039, A040, A041, A043, A043, A045...  A001, A032, A039                          A043
1  A001   A043       A039  A001, A032, A039, A040, A041, A043, A043, A045...  A001, A039, A043                          A043

If 'Full_Family' is not necessarily in ascending order, and you want to respect its order, you can define a custom key to sorted.

For example, if A039 comes before A032 in 'Full_Family' in the first row:

data = {
    "Child": ["A032", "A001"],
    "Parent": ["A001", "A043"],
    "Ult_Parent": ["A039", "A039"],
    "Full_Family": [
        "A001, A039, A032, A040, A041, A043, A043, A045, A046",
        "A001, A032, A039, A040, A041, A043, A043, A045, A046",
    ],
}

df = pd.DataFrame(data)

Using a custom key:

df["Correct_Order"] = df.apply(
    lambda row: ", ".join(
        sorted(
            [row["Parent"], row["Child"], row["Ult_Parent"]],
            key=lambda x: {
                val: idx for idx, val in enumerate(row["Full_Family"].split(", "))
            }[x],
        )
    ),
    axis=1,
)

df["Correct_Ult_Parent_per_Family"] = df["Correct_Order"].str.split().str[-1].max()

  Child Parent Ult_Parent                                        Full_Family     Correct_Order Correct_Ult_Parent_per_Family
0  A032   A001       A039  A001, A039, A032, A040, A041, A043, A043, A045...  A001, A039, A032                          A043
1  A001   A043       A039  A001, A032, A039, A040, A041, A043, A043, A045...  A001, A039, A043                          A043