pythonpandasdataframetreebranch

Data transformation on pandas dataframe to connect related rows based on shared values


Visual representation of transformation that needs to occur

I have a table of company data that links subsidiary to parent companies as shown in the left hand side table of the screenshot. I need to transform the data into the table on the right hand side of the screenshot. This requires tracing through the two columns of the table and making the link between individual rows.

So far the only thing I have tried is to join the table against itself recursively.. but I am thinking that some kind of tree structure here would make more sense? I.e. creating a branch of all connected companies with the 'ultimate parent company' as the trunk?

These concepts are new to me so appreciate any input

Reproducible input:

df = pd.DataFrame({'Subsidiary Company': ['Company B', 'Company C', 'Company D', 'Company 2', 'Company 3'],
                   'Parent Company': ['Company A', 'Company B', 'Company C', 'Company 1', 'Company 2']})

Solution

  • You can use networkx to form a directed graph, then loop over the paths with all_simple_paths:

    import numpy as np
    import networkx as nx
    
    # create the directed graph
    G = nx.from_pandas_edgelist(df, source='Subsidiary Company',
                                target='Parent Company',
                                create_using=nx.DiGraph)
    
    # find roots (final level companies)
    roots = {v for v, d in G.in_degree() if d == 0}
    # {'Company 3', 'Company D'}
    
    # find leaves (ultimate parents)
    leaves = {v for v, d in G.out_degree() if d == 0}
    # {'Company 1', 'Company A'}
    
    # function to rename the columns
    # 0 is the subsidiary company
    # inf is the ultimate parent
    # other numbers are intermediates
    def renamer(x):
        if x == 0:
            return 'Subsidiary Company'
        if np.isfinite(x):
            return f'Intermediate Parent Company {int(x)}'
        return 'Ultimate Parent Company'
    
    # find the connected nodes
    # the roots/leaves for each group
    # iterate over the paths between each root/node combination
    # create the sliding paths (with range) 
    # convert to DataFrame
    out = (
        pd.DataFrame(
            [
                dict(enumerate(p[i-1:-1])) | {float('inf'): p[-1]}
                for c in nx.weakly_connected_components(G)
                for r in c & roots
                for l in c & leaves
                for p in nx.all_simple_paths(G, r, l)
                for i in range(len(p)-1, 0, -1)
            ]
        )
        .sort_index(axis=1)
        .rename(columns=renamer)
    )
    

    Output:

      Subsidiary Company Intermediate Parent Company 1 Intermediate Parent Company 2 Ultimate Parent Company
    0          Company B                           NaN                           NaN               Company A
    1          Company C                     Company B                           NaN               Company A
    2          Company D                     Company C                     Company B               Company A
    3          Company 2                           NaN                           NaN               Company 1
    4          Company 3                     Company 2                           NaN               Company 1