I have a table of company data that links subsidiary to parent companies as shown in the left hand side table of the screenshot. I need to transform the data into the table on the right hand side of the screenshot. This requires tracing through the two columns of the table and making the link between individual rows.
So far the only thing I have tried is to join the table against itself recursively.. but I am thinking that some kind of tree structure here would make more sense? I.e. creating a branch of all connected companies with the 'ultimate parent company' as the trunk?
These concepts are new to me so appreciate any input
Reproducible input:
df = pd.DataFrame({'Subsidiary Company': ['Company B', 'Company C', 'Company D', 'Company 2', 'Company 3'],
'Parent Company': ['Company A', 'Company B', 'Company C', 'Company 1', 'Company 2']})
You can use networkx
to form a directed graph, then loop over the paths with all_simple_paths
:
import numpy as np
import networkx as nx
# create the directed graph
G = nx.from_pandas_edgelist(df, source='Subsidiary Company',
target='Parent Company',
create_using=nx.DiGraph)
# find roots (final level companies)
roots = {v for v, d in G.in_degree() if d == 0}
# {'Company 3', 'Company D'}
# find leaves (ultimate parents)
leaves = {v for v, d in G.out_degree() if d == 0}
# {'Company 1', 'Company A'}
# function to rename the columns
# 0 is the subsidiary company
# inf is the ultimate parent
# other numbers are intermediates
def renamer(x):
if x == 0:
return 'Subsidiary Company'
if np.isfinite(x):
return f'Intermediate Parent Company {int(x)}'
return 'Ultimate Parent Company'
# find the connected nodes
# the roots/leaves for each group
# iterate over the paths between each root/node combination
# create the sliding paths (with range)
# convert to DataFrame
out = (
pd.DataFrame(
[
dict(enumerate(p[i-1:-1])) | {float('inf'): p[-1]}
for c in nx.weakly_connected_components(G)
for r in c & roots
for l in c & leaves
for p in nx.all_simple_paths(G, r, l)
for i in range(len(p)-1, 0, -1)
]
)
.sort_index(axis=1)
.rename(columns=renamer)
)
Output:
Subsidiary Company Intermediate Parent Company 1 Intermediate Parent Company 2 Ultimate Parent Company
0 Company B NaN NaN Company A
1 Company C Company B NaN Company A
2 Company D Company C Company B Company A
3 Company 2 NaN NaN Company 1
4 Company 3 Company 2 NaN Company 1