Consider the following data frame:
import numpy as np
import pandas as pd
df = pd.DataFrame(
    {
        "main": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "component": [
            [1, 2],
            [np.nan],
            [3, 8],
            [np.nan],
            [1, 5, 6],
            [np.nan],
            [7],
            [np.nan],
            [9, 10],
            [np.nan],
            [np.nan],
        ],
    }
)
The column main represents an approach. Each approach consists of components. A component can itself be an approach, in which case it is called a sub-approach.
I want to find all connected sub-approaches/components for a certain approach.
Suppose, for instance, I want to find all connected sub-approaches/components for the main approach '0'. Then, my desired output would look like this:
target = pd.DataFrame({
    "main": [0, 0, 2, 2, 8, 8],
    "component": [1, 2, 3, 8, 9, 10]
})
Ideally, I want to be able to just choose the approach and then get all sub-connections.
I am convinced that there is a smart way to do this using networkx. Any hint is appreciated.
Ultimately, I want to create a graph that looks somewhat like this (for approach 0):
Additional information:
You can explode the data frame and then drop the rows whose main entry is a pure component (an approach that has no components of its own):
df_exploded = df.explode(column="component").dropna(subset="component")
The graph can be constructed as follows:
import networkx as nx
import graphviz
G = nx.Graph()
G.add_edges_from([(i, j) for i, j in target.values])
graph_attr = dict(rankdir="LR", nodesep="0.2")
g = graphviz.Digraph(graph_attr=graph_attr)
for k, v in G.nodes.items():
    g.node(str(k), shape="box", style="filled", height="0.35")
for n1, n2 in G.edges:
    g.edge(str(n2), str(n1))
g
You can use nx.dfs_edges on a directed graph built from the exploded data frame:
edges = df.explode(column='component').dropna(subset='component')
G = nx.from_pandas_edgelist(edges, source='main', target='component', create_using=nx.DiGraph)
target = pd.DataFrame(nx.dfs_edges(G, 0), columns=['main', 'component'])
Output:
>>> target
   main  component
0     0          1
1     0          2
2     2          3
3     2          8
4     8          9
5     8         10
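If you want to "just choose the approach", the steps above can be wrapped in a small helper. This is only a sketch: the name sub_connections is made up for illustration, and it assumes the frame has the main and component columns from the question, with NaN marking "no component".

```python
import networkx as nx
import numpy as np
import pandas as pd


def sub_connections(df: pd.DataFrame, root) -> pd.DataFrame:
    """Return every (main, component) edge reachable from `root` (hypothetical helper)."""
    # One row per (main, component) pair; rows with NaN have no component.
    edges = df.explode("component").dropna(subset=["component"])
    G = nx.from_pandas_edgelist(
        edges, source="main", target="component", create_using=nx.DiGraph
    )
    if root not in G:
        # The approach never appears in an edge, so there is nothing to follow.
        return pd.DataFrame(columns=["main", "component"])
    # Depth-first traversal along the edge direction main -> component.
    return pd.DataFrame(list(nx.dfs_edges(G, root)), columns=["main", "component"])


nan = np.nan
df = pd.DataFrame(
    {
        "main": list(range(11)),
        "component": [
            [1, 2], [nan], [3, 8], [nan], [1, 5, 6], [nan],
            [7], [nan], [9, 10], [nan], [nan],
        ],
    }
)

result = sub_connections(df, 0)
print(result)
```

Calling sub_connections(df, 4) instead follows approach 4 and its sub-approach 6, and a leaf such as 10 gives an empty frame.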
To extract the subgraph, use:
H = G.edge_subgraph(nx.dfs_edges(G, 0))
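One caveat, as far as I can tell: dfs_edges only yields the edges of the DFS tree, so if a component were shared by two sub-approaches reachable from the same root, one of those edges would be missing from H. A possible alternative is the induced subgraph on all nodes reachable from the root, sketched below. The edge list is typed in by hand here to keep the example short; it is meant to be equivalent to the exploded frame above.

```python
import networkx as nx

# Edge list equivalent to the exploded data frame from the question.
G = nx.DiGraph(
    [(0, 1), (0, 2), (2, 3), (2, 8), (4, 1), (4, 5), (4, 6), (6, 7), (8, 9), (8, 10)]
)
root = 0

# Tree edges only: exactly what dfs_edges yields.
H_tree = G.edge_subgraph(list(nx.dfs_edges(G, root)))

# Induced subgraph on everything reachable from root: keeps every edge
# between those nodes, including ones DFS skips because the target
# was already visited.
H_full = G.subgraph(nx.descendants(G, root) | {root})

print(sorted(H_tree.edges))
print(sorted(H_full.edges))
```

For the data in the question both subgraphs contain the same six edges, so either works; they only differ once a component is shared between reachable approaches.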