I'm working with a dataset where I want to assign a sub_graph ID to user interactions. Each row in the data represents a directed edge between an actor_user_id and a related_user_id.
I want to compute a sub_graph ID such that:
Rows belong to the same sub_graph if they are connected (even indirectly) through shared users.
However, if two edges only share the same actor_user_id, they should not be grouped unless their related_user_ids are also connected.
Here’s a simplified example:
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt

# Dummy data
edges = pd.DataFrame({
    'properties_id': ['A', 'A', 'A', 'A'],
    'global_journey_id': ['A1', 'A1', 'A1', 'A1'],
    'actor_user_id': ['abc', 'abc', 'pat', 'abc'],
    'related_user_id': ['def', 'efg', 'def', 'lal'],
})
Expected output:
actor_user_id  related_user_id  sub_graph
abc            def              1
pat            def              1
abc            efg              2
abc            lal              3
I've tried using NetworkX like this:
G = nx.DiGraph()
G.add_edges_from(zip(edges['actor_user_id'], edges['related_user_id']))
components = list(nx.weakly_connected_components(G))
But this gives me only one connected component because abc is a common node across many edges.
Question:
How can I build logic that ensures subgraphs are only formed when there is a true "shared connection", not just a shared actor_user_id?
Is there a way to force the graph to disconnect branches that don’t have overlapping related_user_ids?
Any idea or trick to break these graphs down properly (even custom component-grouping logic) would be much appreciated.
Edit: Thanks to Daniel Raphael's comment below, I was able to get working code. It traverses from related_user_id to actor_user_id rather than the other way around.
import pandas as pd
from collections import defaultdict, deque

# Sample input
data = {
    'properties_id': ['A', 'A', 'A', 'A'],
    'global_journey_id': ['A1', 'A1', 'A1', 'A1'],
    'actor_user_id': ['abc', 'abc', 'pat', 'abc'],
    'related_user_id': ['def', 'efg', 'def', 'lal'],
}
df = pd.DataFrame(data)

# Container for results
results = []

# Group by properties_id and global_journey_id
grouped = df.groupby(['properties_id', 'global_journey_id'])

for (prop_id, journey_id), group in grouped:
    # Step 1: Build directional graph: related_user_id → list of actor_user_id
    graph = defaultdict(list)
    for _, row in group.iterrows():
        graph[row['related_user_id']].append(row['actor_user_id'])

    # Step 2: Build edge list (frozenset for undirected identity)
    edges = [frozenset([row['actor_user_id'], row['related_user_id']]) for _, row in group.iterrows()]
    edge_set = set(edges)

    # Step 3: Traverse the directed graph to find connected edge groups
    visited_edges = set()
    edge_to_subgraph = {}
    subgraph_id = 1

    for edge in edge_set:
        if edge in visited_edges:
            continue
        node1, node2 = list(edge)
        start_node = node2 if node2 in graph else node1
        if start_node not in graph:
            continue
        queue = deque([start_node])
        connected_edges = set()
        while queue:
            current = queue.popleft()
            for target in graph.get(current, []):
                candidate_edge = frozenset([current, target])
                if candidate_edge in edge_set and candidate_edge not in visited_edges:
                    visited_edges.add(candidate_edge)
                    connected_edges.add(candidate_edge)
                    if target in graph:
                        queue.append(target)
        for e in connected_edges:
            edge_to_subgraph[e] = subgraph_id
        subgraph_id += 1

    # Step 4: Map subgraph ID to each row in the group
    for _, row in group.iterrows():
        edge_key = frozenset([row['actor_user_id'], row['related_user_id']])
        subgraph = edge_to_subgraph.get(edge_key, None)
        results.append({
            'properties_id': row['properties_id'],
            'global_journey_id': row['global_journey_id'],
            'actor_user_id': row['actor_user_id'],
            'related_user_id': row['related_user_id'],
            'sub_graph': subgraph
        })

# Final result
final_df = pd.DataFrame(results)
final_df = final_df.sort_values(by=['properties_id', 'global_journey_id', 'sub_graph'])
print(final_df)
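For comparison, here is a shorter sketch of the rule the code above appears to implement: rows that share a related_user_id are grouped, and an actor links two groups only when it is itself someone's related_user_id. It uses a small union-find instead of a traversal, because the loop above iterates over a set and starts from an arbitrary endpoint of each edge, so on chained data its result can depend on iteration order. The helper name assign_sub_graphs is hypothetical, and the rule itself is an assumption read off the expected output rather than something confirmed for every case.

```python
import pandas as pd


def assign_sub_graphs(group: pd.DataFrame) -> pd.Series:
    """Label the rows of one journey (hypothetical helper, not from the original code)."""
    # Every related_user_id starts out as its own group.
    parent = {r: r for r in group['related_user_id']}

    def find(x):
        # Walk up to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Assumption: an actor merges groups only if it is also a related_user_id.
    for actor, related in zip(group['actor_user_id'], group['related_user_id']):
        if actor in parent:
            parent[find(actor)] = find(related)

    roots = group['related_user_id'].map(find)
    # Number groups in order of first appearance so the IDs are reproducible.
    return pd.Series(pd.factorize(roots)[0] + 1, index=group.index)


df = pd.DataFrame({
    'properties_id': ['A', 'A', 'A', 'A'],
    'global_journey_id': ['A1', 'A1', 'A1', 'A1'],
    'actor_user_id': ['abc', 'abc', 'pat', 'abc'],
    'related_user_id': ['def', 'efg', 'def', 'lal'],
})

keys = ['properties_id', 'global_journey_id']
# Label each journey separately; assignment aligns on the original row index.
parts = [assign_sub_graphs(g) for _, g in df.groupby(keys, sort=False)]
df['sub_graph'] = pd.concat(parts)
print(df)
```

On the sample data this puts the two def rows in one group and gives efg and lal their own groups. IDs restart at 1 for each (properties_id, global_journey_id) pair, as in the loop above.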
Here is an idea: flip the logic.
Don’t connect actor → related in the graph.
Instead, connect related_user_ids that co-occur under the same actor_user_id.
Then, build a graph only of related_user_ids and their indirect connections via shared actors.
Finally, map those groupings back to the original rows.
Why this works
This approach ensures:
Two interactions are in the same subgraph only if their related_user_ids are connected through common co-occurrence under actors.
A shared actor_user_id does not cause grouping unless the related_user_ids are also related.
I could provide a minimal snippet tomorrow if needed; let me know.
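A minimal sketch of one way to read this with NetworkX, under stated assumptions. Linking every pair of related_user_ids that share an actor would merge def, efg and lal in the sample data (all three appear under abc), so this sketch assumes a narrower rule: grouping is driven by related_user_id, and an actor connects groups only when it also appears as a related_user_id somewhere. Actors that never do get a private node per row, which is what keeps abc from gluing the branches together. The ('actor', idx) node naming is purely illustrative.

```python
import networkx as nx
import pandas as pd

edges = pd.DataFrame({
    'actor_user_id': ['abc', 'abc', 'pat', 'abc'],
    'related_user_id': ['def', 'efg', 'def', 'lal'],
})

related_ids = set(edges['related_user_id'])

G = nx.Graph()
for idx, actor, related in edges[['actor_user_id', 'related_user_id']].itertuples():
    # Assumption: an actor that is never a related_user_id is split into one
    # node per row, so it cannot join otherwise unrelated branches.
    actor_node = actor if actor in related_ids else ('actor', idx)
    G.add_edge(actor_node, related)

# Give every node the number of the connected component it belongs to.
component_of = {}
for number, nodes in enumerate(nx.connected_components(G), start=1):
    for node in nodes:
        component_of[node] = number

# Each row inherits the component of its related_user_id.
edges['sub_graph'] = edges['related_user_id'].map(component_of)
print(edges)
```

With the sample data the two def rows share a sub_graph, while the efg and lal rows each get their own. For real data you would run this once per (properties_id, global_journey_id) group.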