python pandas networkx

How to assign subgraph IDs based on weakly connected user pairs, but split when no shared connection exists


I'm working with a dataset where I want to assign a sub_graph ID to user interactions. Each row in the data represents a directed edge between an actor_user_id and a related_user_id.

I want to compute a sub_graph ID such that:

Rows belong to the same sub_graph if they are connected (even indirectly) through shared users.

However, if two edges only share the same actor_user_id, they should not be grouped unless their related_user_ids are also connected.

Here’s a simplified example:

import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt

# Dummy data
edges = pd.DataFrame({
    'properties_id': ['A', 'A', 'A', 'A'],
    'global_journey_id': ['A1', 'A1', 'A1', 'A1'],
    'actor_user_id': ['abc', 'abc', 'pat', 'abc'],
    'related_user_id': ['def', 'efg', 'def', 'lal'],
})

Expected output:

actor_user_id  related_user_id  sub_graph
abc            def              1
pat            def              1
abc            efg              2
abc            lal              3

I've tried using NetworkX like this:

G = nx.DiGraph()
G.add_edges_from(zip(edges['actor_user_id'], edges['related_user_id']))
components = list(nx.weakly_connected_components(G))

But this gives me only one connected component because abc is a common node across many edges.
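
To make the problem concrete, here is a minimal reproduction on the sample data above (only the two user columns are kept, which is a simplification for illustration). It shows that abc acts as a hub joining otherwise unrelated related_user_ids:

```python
import networkx as nx
import pandas as pd

# Same actor/related pairs as the dummy data above
edges = pd.DataFrame({
    "actor_user_id": ["abc", "abc", "pat", "abc"],
    "related_user_id": ["def", "efg", "def", "lal"],
})

G = nx.DiGraph()
G.add_edges_from(zip(edges["actor_user_id"], edges["related_user_id"]))

components = list(nx.weakly_connected_components(G))
print(len(components))        # 1: everything collapses into a single component
print(sorted(components[0]))  # ['abc', 'def', 'efg', 'lal', 'pat']
print(G.out_degree("abc"))    # 3: abc points at def, efg and lal
```

Because weak connectivity ignores edge direction, any actor with several outgoing edges merges all of its targets into one component.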

Question:

How can I build logic that ensures subgraphs are only formed when there is a true "shared connection", not just a shared actor_user_id?

Is there a way to force the graph to disconnect branches that don’t have overlapping related_user_ids?

Any idea or trick to break these graphs down properly (even custom component-grouping logic) would be super appreciated.

Edit: Thanks to Daniel Raphael's comment below, I was able to get the code working. It now traverses from related_user_id to actor_user_id rather than the other way around:

import pandas as pd
from collections import defaultdict, deque

# Sample input
data = {
    'properties_id': ['A', 'A', 'A', 'A'],
    'global_journey_id': ['A1', 'A1', 'A1', 'A1'],
    'actor_user_id': ['abc', 'abc', 'pat', 'abc'],
    'related_user_id': ['def', 'efg', 'def', 'lal'],
}
df = pd.DataFrame(data)

# Container for results
results = []

# Group by properties_id and global_journey_id
grouped = df.groupby(['properties_id', 'global_journey_id'])

for (prop_id, journey_id), group in grouped:
    # Step 1: Build directional graph: related_user_id → list of actor_user_id
    graph = defaultdict(list)
    for _, row in group.iterrows():
        graph[row['related_user_id']].append(row['actor_user_id'])

    # Step 2: Build edge list (frozenset for undirected identity)
    edges = [frozenset([row['actor_user_id'], row['related_user_id']]) for _, row in group.iterrows()]
    edge_set = set(edges)

    # Step 3: Traverse the directed graph to find connected edge groups
    visited_edges = set()
    edge_to_subgraph = {}
    subgraph_id = 1

    # Iterate de-duplicated edges in row order so subgraph IDs are the same on every run
    for edge in dict.fromkeys(edges):
        if edge in visited_edges:
            continue

        node1, node2 = list(edge)
        start_node = node2 if node2 in graph else node1
        if start_node not in graph:
            continue

        queue = deque([start_node])
        connected_edges = set()

        while queue:
            current = queue.popleft()
            for target in graph.get(current, []):
                candidate_edge = frozenset([current, target])
                if candidate_edge in edge_set and candidate_edge not in visited_edges:
                    visited_edges.add(candidate_edge)
                    connected_edges.add(candidate_edge)
                    if target in graph:
                        queue.append(target)

        for e in connected_edges:
            edge_to_subgraph[e] = subgraph_id

        subgraph_id += 1

    # Step 4: Map subgraph ID to each row in the group
    for _, row in group.iterrows():
        edge_key = frozenset([row['actor_user_id'], row['related_user_id']])
        subgraph = edge_to_subgraph.get(edge_key, None)
        results.append({
            'properties_id': row['properties_id'],
            'global_journey_id': row['global_journey_id'],
            'actor_user_id': row['actor_user_id'],
            'related_user_id': row['related_user_id'],
            'sub_graph': subgraph
        })

# Final result
final_df = pd.DataFrame(results)
final_df = final_df.sort_values(by=['properties_id', 'global_journey_id', 'sub_graph'])
print(final_df)
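
For comparison, the same idea can probably be expressed more compactly with NetworkX. This is only a sketch under my own assumptions, not a drop-in replacement for the code above: the helper name assign_sub_graphs is made up for illustration, it handles a single (properties_id, global_journey_id) group, and it treats any actor that never appears as a related_user_id as a per-row copy so that it cannot bridge two branches. Actors that do appear as related users stay shared and can still link groups.

```python
import networkx as nx
import pandas as pd

def assign_sub_graphs(group: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: sub_graph ID per row for ONE journey group."""
    related_ids = set(group["related_user_id"])
    G = nx.Graph()
    for actor, related in zip(group["actor_user_id"], group["related_user_id"]):
        # Assumption: an actor that is never a related user must not join
        # branches, so it gets a separate node per (actor, related) pair.
        actor_node = actor if actor in related_ids else (actor, related)
        G.add_edge(actor_node, related)

    # Map every node to the index of its connected component
    component_of = {}
    for i, nodes in enumerate(nx.connected_components(G)):
        for node in nodes:
            component_of[node] = i

    # Number components 1, 2, 3, ... in order of first appearance in the rows
    labels = group["related_user_id"].map(component_of)
    codes, _ = pd.factorize(labels)
    return pd.Series(codes + 1, index=group.index, name="sub_graph")

sample = pd.DataFrame({
    "actor_user_id": ["abc", "abc", "pat", "abc"],
    "related_user_id": ["def", "efg", "def", "lal"],
})
sample["sub_graph"] = assign_sub_graphs(sample)
print(sample)
```

On the sample data this puts abc/def and pat/def in sub_graph 1, abc/efg in 2 and abc/lal in 3, matching the expected output above. For several journeys, the helper could be applied once per group from df.groupby(['properties_id', 'global_journey_id']).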

Solution

  • Here is an idea, flip the logic: