I have a dataframe like this:
index | A | B | C |
---|---|---|---|
1 | 111 | 222 | 111 |
2 | 111 | 222 | 222 |
3 | 111 | 111 | 555 |
4 | 222 | 222 | 444 |
5 | 222 | 333 | 111 |
6 | 222 | 444 | 333 |
7 | 333 | 555 | 777 |
8 | 444 | 666 | 777 |
df = pd.DataFrame({
'A': [111,111,111,222,222,222,333,444],
'B': [222,222,111,222,333,444,555,666],
'C': [111,222,555,444,111,333,777,777]
})
I want to create new column 'cluster' and assign same id to records which are connected directly or through common group in one of the columns.
Meaning, here for example, we see that first 3 elements connected by same group in 'A', but they also connected to other records which have same groups '222', '111' in column 'B'. And all records which have '111', '222', '555' in column 'C'.
So basically, all first 6 elements should have same cluster Id.
index | A | B | C | cluster |
---|---|---|---|---|
1 | 111 | 222 | 111 | 1 |
2 | 111 | 222 | 222 | 1 |
3 | 111 | 111 | 555 | 1 |
4 | 222 | 222 | 444 | 1 |
5 | 222 | 333 | 111 | 1 |
6 | 222 | 444 | 333 | 1 |
7 | 333 | 555 | 777 | 2 |
8 | 444 | 666 | 777 | 2 |
Records 4-6 are connected to 1-3 as they form a group in column A and they are connected to previous records through columns B and C.
I was playing with multiple consequent apply functions on pairs of columns, but now thinking of applying connected components here, but can't figure out how to do that.
Also, the main problem is that this dataset is huge, > 30 000 000 records.
Appreciate any help.
You can indeed achieve this using networkx.connected_components
, after reshaping your DataFrame to build the consecutive pairs of edges. This won't be so fast with 33M rows though:
import networkx as nx
tmp = (df
.melt(ignore_index=False, value_name='source')
.assign(target=lambda d: d.groupby(level=0)['source'].shift())
.dropna(subset=['source', 'target'])
.drop_duplicates(subset=['source', 'target'])
)
G = nx.from_pandas_edgelist(tmp)
groups = {n: next(iter(g)) for g in nx.connected_components(G) for n in g}
df['cluster'] = df['A'].map(groups)
Output:
A B C cluster
0 111 02 001 002
1 111 02 002 002
2 111 01 005 002
3 222 02 004 002
4 222 03 001 002
5 222 04 003 002
Graph:
There are several options, an easy one is to use strings and to prefix the column name in the node
tmp1 = df.astype(str).radd([f'{c}|' for c in df])
tmp = (tmp1
.melt(ignore_index=False, value_name='source')
.assign(target=lambda d: d.groupby(level=0)['source'].shift())
.dropna(subset=['source', 'target'])
.drop_duplicates(subset=['source', 'target'])
)
G = nx.from_pandas_edgelist(tmp)
groups = {n: i for i, g in enumerate(nx.connected_components(G), start=1)
for n in g}
df['cluster'] = tmp1['A'].map(groups)
Output:
A B C cluster
0 111 222 111 1
1 111 222 222 1
2 111 111 555 1
3 222 222 444 1
4 222 333 111 1
5 222 444 333 1
6 333 555 777 2
7 444 666 777 2
Graph: