Essentially, I want to group all records that match either key
Input:
|id|key1|key2|
|-|-|-|
|1|a|x|
|2|a|y|
|3|b|y|
|4|c|z|
Desired output: (The first 3 rows are grouped together)
|key1 (or any identifier, I only care about the final counts)|len|
|-|-|
|a|3|
|c|1|
My attempted (incomplete) solution:
import polars as pl
df = pl.DataFrame({
'id': [1, 2, 3, 4],
'key1': ['a', 'a', 'b', 'c'],
'key2': ['x', 'y', 'y', 'z'],
})
df = (
df.group_by(
'key1',
'key2',
)
.len()
.group_by('key1')
.agg(
pl.col('key2'),
pl.col('len').sum()
)
)
print(df)
Which gives:
|key1|key2|len|
|-|-|-|
|a|["y", "x"]|2|
|b|["y"]|1|
|c|["z"]|1|
However, I'm not sure how I would further group this by key2 (merging the a and b rows, since they have a common value of y) while preserving the sum of len
If I understand you correctly, you want to create equivalence relationships between key1
and key2
, and keep merging things into groups if there exists a relationship chain that connects the two? So in this case since there is a row a, y
any rows which have a
or y
in them should be merged?
In this case it's hard to express this directly in Polars, but you can do a hybrid with scipy
's DisjointSet
.
First, gather all the unique keys and relationships:
keys = df.select(pl.col.key1.append(pl.col.key2).unique())
edges = df.select(["key1", "key2"]).unique()
Then, insert them into a DisjointSet
and merge into groups:
from scipy.cluster.hierarchy import DisjointSet
ds = DisjointSet(keys.to_series().to_list())
for a, b in edges.iter_rows():
ds.merge(a, b)
Then convert the DisjointSet
into a DataFrame
mapping key to group index:
group_idx = (
pl.DataFrame({"key": [list(s) for s in ds.subsets()]})
.lazy()
.with_row_index("group_idx")
.explode("key")
)
Finally, we join the group_idx
to our original dataframe and group by the group index:
(df.lazy()
.join(group_idx, left_on="key1", right_on="key")
.group_by("group_idx")
.len()
.collect())
Thus giving our output:
┌───────────┬─────┐
│ group_idx ┆ len │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞═══════════╪═════╡
│ 0 ┆ 1 │
│ 1 ┆ 3 │
└───────────┴─────┘