Tags: python, dataframe, python-polars, polars

What's the best way to group by more than one column?


Essentially, I want to group together all records that match on either key.

Input:

|id|key1|key2|
|-|-|-|
|1|a|x|
|2|a|y|
|3|b|y|
|4|c|z|

Desired output: (The first 3 rows are grouped together)

|key1 (or any identifier, I only care about the final counts)|len|
|-|-|
|a|3|
|c|1|

My attempted (incomplete) solution:

import polars as pl

df = pl.DataFrame({
    'id': [1, 2, 3, 4],
    'key1': ['a', 'a', 'b', 'c'],
    'key2': ['x', 'y', 'y', 'z'],
})
df = (
    df.group_by(
        'key1',
        'key2',
    )
    .len()
    .group_by('key1')
    .agg(
        pl.col('key2'),
        pl.col('len').sum()
    )
)
print(df)

Which gives:

|key1|key2|len|
|-|-|-|
|a|["y", "x"]|2|
|b|["y"]|1|
|c|["z"]|1|

However, I'm not sure how I would further group this by key2 (merging the a and b rows, since they share the common value y) while preserving the sum of len.


Solution

  • If I understand you correctly, you want to create equivalence relationships between key1 and key2, and keep merging rows into groups as long as a relationship chain connects them? So in this case, since there is a row (a, y), any rows containing either a or y should be merged?

    In this case it's hard to express this directly in Polars, but you can use a hybrid approach with scipy's DisjointSet.

    First, gather all the unique keys and relationships:

    keys = df.select(pl.col.key1.append(pl.col.key2).unique())
    edges = df.select(["key1", "key2"]).unique()
    

    Then, insert them into a DisjointSet and merge into groups:

    from scipy.cluster.hierarchy import DisjointSet
    
    ds = DisjointSet(keys.to_series().to_list())
    for a, b in edges.iter_rows():
        ds.merge(a, b)
    

    Then convert the DisjointSet into a DataFrame mapping key to group index:

    group_idx = (
        pl.DataFrame({"key": [list(s) for s in ds.subsets()]})
            .lazy()
            .with_row_index("group_idx")
            .explode("key")
    )
    

    Finally, we join the group_idx to our original dataframe and group by the group index:

    (df.lazy()
        .join(group_idx, left_on="key1", right_on="key")
        .group_by("group_idx")
        .len()
        .collect())
    
    

    Thus giving our output:

    ┌───────────┬─────┐
    │ group_idx ┆ len │
    │ ---       ┆ --- │
    │ u32       ┆ u32 │
    ╞═══════════╪═════╡
    │ 0         ┆ 1   │
    │ 1         ┆ 3   │
    └───────────┴─────┘