Tags: python, dataframe, python-polars, polars

What's the best way to group by more than one column?


Essentially, I want to group together all records that match on either key.

Input:

|id|key1|key2|
|-|-|-|
|1|a|x|
|2|a|y|
|3|b|y|
|4|c|z|

Desired output: (The first 3 rows are grouped together)

|key1 (or any identifier, I only care about the final counts)|len|
|-|-|
|a|3|
|c|1|

My attempted (incomplete) solution:

import polars as pl

df = pl.DataFrame({
    'id': [1, 2, 3, 4],
    'key1': ['a', 'a', 'b', 'c'],
    'key2': ['x', 'y', 'y', 'z'],
})
df = (
    df.group_by(
        'key1',
        'key2',
    )
    .len()
    .group_by('key1')
    .agg(
        pl.col('key2'),
        pl.col('len').sum()
    )
)
print(df)

Which gives:

|key1|key2|len|
|-|-|-|
|a|["y", "x"]|2|
|b|["y"]|1|
|c|["z"]|1|

However, I'm not sure how I would further group this by key2 (merging the a and b rows, since they share the common value y) while preserving the sum of len.


Solution

  • If I understand you correctly, you want to create equivalence relationships between key1 and key2, and keep merging rows into groups as long as a relationship chain connects them? So in this case, since there is a row (a, y), any rows containing either a or y should be merged?

    In this case it's hard to express this directly in Polars, but you can use a hybrid approach with scipy's DisjointSet.

    First, gather all the unique keys and relationships:

    keys = df.select(pl.col.key1.append(pl.col.key2).unique())
    edges = df.select(["key1", "key2"]).unique()
    

    Then, insert them into a DisjointSet and merge into groups:

    from scipy.cluster.hierarchy import DisjointSet
    
    ds = DisjointSet(keys.to_series().to_list())
    for a, b in edges.iter_rows():
        ds.merge(a, b)
    

    Then convert the DisjointSet into a DataFrame mapping key to group index:

    group_idx = (
        pl.DataFrame({"key": [list(s) for s in ds.subsets()]})
            .lazy()
            .with_row_index("group_idx")
            .explode("key")
    )
    

    Finally, we join the group_idx to our original dataframe and group by the group index:

    (df.lazy()
        .join(group_idx, left_on="key1", right_on="key")
        .group_by("group_idx")
        .len()
        .collect())
    
    

    Thus giving our output:

    ┌───────────┬─────┐
    │ group_idx ┆ len │
    │ ---       ┆ --- │
    │ u32       ┆ u32 │
    ╞═══════════╪═════╡
    │ 0         ┆ 1   │
    │ 1         ┆ 3   │
    └───────────┴─────┘