python, python-3.x, group-by, python-polars

Group-By column in polars DataFrame inside with_columns


I have the following dataframe:

import polars as pl

df = pl.DataFrame({
    'ID': [1, 1, 5, 5, 7, 7, 7],
    'YEAR': [2025, 2025, 2023, 2024, 2020, 2021, 2021]
})
shape: (7, 2)
┌─────┬──────┐
│ ID  ┆ YEAR │
│ --- ┆ ---  │
│ i64 ┆ i64  │
╞═════╪══════╡
│ 1   ┆ 2025 │
│ 1   ┆ 2025 │
│ 5   ┆ 2023 │
│ 5   ┆ 2024 │
│ 7   ┆ 2020 │
│ 7   ┆ 2021 │
│ 7   ┆ 2021 │
└─────┴──────┘

Now I would like to get the number of unique years per ID, i.e.

shape: (7, 3)
┌─────┬──────┬──────────────┐
│ ID  ┆ YEAR ┆ UNIQUE_YEARS │
│ --- ┆ ---  ┆ ---          │
│ i64 ┆ i64  ┆ u32          │
╞═════╪══════╪══════════════╡
│ 1   ┆ 2025 ┆ 1            │
│ 1   ┆ 2025 ┆ 1            │
│ 5   ┆ 2023 ┆ 2            │
│ 5   ┆ 2024 ┆ 2            │
│ 7   ┆ 2020 ┆ 2            │
│ 7   ┆ 2021 ┆ 2            │
│ 7   ┆ 2021 ┆ 2            │
└─────┴──────┴──────────────┘

I tried df.with_columns(pl.col('YEAR').over('ID').alias('UNIQUE_YEARS')), but this gives the wrong result: without an aggregation, the expression just returns the YEAR values unchanged within each group. So I came up with

df.join(df.group_by('ID').agg(pl.col('YEAR').unique().len().alias('UNIQUE_YEARS')), on='ID', how='left')

which does give the correct result! But it looks a bit clunky, and I wonder if there is a more natural way using with_columns and over?


Solution

  • You can use Expr.n_unique:

    out = df.with_columns(
        pl.col('YEAR').n_unique().over('ID').alias('UNIQUE_YEARS')
    )

    Output:

    shape: (7, 3)
    ┌─────┬──────┬──────────────┐
    │ ID  ┆ YEAR ┆ UNIQUE_YEARS │
    │ --- ┆ ---  ┆ ---          │
    │ i64 ┆ i64  ┆ u32          │
    ╞═════╪══════╪══════════════╡
    │ 1   ┆ 2025 ┆ 1            │
    │ 1   ┆ 2025 ┆ 1            │
    │ 5   ┆ 2023 ┆ 2            │
    │ 5   ┆ 2024 ┆ 2            │
    │ 7   ┆ 2020 ┆ 2            │
    │ 7   ┆ 2021 ┆ 2            │
    │ 7   ┆ 2021 ┆ 2            │
    └─────┴──────┴──────────────┘
    

    Similarly, your group_by version can use .n_unique() instead of .unique().len().
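
    For completeness, a sketch of that aggregated variant, which gives one row per ID rather than broadcasting the count back to every row (maintain_order=True is only there to keep the groups in first-seen order):

    ```python
    import polars as pl

    df = pl.DataFrame({
        'ID': [1, 1, 5, 5, 7, 7, 7],
        'YEAR': [2025, 2025, 2023, 2024, 2020, 2021, 2021],
    })

    # One row per ID with the count of distinct years
    agg = df.group_by('ID', maintain_order=True).agg(
        pl.col('YEAR').n_unique().alias('UNIQUE_YEARS')
    )
    ```

    Joining agg back onto df on 'ID' reproduces your original join-based result, while n_unique().over('ID') does the broadcast in one step.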