I need a way to count the unique pairs of values from two columns within each context. Basically like n_unique, but over two columns and as a window function.
To illustrate with a toy example:
import polars as pl
dataframe = pl.DataFrame({
    'context': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'column1': [1, 1, 0, 1, 0, 0, 1, 0, 1],
    'column2': [1, 0, 0, 0, 1, 1, 1, 0, 1]
    # unique:   1  2  3  1  2  -  1  2  -
    # n_unique: -- 3  -- -- 2  -- -- 2  --
})
I would like to write:
dataframe = (
    dataframe
    .with_columns(
        pl.n_unique('column1', 'column2').over('context').alias('n_unique')
    )
)
to get the number of unique value pairs from column1, column2 within the window of column 'context'. But that does not work.
One attempt I made was this:
(dataframe
    .with_columns(
        pl.concat_list('column1', 'column2').alias('pair')
    )
    .with_columns(
        pl.n_unique('pair').over('context')
    )
)
This works, but is there a better way?
Every expression is a function of the form Fn(Series) -> Series. So if you want to compute something over multiple columns, the input Series itself must contain those columns.
The easy way to do that is to pack them into a Struct data type.
import polars as pl
df = pl.DataFrame({
    'context': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'column1': [1, 1, 0, 1, 0, 0, 1, 0, 1],
    'column2': [1, 0, 0, 0, 1, 1, 1, 0, 1]
})
# Pack both columns into one Struct column, then count the distinct
# structs within each 'context' window.
df.with_columns(
    pl.struct("column1", "column2").n_unique().over("context").alias("n_unique")
)
shape: (9, 4)
┌─────────┬─────────┬─────────┬──────────┐
│ context ┆ column1 ┆ column2 ┆ n_unique │
│ ---     ┆ ---     ┆ ---     ┆ ---      │
│ i64     ┆ i64     ┆ i64     ┆ u32      │
╞═════════╪═════════╪═════════╪══════════╡
│ 1       ┆ 1       ┆ 1       ┆ 3        │
│ 1       ┆ 1       ┆ 0       ┆ 3        │
│ 1       ┆ 0       ┆ 0       ┆ 3        │
│ 2       ┆ 1       ┆ 0       ┆ 2        │
│ 2       ┆ 0       ┆ 1       ┆ 2        │
│ 2       ┆ 0       ┆ 1       ┆ 2        │
│ 3       ┆ 1       ┆ 1       ┆ 2        │
│ 3       ┆ 0       ┆ 0       ┆ 2        │
│ 3       ┆ 1       ┆ 1       ┆ 2        │
└─────────┴─────────┴─────────┴──────────┘