How do you get a group_id
column like this, grouping by columns col1
and col2
?
col1 | col2 | group_id |
---|---|---|
A | Z | 1 |
A | Y | 2 |
A | Z | 1 |
B | Z | 3 |
based on such a DataFrame :
df = pl.DataFrame({
'col1': ['A', 'A', 'A', 'B'],
'col2': ['Z', 'Y', 'Z', 'Z']}
)
In other words, I'm looking for polars equivalent to R data.table .GRP
(df[, group_id:=.GRP, by = .(col1, col2)]
). Thanks !
Context : I want to build an event ID because in my data, I have many detailed rows for one event. Once the event ID is created, I will use it to perform various window operations. I prefer to have this event ID rather than keeping a list of grouping variables.
If you don't care about "first occurrence has lower rank" then you can just rank
combination of col1
and col2
columns.
pl.struct()
to create one column out of col1
, col2
.pl.Expr.rank()
to assign rank to rows.df.with_columns(group_id = pl.struct("col1","col2").rank("dense"))
shape: (4, 3)
┌──────┬──────┬──────────┐
│ col1 ┆ col2 ┆ group_id │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ u32 │
╞══════╪══════╪══════════╡
│ A ┆ Z ┆ 2 │
│ A ┆ Y ┆ 1 │
│ A ┆ Z ┆ 2 │
│ B ┆ Z ┆ 3 │
└──────┴──────┴──────────┘
Otherwise, it will be a bit more complicated - you can create index to track current row order and then assign rank based on this index.
pl.DataFrame.with_row_index()
to create an index / row number.pl.Expr.min()
and pl.Expr.over()
to get minimum index within col
, col2
.pl.Expr.rank()
to "compress" this minimum to dense rank.(
df.with_row_index("group_id")
.with_columns(
pl.col.group_id.min().over("col1","col2").rank("dense")
)
)
shape: (4, 3)
┌──────────┬──────┬──────┐
│ group_id ┆ col1 ┆ col2 │
│ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str │
╞══════════╪══════╪══════╡
│ 1 ┆ A ┆ Z │
│ 2 ┆ A ┆ Y │
│ 1 ┆ A ┆ Z │
│ 3 ┆ B ┆ Z │
└──────────┴──────┴──────┘