In Polars, how do you create a group counter / group ID?

How do you get a group_id column like this, grouping by columns col1 and col2 ?

col1	col2	group_id
A	Z	1
A	Y	2
A	Z	1
B	Z	3

based on such a DataFrame :

df = pl.DataFrame({
  'col1': ['A', 'A', 'A', 'B'],
  'col2': ['Z', 'Y', 'Z', 'Z']}
)

In other words, I'm looking for polars equivalent to R data.table .GRP (df[, group_id:=.GRP, by = .(col1, col2)]). Thanks !

Context : I want to build an event ID because in my data, I have many detailed rows for one event. Once the event ID is created, I will use it to perform various window operations. I prefer to have this event ID rather than keeping a list of grouping variables.

Solution

If you don't care about "first occurrence has lower rank" then you can just rank combination of col1 and col2 columns.

pl.struct() to create one column out of col1, col2.
pl.Expr.rank() to assign rank to rows.

df.with_columns(group_id = pl.struct("col1","col2").rank("dense"))

shape: (4, 3)
┌──────┬──────┬──────────┐
│ col1 ┆ col2 ┆ group_id │
│ ---  ┆ ---  ┆ ---      │
│ str  ┆ str  ┆ u32      │
╞══════╪══════╪══════════╡
│ A    ┆ Z    ┆ 2        │
│ A    ┆ Y    ┆ 1        │
│ A    ┆ Z    ┆ 2        │
│ B    ┆ Z    ┆ 3        │
└──────┴──────┴──────────┘

Otherwise, it will be a bit more complicated - you can create index to track current row order and then assign rank based on this index.

pl.DataFrame.with_row_index() to create an index / row number.
pl.Expr.min() and pl.Expr.over() to get minimum index within col, col2.
pl.Expr.rank() to "compress" this minimum to dense rank.

(
    df.with_row_index("group_id")
    .with_columns(
        pl.col.group_id.min().over("col1","col2").rank("dense")
    )
)

shape: (4, 3)
┌──────────┬──────┬──────┐
│ group_id ┆ col1 ┆ col2 │
│ ---      ┆ ---  ┆ ---  │
│ u32      ┆ str  ┆ str  │
╞══════════╪══════╪══════╡
│ 1        ┆ A    ┆ Z    │
│ 2        ┆ A    ┆ Y    │
│ 1        ┆ A    ┆ Z    │
│ 3        ┆ B    ┆ Z    │
└──────────┴──────┴──────┘