pythonpython-polars

In Polars, how do you create a group counter / group ID?


How do you get a group_id column like this, grouping by columns col1 and col2 ?

col1 col2 group_id
A Z 1
A Y 2
A Z 1
B Z 3

based on such a DataFrame :

df = pl.DataFrame({
  'col1': ['A', 'A', 'A', 'B'],
  'col2': ['Z', 'Y', 'Z', 'Z']}
)

In other words, I'm looking for polars equivalent to R data.table .GRP (df[, group_id:=.GRP, by = .(col1, col2)]). Thanks !

Context : I want to build an event ID because in my data, I have many detailed rows for one event. Once the event ID is created, I will use it to perform various window operations. I prefer to have this event ID rather than keeping a list of grouping variables.


Solution

  • If you don't care about "first occurrence has lower rank" then you can just rank combination of col1 and col2 columns.

    df.with_columns(group_id = pl.struct("col1","col2").rank("dense"))
    
    shape: (4, 3)
    ┌──────┬──────┬──────────┐
    │ col1 ┆ col2 ┆ group_id │
    │ ---  ┆ ---  ┆ ---      │
    │ str  ┆ str  ┆ u32      │
    ╞══════╪══════╪══════════╡
    │ A    ┆ Z    ┆ 2        │
    │ A    ┆ Y    ┆ 1        │
    │ A    ┆ Z    ┆ 2        │
    │ B    ┆ Z    ┆ 3        │
    └──────┴──────┴──────────┘
    

    Otherwise, it will be a bit more complicated - you can create index to track current row order and then assign rank based on this index.

    (
        df.with_row_index("group_id")
        .with_columns(
            pl.col.group_id.min().over("col1","col2").rank("dense")
        )
    )
    
    shape: (4, 3)
    ┌──────────┬──────┬──────┐
    │ group_id ┆ col1 ┆ col2 │
    │ ---      ┆ ---  ┆ ---  │
    │ u32      ┆ str  ┆ str  │
    ╞══════════╪══════╪══════╡
    │ 1        ┆ A    ┆ Z    │
    │ 2        ┆ A    ┆ Y    │
    │ 1        ┆ A    ┆ Z    │
    │ 3        ┆ B    ┆ Z    │
    └──────────┴──────┴──────┘