python group-by aggregate python-polars drop-duplicates

polars equivalent of pandas groupby.apply(drop_duplicates)


I am new to polars and am wondering what the equivalent of pandas groupby.apply(drop_duplicates) is in polars. Here is the code snippet I need to translate:

import pandas as pd

GROUP = list('123231232121212321')
OPERATION = list('AAABBABAAAABBABBBA')
BATCH = list('777898897889878987')

df_input = pd.DataFrame({'GROUP':GROUP, 'OPERATION':OPERATION, 'BATCH':BATCH})
df_output = df_input.groupby('GROUP').apply(lambda x: x.drop_duplicates())

[Images: input data, desired output]

I tried the following, but it does not produce the output I need:

import polars as pl

GROUP = list('123231232121212321')
OPERATION = list('AAABBABAAAABBABBBA')
BATCH = list('777898897889878987')

df_input = pl.DataFrame({'GROUP':GROUP, 'OPERATION':OPERATION, 'BATCH':BATCH})
df_output = df_input.group_by('GROUP').agg(pl.all().unique())

If I take only one group, I get what I want locally:

df_part = df_input.filter(pl.col('GROUP')=='2')
df_part[['OPERATION', 'BATCH']].unique()

Does anybody know how to do that?


Solution

  • It looks like you want the first instance of each (OPERATION, BATCH) "pairing" per GROUP.

    You can use pl.struct to create the "pairing" and then use is_first_distinct() as a window function via .over("GROUP").

    (df_input.with_row_index()
       .filter(pl.struct("OPERATION", "BATCH").is_first_distinct().over("GROUP"))
    )
    
    shape: (9, 4)
    ┌───────┬───────┬───────────┬───────┐
    │ index ┆ GROUP ┆ OPERATION ┆ BATCH │
    │ ---   ┆ ---   ┆ ---       ┆ ---   │
    │ u32   ┆ str   ┆ str       ┆ str   │
    ╞═══════╪═══════╪═══════════╪═══════╡
    │ 0     ┆ 1     ┆ A         ┆ 7     │
    │ 1     ┆ 2     ┆ A         ┆ 7     │
    │ 2     ┆ 3     ┆ A         ┆ 7     │
    │ 3     ┆ 2     ┆ B         ┆ 8     │
    │ 4     ┆ 3     ┆ B         ┆ 9     │
    │ 5     ┆ 1     ┆ A         ┆ 8     │
    │ 7     ┆ 3     ┆ A         ┆ 9     │
    │ 10    ┆ 2     ┆ A         ┆ 8     │
    │ 11    ┆ 1     ┆ B         ┆ 9     │
    └───────┴───────┴───────────┴───────┘
    

    The with_row_index call is only used here as a visual guide to help see the removed rows (for example, indices 6, 8, and 9 are missing from the output).