python group-by aggregate python-polars drop-duplicates

polars equivalent of pandas groupby.apply(drop_duplicates)

I am new to polars and am looking for the polars equivalent of pandas groupby.apply(drop_duplicates). Here is the code snippet I need to translate:

import pandas as pd

GROUP = list('123231232121212321')
OPERATION = list('AAABBABAAAABBABBBA')
BATCH = list('777898897889878987')

df_input = pd.DataFrame({'GROUP':GROUP, 'OPERATION':OPERATION, 'BATCH':BATCH})
df_output = df_input.groupby('GROUP').apply(lambda x: x.drop_duplicates())

[Images: input data; desired output]

I tried the following, but it does not produce the output I need:

import polars as pl

GROUP = list('123231232121212321')
OPERATION = list('AAABBABAAAABBABBBA')
BATCH = list('777898897889878987')

df_input = pl.DataFrame({'GROUP':GROUP, 'OPERATION':OPERATION, 'BATCH':BATCH})
df_output = df_input.group_by('GROUP').agg(pl.all().unique())

If I take only one group, I get what I want for that group:

df_part = df_input.filter(pl.col('GROUP')=='2')
df_part[['OPERATION', 'BATCH']].unique()

Does anybody know how to do this for all groups at once?

Solution

It looks like you want the first instance of each (OPERATION, BATCH) "pairing" per GROUP.

You can use pl.struct to create the "pairing" and then use is_first_distinct() as a Window function.

(df_input.with_row_index()
   .filter(pl.struct("OPERATION", "BATCH").is_first_distinct().over("GROUP"))
)

shape: (9, 4)
┌───────┬───────┬───────────┬───────┐
│ index ┆ GROUP ┆ OPERATION ┆ BATCH │
│ ---   ┆ ---   ┆ ---       ┆ ---   │
│ u32   ┆ str   ┆ str       ┆ str   │
╞═══════╪═══════╪═══════════╪═══════╡
│ 0     ┆ 1     ┆ A         ┆ 7     │
│ 1     ┆ 2     ┆ A         ┆ 7     │
│ 2     ┆ 3     ┆ A         ┆ 7     │
│ 3     ┆ 2     ┆ B         ┆ 8     │
│ 4     ┆ 3     ┆ B         ┆ 9     │
│ 5     ┆ 1     ┆ A         ┆ 8     │
│ 7     ┆ 3     ┆ A         ┆ 9     │
│ 10    ┆ 2     ┆ A         ┆ 8     │
│ 11    ┆ 1     ┆ B         ┆ 9     │
└───────┴───────┴───────────┴───────┘

The with_row_index() call is only there as a visual guide to help spot the removed rows (for example, index 6, 8 and 9).