I am new to polars and I wonder what the polars equivalent of pandas groupby.apply(drop_duplicates) is. Here is the code snippet I need to translate:
import pandas as pd
GROUP = list('123231232121212321')
OPERATION = list('AAABBABAAAABBABBBA')
BATCH = list('777898897889878987')
df_input = pd.DataFrame({'GROUP':GROUP, 'OPERATION':OPERATION, 'BATCH':BATCH})
df_output = df_input.groupby('GROUP').apply(lambda x: x.drop_duplicates())
I tried the following, but it does not output what I need:
import polars as pl
GROUP = list('123231232121212321')
OPERATION = list('AAABBABAAAABBABBBA')
BATCH = list('777898897889878987')
df_input = pl.DataFrame({'GROUP':GROUP, 'OPERATION':OPERATION, 'BATCH':BATCH})
df_output = df_input.group_by('GROUP').agg(pl.all().unique())
If I take only one group, I get what I want locally:
df_part = df_input.filter(pl.col('GROUP')=='2')
df_part[['OPERATION', 'BATCH']].unique()
Does anybody know how to do that?
It looks like you want the first instance of each OPERATION, BATCH "pairing" per GROUP.
You can use pl.struct to create the "pairing" and then use is_first_distinct() as a window function.
(df_input.with_row_index()
   .filter(pl.struct("OPERATION", "BATCH").is_first_distinct().over("GROUP"))
)
shape: (9, 4)
┌───────┬───────┬───────────┬───────┐
│ index ┆ GROUP ┆ OPERATION ┆ BATCH │
│ ---   ┆ ---   ┆ ---       ┆ ---   │
│ u32   ┆ str   ┆ str       ┆ str   │
╞═══════╪═══════╪═══════════╪═══════╡
│ 0     ┆ 1     ┆ A         ┆ 7     │
│ 1     ┆ 2     ┆ A         ┆ 7     │
│ 2     ┆ 3     ┆ A         ┆ 7     │
│ 3     ┆ 2     ┆ B         ┆ 8     │
│ 4     ┆ 3     ┆ B         ┆ 9     │
│ 5     ┆ 1     ┆ A         ┆ 8     │
│ 7     ┆ 3     ┆ A         ┆ 9     │
│ 10    ┆ 2     ┆ A         ┆ 8     │
│ 11    ┆ 1     ┆ B         ┆ 9     │
└───────┴───────┴───────────┴───────┘
The with_row_index is just used here as a visual guide to help see the removed rows (for example, index 6, 8, and 9).