I have a Polars DataFrame:
import polars as pl

df = pl.DataFrame({
    'col1': [["aaa", "aaa"], ["bbb", "ccc"], ["ccc", "ddd", "ddd"], ["ddd", "ddd", "ddd"]],
    'col2': ["a", "a", "a", "a"],
    'col3': ["x", "x", "y", "y"]
})
I want to group by col2 and col3 and aggregate col1 into a Set[String]:
(df
    .group_by("col2", "col3")
    # map_elements(set) runs a Python callable once per group
    .agg(pl.col("col1").flatten().map_elements(set).alias("result"))
)
When I run it on 17 million records it is very slow: after 10 minutes it still has not completed.
How can I speed it up?
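I suspect the bottleneck is map_elements(set), which leaves the native engine and calls a Python function for every group. A pure-expression sketch of what I am after (not benchmarked at this scale):

result = (
    df
    .group_by("col2", "col3")
    # flatten the per-group lists, then deduplicate natively
    .agg(pl.col("col1").flatten().unique().alias("result"))
)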
EDIT:
This is how I solved it, and it is blazing fast:
df = (
    df
    # 1. join each list[str] row into one comma-separated string
    .with_columns(
        pl.col("col1").list.join(",")
    )
    # 2. collect the per-group strings into a list
    .group_by("col2", "col3")
    .agg(
        pl.col("col1").alias("col1")
    )
    # 3. join that list into a single string again
    .with_columns(
        pl.col("col1").list.join(",")
    )
    # 4. split on the separator and keep the unique values
    .with_columns(
        pl.col("col1").str.split(",").list.unique().alias("col1")
    )
)
┌──────┬──────┬───────────────────────┐
│ col2 ┆ col3 ┆ col1 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ list[str] │
╞══════╪══════╪═══════════════════════╡
│ a ┆ x ┆ ["aaa", "bbb", "ccc"] │
│ a ┆ y ┆ ["ccc", "ddd"] │
└──────┴──────┴───────────────────────┘
The trick is to stay in native Polars expressions end to end: join each list[str] row into a single comma-separated string, aggregate those strings into a list per group, join that list into one string again, then split on the separator and take the unique values. One caveat: this assumes the values themselves never contain the "," separator.
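If it ever needs to scale further, the same pipeline should also work through the lazy engine, which lets Polars optimize the whole query before executing it; a minimal sketch, assuming the same schema as above:

result = (
    df.lazy()
    # same four steps as above, just deferred until .collect()
    .with_columns(pl.col("col1").list.join(","))
    .group_by("col2", "col3")
    .agg(pl.col("col1"))
    .with_columns(pl.col("col1").list.join(","))
    .with_columns(pl.col("col1").str.split(",").list.unique())
    .collect()
)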