pythonsetaggregateflattenpython-polars

polars aggregate list[str] column into set[str]


I have polars dataframe:

df = pl.DataFrame({
    'col1': [["aaa", "aaa"], ["bbb", "ccc"], ["ccc", "ddd", "ddd"], ["ddd", "ddd", "ddd"]],
    'col2': ["a", "a", "a", "a"],
    'col3': ["x", "x", "y", "y"]
})

I want to groupby col2, col3 and aggregate col1 into Set[String]

(df
.group_by("col2", "col3")
.agg(pl.col("col1").flatten().map_elements(set).alias("result"))
)

when I run it on 17 milion records then it performs very slow. after 10minutes it still does not complete.

how to speed it up?

EDIT:

This is how I solved it and it is blazing fast:

df = (
   df
   .with_columns(
       pl.col("col1").list.join(",")
   )
   .group_by("col2", "col3")
   .agg(
       pl.col("col1").alias("col1")
   )
   .with_columns(
       pl.col("col1").list.join(",")
   )
   .with_columns(
       pl.col("col1").str.split(",").list.unique().alias("col1")
   )
)
┌──────┬──────┬───────────────────────┐
│ col2 ┆ col3 ┆ col1                  │
│ ---  ┆ ---  ┆ ---                   │
│ str  ┆ str  ┆ list[str]             │
╞══════╪══════╪═══════════════════════╡
│ a    ┆ x    ┆ ["aaa", "bbb", "ccc"] │
│ a    ┆ y    ┆ ["ccc", "ddd"]        │
└──────┴──────┴───────────────────────┘

Solution

  • This is the final solution I came with: it uses polars native functions and it blazing fast.

    preprocess with join list[str] column into string with separator >> aggregate string column into list[string] >> postprocess: first repeat join list[str] column into string with separator so then split string on the separator and get unique values.

    df = (
       df
       .with_columns(
           pl.col("col1").list.join(",")
       )
       .group_by("col2", "col3")
       .agg(
           pl.col("col1").alias("col1")
       )
       .with_columns(
           pl.col("col1").list.join(",")
       )
       .with_columns(
           pl.col("col1").str.split(",").list.unique().alias("col1")
       )
    )
    

    this is the output:

    ┌──────┬──────┬───────────────────────┐
    │ col2 ┆ col3 ┆ col1                  │
    │ ---  ┆ ---  ┆ ---                   │
    │ str  ┆ str  ┆ list[str]             │
    ╞══════╪══════╪═══════════════════════╡
    │ a    ┆ x    ┆ ["aaa", "bbb", "ccc"] │
    │ a    ┆ y    ┆ ["ccc", "ddd"]        │
    └──────┴──────┴───────────────────────┘