pythonpython-polars

Polars groupby concat on multiple cols returning a list of unique values


Have a polars dataframe with the following data.

df = pl.DataFrame({"small": ["Apple Inc.", "Baidu Inc.", "Chevron Global", "Chevron Global", "Apple Inc."], "person": [10, 20, 30, 10, 10], "comp_big": ["Apple", "Baidu", "Chevron", "Chevron", "Apple"]})

Able to groupby on person, but this returns a dataframe of 2 lists.

(
    df
    .group_by("person")
    .agg(
        pl.col("small", "comp_big").unique()
    )
)

Returns

Groupby Result

This is close, but want to merge the result between small and comp_big into a single list.

(
    df
    .group_by("person")
    .agg(
        pl.col("small") + "|" + pl.col("comp_big")
    )
)

This combines both into a single list, but I would want to split on the pipe and get the unique values.

How can this be achieved?


Solution

  • You can use pl.concat_list() and then .flatten()

    (
        df
        .group_by("person")
        .agg(
            pl.concat_list("small", "comp_big")
              .flatten()
              .unique()
        )
    )
    
    shape: (3, 2)
    ┌────────┬───────────────────────────────────┐
    │ person ┆ small                             │
    │ ---    ┆ ---                               │
    │ i64    ┆ list[str]                         │
    ╞════════╪═══════════════════════════════════╡
    │ 20     ┆ ["Baidu", "Baidu Inc."]           │
    │ 30     ┆ ["Chevron", "Chevron Global"]     │
    │ 10     ┆ ["Chevron", "Apple Inc.", … "Che… │
    └────────┴───────────────────────────────────┘