
Polars arg_unique for list column


How can I obtain the indices of the first occurrence of each unique element for a list-typed column in a Polars DataFrame? I am looking for something similar to arg_unique, but that operates on a whole column (a pl.Series). I need it to work one level below that, on each list inside the column. Given the DataFrame

import polars as pl

df = pl.DataFrame({
    "fruits": [["apple", "banana", "apple", "orange"], ["grape", "apple", "grape"], ["kiwi", "mango", "kiwi"]]
})

I expect the output to be

df = pl.DataFrame({
    "fruits": [[0, 1, 3], [0, 1], [0, 1]]
})

Solution

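  • .list.eval() can be used as a fallback when no dedicated .list.* method is currently implemented. It runs the given expression on each list separately, with pl.element() referring to that list's elements.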

    df.with_columns(
        pl.col("fruits").list.eval(pl.element().arg_unique()).alias("idxs")
    )
    
    shape: (3, 2)
    ┌────────────────────────────────────────┬───────────┐
    │ fruits                                 ┆ idxs      │
    │ ---                                    ┆ ---       │
    │ list[str]                              ┆ list[u32] │
    ╞════════════════════════════════════════╪═══════════╡
    │ ["apple", "banana", "apple", "orange"] ┆ [0, 1, 3] │
    │ ["grape", "apple", "grape"]            ┆ [0, 1]    │
    │ ["kiwi", "mango", "kiwi"]              ┆ [0, 1]    │
    └────────────────────────────────────────┴───────────┘
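To cross-check the result above without Polars, the per-list semantics can be sketched in plain Python. This is only a reference sketch, not how Polars implements arg_unique; the helper name first_occurrence_indices is hypothetical and is introduced here purely for illustration.

```python
def first_occurrence_indices(values):
    """Return the index of the first occurrence of each distinct value, in list order."""
    seen = set()
    idxs = []
    for i, v in enumerate(values):
        if v not in seen:  # first time this value appears in the list
            seen.add(v)
            idxs.append(i)
    return idxs


# Same data as the "fruits" column in the question.
rows = [
    ["apple", "banana", "apple", "orange"],
    ["grape", "apple", "grape"],
    ["kiwi", "mango", "kiwi"],
]

expected = [first_occurrence_indices(row) for row in rows]
print(expected)  # [[0, 1, 3], [0, 1], [0, 1]]
```

The output matches the idxs column shown above, which is a quick way to confirm that the .list.eval() expression is applied to each list independently rather than to the column as a whole.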