
Polars-Python: Compare list columns


In Python Polars, I have a dataframe like the below:

import polars as pl

df = pl.DataFrame(
    {"sets": [[1, 2, 3], [1, 2], [9, 10]], "optional_members": [[1, 2, 3], [1, 2], [9, 0]]}
)

shape: (2, 2)
┌───────────┬──────────────────┐
│ sets      ┆ optional_members │
│ ---       ┆ ---              │
│ list[i64] ┆ list[i64]        │
╞═══════════╪══════════════════╡
│ [1, 4, 3] ┆ [1, 2, 3]        │
│ [1, 0]    ┆ [1, 2]           │
└───────────┴──────────────────┘

I would like to build an expression that gets me the elements of the first column that are in the second, keeping the shape of the former, i.e:


shape: (2, 3)
┌───────────┬──────────────────┬─────────────────────┐
│ sets      ┆ optional_members ┆ result              │
│ ---       ┆ ---              ┆ ---                 │
│ list[i64] ┆ list[i64]        ┆ list[bool]          │
╞═══════════╪══════════════════╪═════════════════════╡
│ [1, 4, 3] ┆ [1, 2, 3]        ┆ [true, false, true] │
│ [1, 0]    ┆ [1, 2]           ┆ [true, false]       │
└───────────┴──────────────────┴─────────────────────┘

I have tried using eval over the first list, something like:


func = lambda x, y: y.list.contains(x)

# Evaluate over the first list column ("sets"), checking each element
# against "optional_members"
df.with_columns(
    contains=pl.col("sets").list.eval(
        func(pl.element(), pl.col("optional_members"))
    )
)

But this fails, because a pl.col() expression cannot be used inside an eval.

How could we address this while keeping the solution in a single expression?

Thanks to @roman's comment, one point needs to be made: the check should be done regardless of position.
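
To make that requirement concrete, here is a minimal pure-Python sketch of the two possible semantics. It is only a reference for the expected values, not a Polars solution, and the helper names same_position and membership are hypothetical.

```python
def same_position(sets, members):
    """Compare elements pairwise; zip stops at the shorter list."""
    return [a == b for a, b in zip(sets, members)]


def membership(sets, members):
    """For each element of `sets`, check whether it occurs anywhere in `members`."""
    lookup = set(members)
    return [x in lookup for x in sets]


# The two semantics differ as soon as the order differs:
print(same_position([2, 1], [1, 2]))  # [False, False]
print(membership([2, 1], [1, 2]))     # [True, True]
```

The second function is the behaviour I am after: the result has the shape of the first list, and each flag only depends on whether the value is present in the second list.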


Solution

  • If you need to compare elements at the same position:

    df.with_columns(
        (pl.col.sets.list.explode() == pl.col.optional_members.list.explode())
        .implode()
        .over(pl.int_range(pl.len()))
        .alias("result")
    )
    
    shape: (3, 3)
    ┌───────────┬──────────────────┬────────────────────┐
    │ sets      ┆ optional_members ┆ result             │
    │ ---       ┆ ---              ┆ ---                │
    │ list[i64] ┆ list[i64]        ┆ list[bool]         │
    ╞═══════════╪══════════════════╪════════════════════╡
    │ [1, 2, 3] ┆ [1, 2, 3]        ┆ [true, true, true] │
    │ [1, 2]    ┆ [1, 2]           ┆ [true, true]       │
    │ [9, 10]   ┆ [9, 0]           ┆ [true, false]      │
    └───────────┴──────────────────┴────────────────────┘
    

  • If you need to compare elements regardless of position, then it's a bit more complicated:

    df.with_columns(
        pl.col.sets.explode().is_in(pl.col.optional_members.explode())
        .implode()
        .over(pl.int_range(pl.len()))
        .alias("result")
    )
    
    shape: (3, 3)
    ┌───────────┬──────────────────┬────────────────────┐
    │ sets      ┆ optional_members ┆ result             │
    │ ---       ┆ ---              ┆ ---                │
    │ list[i64] ┆ list[i64]        ┆ list[bool]         │
    ╞═══════════╪══════════════════╪════════════════════╡
    │ [1, 2, 3] ┆ [1, 2, 3]        ┆ [true, true, true] │
    │ [1, 2]    ┆ [1, 2]           ┆ [true, true]       │
    │ [9, 10]   ┆ [9, 0]           ┆ [true, false]      │
    └───────────┴──────────────────┴────────────────────┘
    

    If your lists are not very long, or if all the lists have the same length, you can also check each position of sets against optional_members and concatenate the results:

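    # Length of the longest list in "sets"
    m = df.select(pl.col.sets.list.len().max()).item()
    
    df.with_columns(
        pl.concat_list(
            # One boolean per position i; positions past the end of a list give null
            pl.col.optional_members.list.contains(pl.col.sets.list.get(i, null_on_oob=True))
            for i in range(m)
        ).list.head(pl.col.sets.list.len())  # trim each row back to its own length
        .alias("result")
    )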
    
    shape: (3, 3)
    ┌───────────┬──────────────────┬────────────────────┐
    │ sets      ┆ optional_members ┆ result             │
    │ ---       ┆ ---              ┆ ---                │
    │ list[i64] ┆ list[i64]        ┆ list[bool]         │
    ╞═══════════╪══════════════════╪════════════════════╡
    │ [1, 2, 3] ┆ [1, 2, 3]        ┆ [true, true, true] │
    │ [1, 2]    ┆ [1, 2]           ┆ [true, true]       │
    │ [9, 10]   ┆ [9, 0]           ┆ [true, false]      │
    └───────────┴──────────────────┴────────────────────┘