In Python Polars, I have a dataframe like the below:
import polars as pl
df = pl.DataFrame(
    {"sets": [[1, 4, 3], [1, 0]], "optional_members": [[1, 2, 3], [1, 2]]}
)
shape: (2, 2)
┌───────────┬──────────────────┐
│ sets      ┆ optional_members │
│ ---       ┆ ---              │
│ list[i64] ┆ list[i64]        │
╞═══════════╪══════════════════╡
│ [1, 4, 3] ┆ [1, 2, 3]        │
│ [1, 0]    ┆ [1, 2]           │
└───────────┴──────────────────┘
I would like to build an expression that tells me which elements of the first column are in the second, keeping the shape of the former, i.e.:
shape: (2, 3)
┌───────────┬──────────────────┬─────────────────────┐
│ sets      ┆ optional_members ┆ result              │
│ ---       ┆ ---              ┆ ---                 │
│ list[i64] ┆ list[i64]        ┆ list[bool]          │
╞═══════════╪══════════════════╪═════════════════════╡
│ [1, 4, 3] ┆ [1, 2, 3]        ┆ [true, false, true] │
│ [1, 0]    ┆ [1, 2]           ┆ [true, false]       │
└───────────┴──────────────────┴─────────────────────┘
I have tried using eval over the first list, something like:
func = lambda x, y: y.list.contains(x)
df.with_columns(
    contains=pl.col("optional_members").list.eval(
        func(pl.element(), pl.col("optional_members"))
    )
)
But the pl.col() expression cannot be used inside an eval.
How could we address this while keeping the solution in a single expression?
Thanks to @roman's comment, a point needs to be made: the check should be done regardless of position.
If you need to compare elements at the same position:
df.with_columns(
    (pl.col.sets.list.explode() == pl.col.optional_members.list.explode())
    .implode()
    .over(pl.int_range(pl.len()))
    .alias("result")
)
shape: (3, 3)
┌───────────┬──────────────────┬────────────────────┐
│ sets      ┆ optional_members ┆ result             │
│ ---       ┆ ---              ┆ ---                │
│ list[i64] ┆ list[i64]        ┆ list[bool]         │
╞═══════════╪══════════════════╪════════════════════╡
│ [1, 2, 3] ┆ [1, 2, 3]        ┆ [true, true, true] │
│ [1, 2]    ┆ [1, 2]           ┆ [true, true]       │
│ [9, 10]   ┆ [9, 0]           ┆ [true, false]      │
└───────────┴──────────────────┴────────────────────┘
If you need to compare elements regardless of position then it's a bit more complicated:
df.with_columns(
    pl.col.sets.explode().is_in(pl.col.optional_members.explode())
    .implode()
    .over(pl.int_range(pl.len()))
    .alias("result")
)
shape: (3, 3)
┌───────────┬──────────────────┬────────────────────┐
│ sets      ┆ optional_members ┆ result             │
│ ---       ┆ ---              ┆ ---                │
│ list[i64] ┆ list[i64]        ┆ list[bool]         │
╞═══════════╪══════════════════╪════════════════════╡
│ [1, 2, 3] ┆ [1, 2, 3]        ┆ [true, true, true] │
│ [1, 2]    ┆ [1, 2]           ┆ [true, true]       │
│ [9, 10]   ┆ [9, 0]           ┆ [true, false]      │
└───────────┴──────────────────┴────────────────────┘
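The sample frame above happens to give the same output under both readings, so a small plain-Python reference may help show where they differ. This is only a sketch of the intended semantics, not Polars code: the function names `same_position` and `any_position` and the second sample row are made up for illustration.

```python
def same_position(sets, optional_members):
    """Element-wise equality; zip() stops at the shorter list of each pair."""
    return [
        [a == b for a, b in zip(s, o)]
        for s, o in zip(sets, optional_members)
    ]


def any_position(sets, optional_members):
    """Membership test: is each element of `sets` anywhere in `optional_members`?"""
    result = []
    for s, o in zip(sets, optional_members):
        allowed = set(o)  # build the lookup set once per row
        result.append([x in allowed for x in s])
    return result


# First row is taken from the question; the second row is hypothetical.
sets = [[1, 4, 3], [2, 1]]
optional_members = [[1, 2, 3], [1, 2]]

positional = same_position(sets, optional_members)
membership = any_position(sets, optional_members)
print(positional)  # [[True, False, True], [False, False]]
print(membership)  # [[True, False, True], [True, True]]
```

Only the second row separates the two: `[2, 1]` and `[1, 2]` hold the same values in a different order, so the positional check fails while the membership check succeeds.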
If your lists are not very long, or if all the lists are the same length, you can also try pl.Expr.list.get():
# m is the length of the longest list in "sets"
m = df.select(pl.col.sets.list.len().max()).item()
df.with_columns(
    pl.concat_list(
        pl.col.optional_members.list.contains(pl.col.sets.list.get(i, null_on_oob=True))
        for i in range(m)
    ).list.head(pl.col.sets.list.len())  # trim entries beyond each row's own length
    .alias("result")
)
shape: (3, 3)
┌───────────┬──────────────────┬────────────────────┐
│ sets      ┆ optional_members ┆ result             │
│ ---       ┆ ---              ┆ ---                │
│ list[i64] ┆ list[i64]        ┆ list[bool]         │
╞═══════════╪══════════════════╪════════════════════╡
│ [1, 2, 3] ┆ [1, 2, 3]        ┆ [true, true, true] │
│ [1, 2]    ┆ [1, 2]           ┆ [true, true]       │
│ [9, 10]   ┆ [9, 0]           ┆ [true, false]      │
└───────────┴──────────────────┴────────────────────┘