
Python-Polars: Expression to get length of lists in a struct


In Python Polars, I am trying to extract the length of the lists inside a struct to re-use it in an expression.

For example, I have the code below:

import polars as pl


df = pl.DataFrame(
    {
        "x": [0, 4],
        "y": [
            {"low": [-1, 0, 1], "up": [1, 2, 3]},
            {"low": [-2, -1, 0], "up": [0, 1, 2]},
        ],
    }
)

df.with_columns(
    check=pl.concat_list([pl.all_horizontal(
        [
            pl.col("x").ge(pl.col("y").struct["low"].list.get(i)),
            pl.col("x").le(pl.col("y").struct["up"].list.get(i)),
        ]
    ) for i in range(3)]).list.max()
)

shape: (2, 3)
┌─────┬─────────────────────────┬───────┐
│ x   ┆ y                       ┆ check │
│ --- ┆ ---                     ┆ ---   │
│ i64 ┆ struct[2]               ┆ bool  │
╞═════╪═════════════════════════╪═══════╡
│ 0   ┆ {[-1, 0, 1],[1, 2, 3]}  ┆ true  │
│ 4   ┆ {[-2, -1, 0],[0, 1, 2]} ┆ false │
└─────┴─────────────────────────┴───────┘

I would like to infer the length of the lists at run time (i.e. without hardcoding the 3), since it can change from call to call.

The challenge is that everything has to stay in the same expression context. I tried the version below, but it does not work: range() needs a plain Python integer, whereas pl.col("y").struct["low"].list.len() is an expression whose value I cannot extract at that point:

df.with_columns(
    check=pl.concat_list([pl.all_horizontal(
        [
            pl.col("x").ge(pl.col("y").struct["low"].list.get(i)),
            pl.col("x").le(pl.col("y").struct["up"].list.get(i)),
        ]
    ) for i in range(pl.col("y").struct["low"].list.len())]).list.max()
)

Solution

  • Unfortunately, I don't see a way to use an expression for the list length here. Also, direct comparisons of list columns are not yet natively supported.

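    Still, you can explode and implode the list columns on the fly to get the desired result without knowing the list lengths upfront. The window over(pl.int_range(pl.len())) puts every row in its own group, so each imploded list lands back on the row it came from. Note that check is then a list of per-element booleans rather than the single boolean shown in the question.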

    (
        df
        .with_columns(
            ge_low=(pl.col("x") >= pl.col("y").struct["low"].explode()).implode().over(pl.int_range(pl.len())),
            le_up=(pl.col("x") <= pl.col("y").struct["up"].explode()).implode().over(pl.int_range(pl.len())),
        )
        .with_columns(
            check=(pl.col("ge_low").explode() & pl.col("le_up").explode()).implode().over(pl.int_range(pl.len()))
        )
    )
    
    shape: (2, 5)
    ┌─────┬─────────────────────────┬─────────────────────┬───────────────────────┬───────────────────────┐
    │ x   ┆ y                       ┆ ge_low              ┆ le_up                 ┆ check                 │
    │ --- ┆ ---                     ┆ ---                 ┆ ---                   ┆ ---                   │
    │ i64 ┆ struct[2]               ┆ list[bool]          ┆ list[bool]            ┆ list[bool]            │
    ╞═════╪═════════════════════════╪═════════════════════╪═══════════════════════╪═══════════════════════╡
    │ 0   ┆ {[-1, 0, 1],[1, 2, 3]}  ┆ [true, true, false] ┆ [true, true, true]    ┆ [true, true, false]   │
    │ 4   ┆ {[-2, -1, 0],[0, 1, 2]} ┆ [true, true, true]  ┆ [false, false, false] ┆ [false, false, false] │
    └─────┴─────────────────────────┴─────────────────────┴───────────────────────┴───────────────────────┘