lazy-evaluation python-polars

Select columns in a Polars LazyFrame based on a condition without collect?


We often want to remove columns from a LazyFrame that don't meet a condition or threshold evaluated over that column (variance, number of missing values, number of unique values). It's possible to evaluate a condition over a LazyFrame columnwise, collect the result, and pass the matching column names as a list back to the same LazyFrame (see this question). Is it possible to do this without evaluating an intermediate result?

A toy example would be to select only the columns that have 10 or more unique values. I can do this following the example from the linked question:

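import polars as pl

# ldf is an existing pl.LazyFrame
threshold = 10
df = ldf.select(
    ldf.select(pl.all().n_unique())  # one row: unique count per column
    .unpivot()  # long format with columns "variable" and "value"
    .filter(pl.col("value") >= threshold)
    .select("variable")
    .collect() # this evaluates the condition over the dataframe
    .to_series()
    .to_list()
).collect()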

I would like to do this with only one collect() statement at the end.


Solution

  • This is impossible without a collect. With LazyFrames you are building a computation graph, and every node in that graph has a schema that is known before the query runs.

    The schema cannot be known up front if the columns you select depend on the result of running the query.

    In short, you have to collect the condition first and then continue lazily from that point.