Filtering selected columns based on column aggregate


I wish to select only columns with fewer than 3 unique values. I can generate a boolean mask via pl.all().n_unique() < 3, but I don't know if I can use that mask via the polars API for this.

Currently, I am solving it via python. Is there a more idiomatic way?

import polars as pl, pandas as pd
df = pl.DataFrame({"col1":[1,1,2], "col2":[1,2,3], "col3":[3,3,3]})
# target is:
# df_few_unique = pl.DataFrame({"col1":[1,1,2], "col3":[3,3,3]})

# my attempt:
mask = df.select(pl.all().n_unique() < 3).to_numpy()[0]
cols = [col for col, m in zip(df.columns, mask) if m]
df_few_unique = df.select(cols)
df_few_unique

Equivalent in pandas:

df_pandas = df.to_pandas()
mask = (df_pandas.nunique() < 3)
df_pandas.loc[:, mask]

Solution

  • The selected answer, though syntactically clean, is inefficient. You can do better.

    Let us first use two filters rather than just one.

    Problem: Select only those columns where the number of unique values is strictly between 1 and 200.

    The thing to consider is that you need a pass over the data no matter what. So, reading it in is the first step.

    Then, if you do

    pl.select(
        [s for s in df
         if s.n_unique() < 200 and s.n_unique() > 1]
    )
    

    You are computing the filters in sequence, one column at a time, and also keeping the results in memory. Htop confirms that this uses just one core of the machine. The ideal solution is to do it all in parallel.

    Let us do a few benchmarks. I am using a 32-core machine. Parallelism would reduce the time further on machines with more cores.

    Set up the DataFrame:

    import polars as pl
    import numpy as np
    df = pl.DataFrame({f'a_{i}':np.random.choice(['a','b','c','d'], 10000000) for i in range(100)})
    

    This takes up about 20 GiB of RAM, so be careful if you want to replicate it.

    Selected solution (htop confirms that this solution uses only one core):

    %%timeit
    _df = pl.select(
        [s for s in df
         if s.n_unique() < 200 and s.n_unique() > 1]
    )
    output:
    18.7 s ± 92.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

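    Let us now try to run the filters in parallel (htop confirms). Note that this expression returns a one-row boolean mask, one value per column, rather than the selected columns themselves: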

    %%timeit
    _df = df.select((pl.all().n_unique() < 200) & (pl.all().n_unique() > 1))
    output:
    1.35 s ± 21.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

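    We are still computing n_unique() twice per column in the expression above. Let us compute it just once by using is_between (parallel execution, htop confirms). Keep in mind that is_between includes both bounds by default, so is_between(1,200) also accepts counts of exactly 1 and 200; in recent polars versions you can pass closed="none" to match the strict inequalities used above.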

    %%timeit
    _df = df.select((pl.all().n_unique().is_between(1,200)))
    output:
    708 ms ± 21.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    Btw, if you don't want to remember APIs like is_between but also don't want to compute n_unique() twice, you can use the lazy API:

    df_lazy = df.lazy()
    

    Now try the two-comparison version again, this time on the lazy frame:

    %%timeit
    _df = df_lazy.select((pl.all().n_unique() < 200) & (pl.all().n_unique() > 1)).collect()
    output:
    718 ms ± 15.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)