python dataframe filter python-polars

Filter polars DataFrame given an external condition


Is there a way to filter a polars df given an external variable? I currently do it this way:

_df = (_df.with_columns(...)...)
_variable = 1
if _variable==1:
    _df = _df.filter(pl.col("date")=="2020-02-01")
else:
    _df = _df.filter(pl.col("seller")=="Markets")

Is there a way to put this conditional filter inside the first chain? Something like:

_variable = 1
_df = (_df
       .with_columns(...)
       .filter(pl.col("date")=="2020-02-01") if _variable==1 else 
       .filter(pl.col("seller")=="Markets")
       .sort(...)
       ...)

I ask this because I read that polars works better if you do everything at once instead of working as if it was a pandas dataframe


Solution

  • There's an implicit question here: is it better to chain a filter operation than to reassign the df after a with_columns? Not really.

    I read that polars works better if you do everything at once instead of working as if it was a pandas dataframe

    It's true that in polars we don't want to do something like this:

    df = df.with_columns(c=pl.col('a') + pl.col('b'))
    df = df.with_columns(d=pl.col('a') * pl.col('b'))
    

    Instead, we want to do this:

    df = df.with_columns(
             c=pl.col('a') + pl.col('b'),
             d=pl.col('a') * pl.col('b')
          )
    

    The reason is that all the expressions you pass in a single call are run in parallel. If you chain multiple select or with_columns calls, you make polars run each call one after another rather than in parallel.
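
    As a rough check that the two forms differ only in how the work is scheduled and not in what they return, here is a small sketch. The sample data is made up for illustration; the column names a, b, c and d follow the snippets above.

    ```python
    import polars as pl

    # Hypothetical sample data; column names follow the snippets above.
    df = pl.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

    # Two separate calls: the second starts only after the first has finished.
    chained = df.with_columns(c=pl.col("a") + pl.col("b")).with_columns(
        d=pl.col("a") * pl.col("b")
    )

    # One call: both expressions are handed to polars together.
    single = df.with_columns(
        c=pl.col("a") + pl.col("b"),
        d=pl.col("a") * pl.col("b"),
    )

    # Compare the plain-Python contents of both frames.
    same = single.to_dict(as_series=False) == chained.to_dict(as_series=False)
    print(same)
    ```

    Both frames hold the same columns and values, so the single call is a drop-in replacement whenever the new columns don't depend on each other.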

    Bringing this back to the filter question, there's no penalty for doing things like

    _df = (_df.with_columns(...)...)
    _variable = 1
    if _variable==1:
        _df = _df.filter(pl.col("date")=="2020-02-01")
    else:
        _df = _df.filter(pl.col("seller")=="Markets")
    

    instead of doing all the operations in a single chain. If you want a single chain anyway, it would look like this:

    _variable = 1
    _df = (_df
           .with_columns(...)
           .filter((pl.col("date")=="2020-02-01") if _variable==1 else 
                   (pl.col("seller")=="Markets"))
           .sort(...)
           ...)
    

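    The trick is to put the condition inside the filter call: the Python conditional expression picks one of the two polars expressions, and that expression is what filter receives. Just make sure to wrap each filter expression in parentheses.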
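
    For a version you can run end to end, here is a minimal sketch of the same idea. The sample data, the helper name filter_by_variable and the extra column double are hypothetical stand-ins for the elided with_columns(...) and sort(...) arguments; date is kept as a string column so it can be compared with "2020-02-01" as in the question.

    ```python
    import polars as pl

    # Hypothetical sample data, made up for illustration.
    df = pl.DataFrame(
        {
            "date": ["2020-02-01", "2020-02-02", "2020-02-01"],
            "seller": ["Markets", "Markets", "Shops"],
            "amount": [10, 20, 30],
        }
    )


    def filter_by_variable(frame, variable):
        """Hypothetical helper: one chain, filter chosen by an external value."""
        return (
            frame
            .with_columns(double=pl.col("amount") * 2)
            # Python evaluates the conditional first, so filter() receives
            # exactly one polars expression.
            .filter(
                (pl.col("date") == "2020-02-01")
                if variable == 1
                else (pl.col("seller") == "Markets")
            )
            .sort("amount")
        )


    by_date = filter_by_variable(df, 1)    # keeps rows dated 2020-02-01
    by_seller = filter_by_variable(df, 2)  # keeps rows sold by "Markets"
    print(by_date)
    print(by_seller)
    ```

    If the conditional makes the chain hard to read, assigning the chosen expression to a variable first and passing that variable to filter works the same way.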