Is there a way to filter a polars df given an external variable? I currently do it this way:
_df = (_df.with_columns(...)...)
_variable = 1
if _variable==1:
    _df = _df.filter(pl.col("date")=="2020-02-01")
else:
    _df = _df.filter(pl.col("seller")=="Markets")
Is there a way to apply this filter inside the first piece of code? Something like:
_variable = 1
_df = (_df
    .with_columns(...)
    .filter(pl.col("date")=="2020-02-01") if _variable==1 else
    .filter(pl.col("seller")=="Markets")
    .sort(...)
    ...)
I ask this because I read that polars works better if you do everything at once instead of working as if it was a pandas dataframe
There's an implicit question here: is it better to chain a filter operation than to reassign the df after a with_columns? Not really.
"I read that polars works better if you do everything at once instead of working as if it was a pandas dataframe"
It's true that in polars we don't want to do something like this:
df = df.with_columns(c=pl.col('a') + pl.col('b'))
df = df.with_columns(d=pl.col('a') * pl.col('b'))
Instead, we want to do:
df = df.with_columns(
    c=pl.col('a') + pl.col('b'),
    d=pl.col('a') * pl.col('b')
)
The reason is that polars can run all the expressions you give it in a single call in parallel. If you chain multiple select or with_columns calls, you make it run each of those one after another rather than in parallel.
Bringing this back to the filter question, there's not a penalty for doing things like
_df = (_df.with_columns(...)...)
_variable = 1
if _variable==1:
    _df = _df.filter(pl.col("date")=="2020-02-01")
else:
    _df = _df.filter(pl.col("seller")=="Markets")
instead of doing all the operations in a single chain. If you want to do it in a single chain anyway, it would look like this:
_variable = 1
_df = (_df
    .with_columns(...)
    .filter((pl.col("date")=="2020-02-01") if _variable==1 else
            (pl.col("seller")=="Markets"))
    .sort(...)
    ...)
The trick is to put the conditional inside the filter call, so that it picks which expression gets passed to filter. Just make sure to put each filter expression in parentheses.