I'm trying to account for room usage only during business hours and abridge an event duration if it runs past the end of business hours.
I have a dataframe like this:
import polars as pl
from datetime import datetime
df = pl.DataFrame({
    'name': 'foo',
    'start': datetime.fromisoformat('2025-01-01 08:00:00'),
    'end': datetime.fromisoformat('2025-01-01 18:00:00'),  # ends after business hours
    'business_end': datetime.fromisoformat('2025-01-01 17:00:00')
})
I want to create a duration column that is measured from start to end, unless end is after business_end, in which case it should be measured from start to business_end instead. For this, I tried the following:
df.with_columns(
    duration=pl.col("end") - pl.col("start")
    if pl.col("end") <= pl.col("business_end")
    else pl.col("business_end") - pl.col("start")
)
This gives an error:
TypeError: the truth value of an Expr is ambiguous
Thoughts about how to produce the desired row from the conditional?
I can use filter() to find rows where the event ends after business hours, create a frame of those, replace the end time value, merge back in, etc., but I was hoping to keep the original data and only add a new column.
Use when/then/otherwise instead of if/else:
df.with_columns(
    duration=pl.when(pl.col("end") <= pl.col("business_end"))
    .then(pl.col("end") - pl.col("start"))
    .otherwise(pl.col("business_end") - pl.col("start"))
)
Polars works with expressions inside contexts. What does that mean?
Contexts are your with_columns, select, group_by, agg, etc.
The inputs to contexts are expressions. Expressions usually start with pl.col() or pl.lit(). They have lots of methods that also return expressions, which makes them chainable.
The thing about expressions is that they don't have values; they're just instructions. One way to see that clearly is to assign an expression to a normal variable, like end = pl.col("end"). You can do that without any DataFrames existing. Once you have a df, you can use that expression in a context: df.select(end). When the select context gets the expression pl.col("end"), that's when it'll go fetch the column. You could also make a more complicated expression like my_sum = (pl.col("a") * 2 + pl.col("b").pow(3)) and then even chain off of it: df.select(my_sum * 2 + 5).
Now, getting back to the if: because pl.col("end") doesn't have any values associated with it, Python can't evaluate if pl.col("end") <= pl.col("other"), which is why you're getting that error. Python's if needs a single True or False and can't be overloaded to work row by row, so you just can't use it to branch on column values inside a context.
Instead, you can use the when/then/otherwise construct.