Here, column "AB" is created and, within the same call, used as input to create column "ABC". This fails because "AB" does not exist yet when the second expression is evaluated.
df = df.with_columns(
    (pl.col("A") + pl.col("B")).alias("AB"),
    (pl.col("AB") + pl.col("C")).alias("ABC"),
)
To reference the new column by name, you need a second call to with_columns.
df1 = df.with_columns(
    (pl.col("A") + pl.col("B")).alias("AB")
)
df2 = df1.with_columns(
    (pl.col("AB") + pl.col("C")).alias("ABC")
)
In general, all expressions within a context (with_columns, select, filter, group_by) are evaluated in parallel. In particular, columns created within a context are not available to the other expressions in that same context.
Still, you can avoid writing a large expression multiple times by saving it to a variable.
import polars as pl
df = pl.DataFrame({
    "a": [1],
    "b": [2],
    "c": [3],
})
ab_expr = pl.col("a") + pl.col("b")
df.with_columns(
    ab_expr.alias("ab"),
    (ab_expr + pl.col("c")).alias("abc"),
)
shape: (1, 5)
┌─────┬─────┬─────┬─────┬─────┐
│ a   ┆ b   ┆ c   ┆ ab  ┆ abc │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 3   ┆ 3   ┆ 6   │
└─────┴─────┴─────┴─────┴─────┘
Note that Polars' query optimizer detects the shared subexpression, so the computation does not necessarily happen twice. This can be checked as follows.
ab_expr = pl.col("a") + pl.col("b")
(
    df
    .lazy()
    .with_columns(
        ab_expr.alias("ab"),
        (ab_expr + pl.col("c")).alias("abc"),
    )
    .explain()
)
simple π 5/6 ["a", "b", "c", "ab", "abc"]
  WITH_COLUMNS:
  [col("__POLARS_CSER_0xd4acad4332698399").alias("ab"), [(col("__POLARS_CSER_0xd4acad4332698399")) + (col("c"))].alias("abc")]
    WITH_COLUMNS:
    [[(col("a")) + (col("b"))].alias("__POLARS_CSER_0xd4acad4332698399")]
      DF ["a", "b", "c"]; PROJECT */3 COLUMNS
In particular, Polars is aware of the subexpression __POLARS_CSER_0xd4acad4332698399 shared between the two expressions.
Moreover, the walrus operator (:=) can be used to do the variable assignment within the context.
df.with_columns(
    (ab_expr := pl.col("a") + pl.col("b")).alias("ab"),
    (ab_expr + pl.col("c")).alias("abc"),
)