python python-polars

Create column from other columns created within same `with_columns` context


Here, column "AB" is created and, in the same call, used as input to create column "ABC". This fails.

df = df.with_columns(
  (pl.col("A")+pl.col("B")).alias("AB"),
  (pl.col("AB")+pl.col("C")).alias("ABC")
) 

The only way to achieve the desired result is a second call to with_columns.

df1 = df.with_columns(
  (pl.col("A")+pl.col("B")).alias("AB")
)
df2 = df1.with_columns(
  (pl.col("AB")+pl.col("C")).alias("ABC")
) 

Solution

  • Underlying Problem

    In general, all expressions within a context (with_columns, select, filter, group_by) are evaluated in parallel against the input frame. In particular, columns created within the same context are not yet available to the other expressions in that context.

    Solution

    Still, you can avoid writing a large expression multiple times by saving it to a variable.

    import polars as pl
    
    df = pl.DataFrame({
        "a": [1],
        "b": [2],
        "c": [3],
    })
    
    ab_expr = pl.col("a") + pl.col("b")
    df.with_columns(
        ab_expr.alias("ab"),
        (ab_expr + pl.col("c")).alias("abc"),
    )
    
    shape: (1, 5)
    ┌─────┬─────┬─────┬─────┬─────┐
    │ a   ┆ b   ┆ c   ┆ ab  ┆ abc │
    │ --- ┆ --- ┆ --- ┆ --- ┆ --- │
    │ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
    ╞═════╪═════╪═════╪═════╪═════╡
    │ 1   ┆ 2   ┆ 3   ┆ 3   ┆ 6   │
    └─────┴─────┴─────┴─────┴─────┘
    

    Note that Polars' query optimizer recognizes the shared subexpression (common subexpression elimination), so the computation doesn't necessarily happen twice. This can be checked as follows.

    ab_expr = pl.col("a") + pl.col("b")
    (
        df
        .lazy()
        .with_columns(
            ab_expr.alias("ab"),
            (ab_expr + pl.col("c")).alias("abc"),
        )
        .explain()
    )
    
    simple π 5/6 ["a", "b", "c", "ab", "abc"]
       WITH_COLUMNS:
       [col("__POLARS_CSER_0xd4acad4332698399").alias("ab"), [(col("__POLARS_CSER_0xd4acad4332698399")) + (col("c"))].alias("abc")] 
         WITH_COLUMNS:
         [[(col("a")) + (col("b"))].alias("__POLARS_CSER_0xd4acad4332698399")] 
          DF ["a", "b", "c"]; PROJECT */3 COLUMNS
    

    In particular, Polars is aware of the subexpression __POLARS_CSER_0xd4acad4332698399 shared between the two expressions.

    Syntactic Sugar (?)

    Moreover, the walrus operator (:=) can be used to do the variable assignment within the context.

    df.with_columns(
        (ab_expr := pl.col("a") + pl.col("b")).alias("ab"),
        (ab_expr + pl.col("c")).alias("abc"),
    )