When I use a horizontal cumulative sum followed by unnest on a LazyFrame, a "literal" column appears in the schema, and collecting fails with a ColumnNotFoundError even after I drop that column.
Here is an example:
import polars as pl


def test_literal_bug():
    print("Polars version:", pl.__version__)

    # Create simple test data
    df = pl.DataFrame({
        "A": [1, 2, 3],
        "T0": [0.1, 0.2, 0.3],
        "T1": [0.4, 0.5, 0.6],
        "T2": [0.7, 0.8, 0.9],
    })
    time_cols = ["T0", "T1", "T2"]
    print("Original columns:", df.columns)
    print("Time columns:", time_cols)

    lazy_df = df.lazy()
    print("Schema before cumsum:", lazy_df.collect_schema().names())

    result = (
        lazy_df.select(pl.cum_sum_horizontal(time_cols))
        .unnest("cum_sum")
        .rename({col: f"C{col}" for col in time_cols})
    )
    print("Schema after cumsum:", result.collect_schema().names())

    # This will fail with: ColumnNotFoundError: "literal" not found
    try:
        collected = result.collect()
        print("v1: No bug reproduced")
    except pl.exceptions.ColumnNotFoundError as e:
        print(f"v1: BUG REPRODUCED: {e}")

    result_2 = result.drop("literal")
    result_2 = pl.concat([pl.LazyFrame({"B": [1, 2, 3]}), result_2], how="horizontal")
    print("Schema after drop and concat:", result_2.collect_schema().names())

    try:
        collected_2 = result_2.collect()
        print("v2: No bug reproduced")
    except pl.exceptions.ColumnNotFoundError as e:
        print(f"v2: BUG REPRODUCED: {e}")


if __name__ == "__main__":
    test_literal_bug()
Output:
Polars version: 1.31.0
Original columns: ['A', 'T0', 'T1', 'T2']
Time columns: ['T0', 'T1', 'T2']
Schema before cumsum: ['A', 'T0', 'T1', 'T2']
Schema after cumsum: ['CT0', 'CT1', 'CT2', 'literal']
v1: BUG REPRODUCED: "literal" not found
Schema after drop and concat: ['B', 'CT0', 'CT1', 'CT2']
v2: BUG REPRODUCED: "literal" not found
What is going on? Am I doing something wrong or is it a bug?
It is a bug.
Polars 1.32.0 has been released and contains a fix: https://github.com/pola-rs/polars/pull/23686
However, I think a new "bug" may have been introduced.
lazy_df.select(pl.cum_sum_horizontal(time_cols)).collect()
# InvalidOperationError: cannot add columns: dtype was not list on all nesting levels:
# (left: list[str], right: f64)
It seems you cannot pass a list of names now and must unpack them:
lazy_df.select(pl.cum_sum_horizontal(*time_cols)).collect()
shape: (3, 1)
┌───────────────┐
│ cum_sum       │
│ ---           │
│ struct[3]     │
╞═══════════════╡
│ {0.1,0.5,1.2} │
│ {0.2,0.7,1.5} │
│ {0.3,0.9,1.8} │
└───────────────┘
The cum_sum_horizontal source code shows that it is a wrapper around cum_fold(), which does work with a list on 1.32.0:
(lazy_df
    .select(pl.cum_fold(0, lambda x, y: x + y, time_cols))
    .unnest(pl.all())
    .collect()
)
shape: (3, 3)
┌─────┬─────┬─────┐
│ T0  ┆ T1  ┆ T2  │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 │
╞═════╪═════╪═════╡
│ 0.1 ┆ 0.5 ┆ 1.2 │
│ 0.2 ┆ 0.7 ┆ 1.5 │
│ 0.3 ┆ 0.9 ┆ 1.8 │
└─────┴─────┴─────┘