pythondataframewindow-functionspython-polars

Performing cumulative sum in a Window with different columns ordering and null last configuration in Polars


Polars version: 1.25.2

I have a dataframe:

from datetime import date

test_df = pl.DataFrame([
    ("A", None, date(2009, 1, 24), 1),
    ("A", date(2010, 3, 24), date(2013, 1, 24), 1),
    ("A", date(2018, 11, 26), None, 1)
], schema=["col1", "col2", "col3", "col4"])

This is the output that I want:

┌──────┬────────────┬────────────┬──────┬──────┐
│ col1 ┆ col2       ┆ col3       ┆ col4 ┆ col5 │
│ ---  ┆ ---        ┆ ---        ┆ ---  ┆ ---  │
│ str  ┆ date       ┆ date       ┆ i64  ┆ i64  │
╞══════╪════════════╪════════════╪══════╪══════╡
│ A    ┆ null       ┆ 2009-01-24 ┆ 1    ┆ 3    │
│ A    ┆ 2010-03-24 ┆ 2013-01-24 ┆ 1    ┆ 2    │
│ A    ┆ 2018-11-26 ┆ null       ┆ 1    ┆ 1    │
└──────┴────────────┴────────────┴──────┴──────┘

I tried the following but not working:

print(test_df.with_columns(
    pl.col("col4").sum().over(
        partition_by=["col1"], 
        order_by=[pl.col("col2").sort(descending=True, nulls_last=True), pl.col("col3").sort(descending=True, nulls_last=False)]
    ).alias("col5")
))

┌──────┬────────────┬────────────┬──────┬──────┐
│ col1 ┆ col2       ┆ col3       ┆ col4 ┆ col5 │
│ ---  ┆ ---        ┆ ---        ┆ ---  ┆ ---  │
│ str  ┆ date       ┆ date       ┆ i64  ┆ i64  │
╞══════╪════════════╪════════════╪══════╪══════╡
│ A    ┆ null       ┆ 2009-01-24 ┆ 1    ┆ 3    │
│ A    ┆ 2010-03-24 ┆ 2013-01-24 ┆ 1    ┆ 3    │
│ A    ┆ 2018-11-26 ┆ null       ┆ 1    ┆ 3    │
└──────┴────────────┴────────────┴──────┴──────┘

I know I can just sort it by col3 in descending order with null lasts to False. However, this is just a sample data and the real dataset is more complex. I'm wondering how to pass different columns setting in a Window.


Edit 1 on 2025-12-11

I realized that it doesn't work not because of using sum() instead of cum_sum(). I think it goes back to my original question. It works accidentally because dataframe has already sorted when I create the dataframe. If I change the order to something like:

test_df = pl.DataFrame([
    ("A", None, date(2009, 1, 24), 1),
    ("A", date(2018, 11, 26), None, 1),
    ("A", date(2010, 3, 24), date(2013, 1, 24), 1),
], schema=["col1", "col2", "col3", "col4"])

┌──────┬────────────┬────────────┬──────┐
│ col1 ┆ col2       ┆ col3       ┆ col4 │
│ ---  ┆ ---        ┆ ---        ┆ ---  │
│ str  ┆ date       ┆ date       ┆ i64  │
╞══════╪════════════╪════════════╪══════╡
│ A    ┆ null       ┆ 2009-01-24 ┆ 1    │
│ A    ┆ 2018-11-26 ┆ null       ┆ 1    │
│ A    ┆ 2010-03-24 ┆ 2013-01-24 ┆ 1    │
└──────┴────────────┴────────────┴──────┘

This time when I apply the same logic:

print(test_df.with_columns(
    pl.col("col4").cum_sum().over(
        partition_by=["col1"], 
        order_by=[pl.col("col2").sort(descending=True, nulls_last=True), pl.col("col3").sort(descending=True, nulls_last=False)]
    ).alias("col5")
))

┌──────┬────────────┬────────────┬──────┬──────┐
│ col1 ┆ col2       ┆ col3       ┆ col4 ┆ col5 │
│ ---  ┆ ---        ┆ ---        ┆ ---  ┆ ---  │
│ str  ┆ date       ┆ date       ┆ i64  ┆ i64  │
╞══════╪════════════╪════════════╪══════╪══════╡
│ A    ┆ null       ┆ 2009-01-24 ┆ 1    ┆ 3    │
│ A    ┆ 2018-11-26 ┆ null       ┆ 1    ┆ 2    │
│ A    ┆ 2010-03-24 ┆ 2013-01-24 ┆ 1    ┆ 1    │
└──────┴────────────┴────────────┴──────┴──────┘

Initially I thought it's because of the window function, but when I try to sort the dataframe only, it still doesn't work:

print(test_df.sort(
    by=["col1", pl.col("col2").sort(descending=True, nulls_last=True), pl.col("col3").sort(descending=True, nulls_last=False)]
))

┌──────┬────────────┬────────────┬──────┐
│ col1 ┆ col2       ┆ col3       ┆ col4 │
│ ---  ┆ ---        ┆ ---        ┆ ---  │
│ str  ┆ date       ┆ date       ┆ i64  │
╞══════╪════════════╪════════════╪══════╡
│ A    ┆ 2010-03-24 ┆ 2013-01-24 ┆ 1    │
│ A    ┆ 2018-11-26 ┆ null       ┆ 1    │
│ A    ┆ null       ┆ 2009-01-24 ┆ 1    │
└──────┴────────────┴────────────┴──────┘

# Expected sorting result
┌──────┬────────────┬────────────┬──────┐
│ col1 ┆ col2       ┆ col3       ┆ col4 │
│ ---  ┆ ---        ┆ ---        ┆ ---  │
│ str  ┆ date       ┆ date       ┆ i64  │
╞══════╪════════════╪════════════╪══════╡
│ A    ┆ null       ┆ 2009-01-24 ┆ 1    │
│ A    ┆ 2010-03-24 ┆ 2013-01-24 ┆ 1    │
│ A    ┆ 2018-11-26 ┆ null       ┆ 1    │
└──────┴────────────┴────────────┴──────┘

Solution

  • I tried this code and it worked for me, what I did first is sorting the dataframe using sort and then calculate the cumulative sum over the group only and then restore the original row order

    import polars as pl
    from datetime import date
    
    test_df = pl.DataFrame(
        [
            ("A", None, date(2009, 1, 24), 1),
            ("A", date(2010, 3, 24), date(2013, 1, 24), 1),
            ("A", date(2018, 11, 26), None, 1),
        ],
        schema=["col1", "col2", "col3", "col4"],
        orient="row",
    )
    
    result = (
        test_df.with_row_index("idx").sort(
            by=["col1", "col2", "col3"],
            descending=[False, True, True],
            nulls_last=[False, True, False],
        ).with_columns(
            pl.col("col4").cum_sum().over("col1").alias("col5")
        ).sort("idx").drop("idx")
    )
    
    print(result)
    

    enter image description here