[SOLVED] Cumulative sum that resets when turning negative/positive

Cumulative sum that resets when turning negative/positive

[enter image description here]

I am trying to add a column (column C) to my polars dataframe that counts how many times a value of one of the dataframe's columns (column A) is greater/less than the value of another column (column B). Once the value turns from less/greater to greater/less the cumulative sum should reset and start counting from 1/-1 again.

Solution

The data

I'm going to change the data in the example you provided.

df = pl.DataFrame(
    {
        "a": [11, 10, 10, 10, 9, 8, 8, 8, 8, 8, 15, 15, 15],
        "b": [11, 9, 9, 9, 9, 9, 10, 8, 8, 10, 11, 11, 15],
    }
)
print(df)

shape: (13, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 11  ┆ 11  │
│ 10  ┆ 9   │
│ 10  ┆ 9   │
│ 10  ┆ 9   │
│ 9   ┆ 9   │
│ 8   ┆ 9   │
│ 8   ┆ 10  │
│ 8   ┆ 8   │
│ 8   ┆ 8   │
│ 8   ┆ 10  │
│ 15  ┆ 11  │
│ 15  ┆ 11  │
│ 15  ┆ 15  │
└─────┴─────┘

Notice the cases where the two columns are the same. Your post didn't address what to do in these cases, so I made some assumptions as to what should happen. (You can adapt the code to handle those cases differently.)

The algorithm

df = (
    df
    .with_columns((pl.col("a") - pl.col("b")).sign().alias("sign_a_minus_b"))
    .with_columns(
        pl.when(pl.col("sign_a_minus_b") == 0)
        .then(None)
        .otherwise(pl.col("sign_a_minus_b"))
        .forward_fill()
        .alias("run_type")
    )
    .with_columns(
        (pl.col("run_type") != pl.col("run_type").shift(1, fill_value=0))
        .cum_sum()
        .alias("run_id")
    )
    .with_columns(pl.col("sign_a_minus_b").cum_sum().over("run_id").alias("result"))
)
print(df)

shape: (13, 6)
┌─────┬─────┬────────────────┬──────────┬────────┬────────┐
│ a   ┆ b   ┆ sign_a_minus_b ┆ run_type ┆ run_id ┆ result │
│ --- ┆ --- ┆ ---            ┆ ---      ┆ ---    ┆ ---    │
│ i64 ┆ i64 ┆ i64            ┆ i64      ┆ u32    ┆ i64    │
╞═════╪═════╪════════════════╪══════════╪════════╪════════╡
│ 11  ┆ 11  ┆ 0              ┆ null     ┆ null   ┆ 0      │
│ 10  ┆ 9   ┆ 1              ┆ 1        ┆ null   ┆ 1      │
│ 10  ┆ 9   ┆ 1              ┆ 1        ┆ 0      ┆ 1      │
│ 10  ┆ 9   ┆ 1              ┆ 1        ┆ 0      ┆ 2      │
│ 9   ┆ 9   ┆ 0              ┆ 1        ┆ 0      ┆ 2      │
│ 8   ┆ 9   ┆ -1             ┆ -1       ┆ 1      ┆ -1     │
│ 8   ┆ 10  ┆ -1             ┆ -1       ┆ 1      ┆ -2     │
│ 8   ┆ 8   ┆ 0              ┆ -1       ┆ 1      ┆ -2     │
│ 8   ┆ 8   ┆ 0              ┆ -1       ┆ 1      ┆ -2     │
│ 8   ┆ 10  ┆ -1             ┆ -1       ┆ 1      ┆ -3     │
│ 15  ┆ 11  ┆ 1              ┆ 1        ┆ 2      ┆ 1      │
│ 15  ┆ 11  ┆ 1              ┆ 1        ┆ 2      ┆ 2      │
│ 15  ┆ 15  ┆ 0              ┆ 1        ┆ 2      ┆ 2      │
└─────┴─────┴────────────────┴──────────┴────────┴────────┘

I've left the intermediate calculations in the output, merely to show how the algorithm works. (You can drop them.)

The basic idea is to calculate a run_id for each run of positive or negative values. We will then use the cumsum function and the over windowing expression to create a running count of positives/negatives over each run_id.

Key assumption: ties in columns a and b do not interrupt a run, but they do not contribute to the total for that run of positive/negative values.

sign_a_minus_b does two things: it identifies whether a run is positive/negative, and whether there is a tie in columns a and b.

run_type extends any run to include any cases where a tie occurs in columns a and b. The null value at the top of the column was intended - it shows what happens when a tie occurs in the first row.

result is the output column. Note that tied columns do not interrupt a run, but they don't contribute to the totals for that run.

One final note: if ties in columns a and b are not allowed, then this algorithm can be simplified ... and run faster.