[enter image description here]
I am trying to add a column (column C) to my polars dataframe that counts how many times a value of one of the dataframe's columns (column A) is greater/less than the value of another column (column B). Once the value turns from less/greater to greater/less the cumulative sum should reset and start counting from 1/-1 again.
I'm going to change the data in the example you provided.
df = pl.DataFrame(
{
"a": [11, 10, 10, 10, 9, 8, 8, 8, 8, 8, 15, 15, 15],
"b": [11, 9, 9, 9, 9, 9, 10, 8, 8, 10, 11, 11, 15],
}
)
print(df)
shape: (13, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 11 ┆ 11 │
│ 10 ┆ 9 │
│ 10 ┆ 9 │
│ 10 ┆ 9 │
│ 9 ┆ 9 │
│ 8 ┆ 9 │
│ 8 ┆ 10 │
│ 8 ┆ 8 │
│ 8 ┆ 8 │
│ 8 ┆ 10 │
│ 15 ┆ 11 │
│ 15 ┆ 11 │
│ 15 ┆ 15 │
└─────┴─────┘
Notice the cases where the two columns are the same. Your post didn't address what to do in these cases, so I made some assumptions as to what should happen. (You can adapt the code to handle those cases differently.)
df = (
df
.with_columns((pl.col("a") - pl.col("b")).sign().alias("sign_a_minus_b"))
.with_columns(
pl.when(pl.col("sign_a_minus_b") == 0)
.then(None)
.otherwise(pl.col("sign_a_minus_b"))
.forward_fill()
.alias("run_type")
)
.with_columns(
(pl.col("run_type") != pl.col("run_type").shift(1, fill_value=0))
.cum_sum()
.alias("run_id")
)
.with_columns(pl.col("sign_a_minus_b").cum_sum().over("run_id").alias("result"))
)
print(df)
shape: (13, 6)
┌─────┬─────┬────────────────┬──────────┬────────┬────────┐
│ a ┆ b ┆ sign_a_minus_b ┆ run_type ┆ run_id ┆ result │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ u32 ┆ i64 │
╞═════╪═════╪════════════════╪══════════╪════════╪════════╡
│ 11 ┆ 11 ┆ 0 ┆ null ┆ null ┆ 0 │
│ 10 ┆ 9 ┆ 1 ┆ 1 ┆ null ┆ 1 │
│ 10 ┆ 9 ┆ 1 ┆ 1 ┆ 0 ┆ 1 │
│ 10 ┆ 9 ┆ 1 ┆ 1 ┆ 0 ┆ 2 │
│ 9 ┆ 9 ┆ 0 ┆ 1 ┆ 0 ┆ 2 │
│ 8 ┆ 9 ┆ -1 ┆ -1 ┆ 1 ┆ -1 │
│ 8 ┆ 10 ┆ -1 ┆ -1 ┆ 1 ┆ -2 │
│ 8 ┆ 8 ┆ 0 ┆ -1 ┆ 1 ┆ -2 │
│ 8 ┆ 8 ┆ 0 ┆ -1 ┆ 1 ┆ -2 │
│ 8 ┆ 10 ┆ -1 ┆ -1 ┆ 1 ┆ -3 │
│ 15 ┆ 11 ┆ 1 ┆ 1 ┆ 2 ┆ 1 │
│ 15 ┆ 11 ┆ 1 ┆ 1 ┆ 2 ┆ 2 │
│ 15 ┆ 15 ┆ 0 ┆ 1 ┆ 2 ┆ 2 │
└─────┴─────┴────────────────┴──────────┴────────┴────────┘
I've left the intermediate calculations in the output, merely to show how the algorithm works. (You can drop them.)
The basic idea is to calculate a run_id
for each run of positive or negative values. We will then use the cumsum
function and the over
windowing expression to create a running count of positives/negatives over each run_id
.
Key assumption: ties in columns a
and b
do not interrupt a run, but they do not contribute to the total for that run of positive/negative values.
sign_a_minus_b
does two things: it identifies whether a run is positive/negative, and whether there is a tie in columns a
and b
.
run_type
extends any run to include any cases where a tie occurs in columns a
and b
. The null
value at the top of the column was intended - it shows what happens when a tie occurs in the first row.
result
is the output column. Note that tied columns do not interrupt a run, but they don't contribute to the totals for that run.
One final note: if ties in columns a
and b
are not allowed, then this algorithm can be simplified ... and run faster.