I've used Polars for some time now but this is something that often makes me go from Polars DataFrames to native Python calculations. I've spent resonable time looking for solutions that (tries) to use shift(), rolling(), group_by_dynamic() and so on but none is successful.
Do calculation that depends on previous calculation's result that is in the same column.
In Excel this is like the most straighforward formula ever...if the "index" is zero I want to return "A", otherwise I want to return the result from the cell above.
A | B | C | |
---|---|---|---|
1 | Index | Result | Formula for the "Result" column |
2 | 0 | A | =IF(A2=0;"A";B1) |
3 | 1 | A | =IF(A3=0;"A";B2) |
In column "B" the formula refers to the previously calculated values on the same column "B".
# Import Polars module.
import polars as pl
# Create the data.
data = {'Index': [0, 1]}
# Create the DataFrame.
df = pl.from_dict(data)
# Add a column to the DataFrame.
df = df.with_columns(
# Tries to reproduce the Excel formula.
Result = pl.when(
pl.col('Index') == 0
).then(
pl.lit('A')
).otherwise(
pl.col('Result')
)
)
Within the "with_columns()" method the "Result" column cannot be referred because It doens't exist in the DataFrame yet. If we try to do so, we get a ColumnNotFoundError:
Any idea on how can I accomplish such a simple task on Polars?
Thank you,
How can I do a calculation that depends on previous (row's) calculation result that is in the same column?
The short answer is that you can't without falling back into Python. To do this, any library would need to essentially need to iterate over the rows, only calculating a single row at a time. This means any sort of vectorisation is not possible.
Polars offers map_elements
for this use-case, but it is discouraged. From the docs:
This method is much slower than the native expressions API. Only use it if you cannot implement your logic otherwise.
df = pl.DataFrame({'Index': [1, 0, 1, 1, 0]})
previous_result = "Result" # hardcode the header as the intial "previous result"
def f(index):
global previous_result
out = "A" if index == 0 else previous_result
previous_result = out
return out
print(df.with_columns(Result=pl.col("Index").map_elements(f, return_dtype=pl.String)))
# shape: (5, 2)
# ┌───────┬────────┐
# │ Index ┆ Result │
# │ --- ┆ --- │
# │ i64 ┆ str │
# ╞═══════╪════════╡
# │ 1 ┆ Result │
# │ 0 ┆ A │
# │ 1 ┆ A │
# │ 1 ┆ A │
# │ 0 ┆ A │
# └───────┴────────┘
The better solution is to attempt to recognise any pattern that allows the computation to be done in a vectorised way. In this (likely contrived) example, it is that once Index == 0
has been seen once, the result of the remainder of the column is "A"
df.with_columns(
# If the row number is >= to the first row an index of 0 was seen
pl.when(pl.int_range(pl.len()) >= pl.arg_where(pl.col("Index") == 0).min())
.then(pl.lit("A"))
.otherwise(pl.lit("Result"))
.alias("Result")
)
# Returns the same output as above