pythondataframepython-polars

Python Polars recursion


I've used Polars for some time now but this is something that often makes me go from Polars DataFrames to native Python calculations. I've spent resonable time looking for solutions that (tries) to use shift(), rolling(), group_by_dynamic() and so on but none is successful.

Task

Do calculation that depends on previous calculation's result that is in the same column.

Example in Excel

In Excel this is like the most straighforward formula ever...if the "index" is zero I want to return "A", otherwise I want to return the result from the cell above.

A B C
1 Index Result Formula for the "Result" column
2 0 A =IF(A2=0;"A";B1)
3 1 A =IF(A3=0;"A";B2)

Where is the recursion

In column "B" the formula refers to the previously calculated values on the same column "B".

enter image description here

Copy & Paste Excel's solution to Polars

# Import Polars module.
import polars as pl

# Create the data.
data = {'Index': [0, 1]}

# Create the DataFrame.
df = pl.from_dict(data)

# Add a column to the DataFrame.
df = df.with_columns(

    # Tries to reproduce the Excel formula.
    Result = pl.when(
        pl.col('Index') == 0
    ).then(
        pl.lit('A')
    ).otherwise(
        pl.col('Result')
    )

)

The issue

Within the "with_columns()" method the "Result" column cannot be referred because It doens't exist in the DataFrame yet. If we try to do so, we get a ColumnNotFoundError:

enter image description here

Question

Any idea on how can I accomplish such a simple task on Polars?

Thank you,


Solution

  • How can I do a calculation that depends on previous (row's) calculation result that is in the same column?

    The short answer is that you can't without falling back into Python. To do this, any library would need to essentially need to iterate over the rows, only calculating a single row at a time. This means any sort of vectorisation is not possible.

    Polars offers map_elements for this use-case, but it is discouraged. From the docs:

    This method is much slower than the native expressions API. Only use it if you cannot implement your logic otherwise.

    df = pl.DataFrame({'Index': [1, 0, 1, 1, 0]})
    
    
    previous_result = "Result" # hardcode the header as the intial "previous result"
    def f(index):
        global previous_result
        out = "A" if index == 0 else previous_result
        previous_result = out
        return out
    
    print(df.with_columns(Result=pl.col("Index").map_elements(f, return_dtype=pl.String)))
    # shape: (5, 2)
    # ┌───────┬────────┐
    # │ Index ┆ Result │
    # │ ---   ┆ ---    │
    # │ i64   ┆ str    │
    # ╞═══════╪════════╡
    # │ 1     ┆ Result │
    # │ 0     ┆ A      │
    # │ 1     ┆ A      │
    # │ 1     ┆ A      │
    # │ 0     ┆ A      │
    # └───────┴────────┘
    

    The better solution is to attempt to recognise any pattern that allows the computation to be done in a vectorised way. In this (likely contrived) example, it is that once Index == 0 has been seen once, the result of the remainder of the column is "A"

    df.with_columns(
        # If the row number is >= to the first row an index of 0 was seen
        pl.when(pl.int_range(pl.len()) >= pl.arg_where(pl.col("Index") == 0).min())
        .then(pl.lit("A"))
        .otherwise(pl.lit("Result"))
        .alias("Result")
    )
    # Returns the same output as above