How to calculate a growing/expanding percentile in Polars using cumulative_eval?


I'm trying to calculate a growing percentile for a column in my Polars DataFrame. The goal is to calculate the percentile from the beginning of the column up until the current observation.

Example Data:

import polars as pl

data = pl.DataFrame({
    "premia_pct": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
})

I want to create a new column "premia_percentile" that calculates the percentile of "premia_pct" from the start of the column up to the current row.

I tried using the cumulative_eval function in Polars as follows:

df = data.with_columns(
    pl.col("premia_pct").cumulative_eval(
        lambda group: (group.rank() / group.len()).last(),
        min_periods=1
    ).alias("premia_percentile")
)

However, I get the following error: AttributeError: 'function' object has no attribute '_pyexpr'

I have also tried a for loop:

        for i in range(1, data.shape[0] + 1):
            data["premia_percentile"][:i] = data["premia_pct"][:i].rank() / i

        return data

but this is not how Polars is supposed to be used, and it doesn't work either, even if I use pl.slice(1,i) instead of [:i].

Maybe there is something similar to pandas' expanding()?

This pandas-based implementation produces the output I expect:

    def _calculate_growing_percentile(
        self, data, column_name: str = "premia_percentile"
    ):
        """
        Calculate a growing percentile of the "premia_pct" column.

        Parameters:
        data (pl.DataFrame): The DataFrame.
        column_name (str): The name of the new percentile column.

        Returns:
        pl.DataFrame: The DataFrame with the new column.
        """
        # Convert to pandas so that expanding() is available
        data = data.to_pandas()

        # Calculate the growing percentile
        data[column_name] = data["premia_pct"].expanding().rank(pct=True)
        # Convert the result back to Polars
        data = pl.from_pandas(data)
        return data

Is there a way to calculate a growing percentile in Polars using cumulative_eval or any other function? Any help would be greatly appreciated.

Here is a similar post SO Question


Solution

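  • Here is a Polars equivalent of the pandas implementation above. Note that cumulative_eval takes an expression built on pl.element(), not a lambda, which is why the attempt in the question raised the AttributeError: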

    rank_calc = (pl.element().rank() / pl.element().len()).last()
    df = data.with_columns(
        pl.col("premia_pct").cumulative_eval(rank_calc, min_periods=1).alias("premia_percentile")
    )
    

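    The percentile rank calculation is evaluated on the window from the first row up to the current one: it ranks every value in the window, divides by the window length, and keeps the result for the last (current) row: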

    (pl.element().rank() / pl.element().len()).last()
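
To make the row-by-row arithmetic concrete, here is a minimal plain-Python sketch of the same calculation. It assumes ties are averaged, which is the default rank method in both pandas and Polars. The function name expanding_percentile_rank and the sample values are illustrative only and are not part of either library or of the question.

```python
def expanding_percentile_rank(values):
    """Percentile rank of each value within values[: i + 1], ties averaged."""
    result = []
    for i, current in enumerate(values):
        window = values[: i + 1]  # first row up to the current row
        smaller = sum(1 for v in window if v < current)
        equal = sum(1 for v in window if v == current)  # includes current itself
        # Tied values occupy positions smaller+1 .. smaller+equal;
        # their average position is the rank assigned to each of them.
        average_rank = smaller + (equal + 1) / 2
        result.append(average_rank / len(window))
    return result


# Illustrative, non-monotonic sample with one tie (not the question's data).
sample = [0.3, 0.1, 0.2, 0.3]
ranks = expanding_percentile_rank(sample)
print([round(r, 4) for r in ranks])  # [1.0, 0.5, 0.6667, 0.875]
```

On the strictly increasing sample data from the question, every row comes out as 1.0, because each new observation is the largest seen so far. A non-monotonic sample like the one above is more useful for checking that an implementation behaves as intended.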