I'm trying to calculate a growing percentile for a column in my Polars DataFrame. The goal is to calculate the percentile from the beginning of the column up until the current observation.
Example Data:
import polars as pl
data = pl.DataFrame({
    "premia_pct": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
})
I want to create a new column "premia_percentile" that calculates the percentile of "premia_pct" from the start of the column up to the current row.
I tried using the cumulative_eval function in Polars as follows:
df = data.with_columns(
    pl.col("premia_pct").cumulative_eval(
        lambda group: (group.rank() / group.len()).last(),
        min_periods=1
    ).alias("premia_percentile")
)
However, I get the following error: AttributeError: 'function' object has no attribute '_pyexpr'
I have also tried a for loop:
for i in range(1, data.shape[0] + 1):
    data["premia_percentile"][:i] = data["premia_pct"][:i].rank() / i
return data
but this is not how Polars is supposed to be used, and it doesn't work either, even if I use pl.slice(1,i) instead of [:i].
Maybe something similar to pandas' expanding() could be used?
This pandas-based implementation produces the output I expect:
def _calculate_growing_percentile(
    self, data, column_name: str = "premia_percentile"
):
    """
    Calculate a growing percentile of the "premia_pct" column in a DataFrame.
    Parameters:
    data (pl.DataFrame): The DataFrame.
    column_name (str): The name of the new percentile column.
    Returns:
    pl.DataFrame: The DataFrame with the new column.
    """
    # Convert to a pandas DataFrame
    data = data.to_pandas()
    # Calculate the growing percentile
    data[column_name] = data["premia_pct"].expanding().rank(pct=True)
    # Convert back to Polars
    data = pl.from_pandas(data)
    return data
Is there a way to calculate a growing percentile in Polars using cumulative_eval or any other function? Any help would be greatly appreciated.
Here is a similar Stack Overflow question.
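cumulative_eval takes a Polars expression rather than a Python lambda, which is why the lambda version raises the AttributeError. Inside that expression, pl.element() refers to the values from the start of the column up to the current row. Here is an equivalent of the pandas implementation above:
rank_calc = (pl.element().rank() / pl.element().len()).last()
df = data.with_columns(
    pl.col("premia_pct").cumulative_eval(rank_calc, min_periods=1).alias("premia_percentile")
)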
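The percentile rank calculation ranks the values seen so far, divides by how many there are, and keeps the result for the current (last) row: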
(pl.element().rank() / pl.element().len()).last()
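To make the semantics concrete, here is a plain-Python sketch of what that expression works out for each growing window. The function name expanding_pct_rank is hypothetical (it is not part of Polars or pandas), and the sketch assumes ties get their average rank, which is the default rank method in both libraries.

```python
def expanding_pct_rank(values):
    """Percentile rank of each value within the prefix that ends at it."""
    out = []
    for i, current in enumerate(values):
        window = values[: i + 1]                   # start of column up to current row
        less = sum(v < current for v in window)    # values strictly below current
        equal = sum(v == current for v in window)  # ties, including current itself
        avg_rank = less + (equal + 1) / 2          # 1-based average rank among ties
        out.append(avg_rank / len(window))         # rank divided by window length
    return out


ranks = expanding_pct_rank([0.3, 0.1, 0.2, 0.2, 0.5])
print(ranks)  # [1.0, 0.5, 0.666..., 0.625, 1.0]
```

The fourth value shows the tie handling: the two 0.2 entries share ranks 2 and 3, so each gets 2.5, and 2.5 / 4 = 0.625.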