dataframe, python-polars, binning

Bin a polars column in the same way as pandas.cut


I'm trying to use polars.Series.cut to reproduce the data binning behavior of pandas.cut.

MRE of values and breakpoints:

scores = [1111, 65, 88, -1111, 92]
breaks = [0, 50, 60, 70, 80, 90, 100]

With pandas.cut, the bin is null if the value is outside the defined edges:

import pandas as pd

df = pd.DataFrame({'score': scores})
df['bin'] = pd.cut(df['score'], breaks)

#    score            bin
# 0   1111            NaN  <- null in pandas
# 1     65   (60.0, 70.0]
# 2     88   (80.0, 90.0]
# 3  -1111            NaN  <- null in pandas
# 4     92  (90.0, 100.0]

But with polars.Series.cut, it seems we're forced to include the (-inf, ...] and (..., inf] bins:

import polars as pl

df = pl.DataFrame({'score': scores})
df = df.with_columns(bin=pl.col('score').cut(breaks))

# shape: (5, 2)
# ┌───────┬────────────┐
# │ score ┆ bin        │
# │ ---   ┆ ---        │
# │ i64   ┆ cat        │
# ╞═══════╪════════════╡
# │ 1111  ┆ (100, inf] │  <- not null in polars
# │ 65    ┆ (60, 70]   │
# │ 88    ┆ (80, 90]   │
# │ -1111 ┆ (-inf, 0]  │  <- not null in polars
# │ 92    ┆ (90, 100]  │
# └───────┴────────────┘

How can we replicate the pandas.cut bins in Polars?


Solution

  • What about using is_between as a filter?

    df.with_columns(bin=pl.when(pl.col('score').is_between(breaks[0], breaks[-1], closed='right'))
                          .then(pl.col('score').cut(breaks))
                   )
    

    Or, better, null out the out-of-range scores first and then cut, so that cut only bins the in-range values (nulls pass through as null):

    df.with_columns(bin=pl.when(pl.col('score').is_between(breaks[0], breaks[-1], closed='right'))
                          .then(pl.col('score')).cut(breaks)
                   )
    

    Output:

    ┌───────┬───────────┐
    │ score ┆ bin       │
    │ ---   ┆ ---       │
    │ i64   ┆ cat       │
    ╞═══════╪═══════════╡
    │ 1111  ┆ null      │
    │ 65    ┆ (60, 70]  │
    │ 88    ┆ (80, 90]  │
    │ -1111 ┆ null      │
    │ 92    ┆ (90, 100] │
    └───────┴───────────┘