[SOLVED] Bin a polars column in the same way as pandas.cut

I'm trying to use polars.Series.cut to reproduce the data binning behavior of pandas.cut.

MRE of values and breakpoints:

scores = [1111, 65, 88, -1111, 92]
breaks = [0, 50, 60, 70, 80, 90, 100]

With pandas.cut, the bin is null if the value is outside the defined edges:

import pandas as pd

df = pd.DataFrame({'score': scores})
df['bin'] = pd.cut(df['score'], breaks)

#    score            bin
# 0   1111            NaN  <- null in pandas
# 1     65   (60.0, 70.0]
# 2     88   (80.0, 90.0]
# 3  -1111            NaN  <- null in pandas
# 4     92  (90.0, 100.0]

But with polars.Series.cut, it seems we're forced to include the (-inf, ...] and (..., inf] bins:

import polars as pl

df = pl.DataFrame({'score': scores})
df = df.with_columns(bin=pl.col('score').cut(breaks))

# shape: (5, 2)
# ┌───────┬────────────┐
# │ score ┆ bin        │
# │ ---   ┆ ---        │
# │ i64   ┆ cat        │
# ╞═══════╪════════════╡
# │ 1111  ┆ (100, inf] │  <- not null in polars
# │ 65    ┆ (60, 70]   │
# │ 88    ┆ (80, 90]   │
# │ -1111 ┆ (-inf, 0]  │  <- not null in polars
# │ 92    ┆ (90, 100]  │
# └───────┴────────────┘
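
The difference between the two outputs can be modelled in plain Python. The following is only a sketch of the binning semantics, not how either library is implemented; `bin_right_closed` and its `unbounded` flag are hypothetical names introduced for illustration. With `unbounded=True` the two open-ended bins exist, as in the polars output; with `unbounded=False` values outside the outer edges get no bin, as in the pandas output.

```python
from bisect import bisect_left
from math import inf

def bin_right_closed(value, breaks, unbounded):
    """Return the right-closed (lo, hi] interval containing value, or None.

    Hypothetical helper, for illustration only.
    unbounded=True  models the polars output: (-inf, first] and (last, inf] exist.
    unbounded=False models the pandas output: values outside the edges get None.
    """
    # First index i with breaks[i] >= value, so value lies in (breaks[i-1], breaks[i]].
    i = bisect_left(breaks, value)
    if unbounded:
        edges = [-inf, *breaks, inf]
        return (edges[i], edges[i + 1])
    if i == 0 or i == len(breaks):
        return None  # at or below the first edge, or above the last edge
    return (breaks[i - 1], breaks[i])

scores = [1111, 65, 88, -1111, 92]
breaks = [0, 50, 60, 70, 80, 90, 100]

polars_like = [bin_right_closed(s, breaks, unbounded=True) for s in scores]
pandas_like = [bin_right_closed(s, breaks, unbounded=False) for s in scores]
print(polars_like)
print(pandas_like)
```

Because the bins are right-closed, a score equal to the first edge (0 here) also falls outside every bounded bin, which is why a replacement has to exclude the left edge and include the right one.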

How can we replicate the pandas.cut bins in Polars?

Solution

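One option is to use is_between as a mask, keeping the cut result only for scores inside the outer edges. closed='right' matches the right-closed bins, so the first edge is excluded and the last edge is included: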

df.with_columns(
    bin=pl.when(pl.col('score').is_between(breaks[0], breaks[-1], closed='right'))
          .then(pl.col('score').cut(breaks))
)

Or, better, mask the scores first and apply cut to the result, so that out-of-range values are already null when they are binned:

df.with_columns(
    bin=pl.when(pl.col('score').is_between(breaks[0], breaks[-1], closed='right'))
          .then(pl.col('score'))  # out-of-range scores become null
          .cut(breaks)            # nulls stay null when binned
)

Output:

┌───────┬───────────┐
│ score ┆ bin       │
│ ---   ┆ ---       │
│ i64   ┆ cat       │
╞═══════╪═══════════╡
│ 1111  ┆ null      │
│ 65    ┆ (60, 70]  │
│ 88    ┆ (80, 90]  │
│ -1111 ┆ null      │
│ 92    ┆ (90, 100] │
└───────┴───────────┘