I'm trying to use polars.Series.cut to reproduce the data binning behavior of pandas.cut.
MRE of values and breakpoints:
scores = [1111, 65, 88, -1111, 92]
breaks = [0, 50, 60, 70, 80, 90, 100]
With pandas.cut, the bin is null if the value is outside the defined edges:
import pandas as pd
df = pd.DataFrame({'score': scores})
df['bin'] = pd.cut(df['score'], breaks)
#    score            bin
# 0   1111            NaN  <- null in pandas
# 1     65   (60.0, 70.0]
# 2     88   (80.0, 90.0]
# 3  -1111            NaN  <- null in pandas
# 4     92  (90.0, 100.0]
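To make the target behavior explicit, here is a minimal pure-Python sketch of the rule that the pandas output above follows. It assumes sorted breaks and the pandas.cut defaults (right-closed intervals, lowest edge not included). The helper name right_closed_bin is hypothetical and only used for illustration.

```python
from bisect import bisect_left

def right_closed_bin(value, breaks):
    """Return the (lo, hi] edges containing value, or None if it lies outside the edges."""
    # bisect_left finds the first break >= value, so value == hi lands in (lo, hi]
    i = bisect_left(breaks, value)
    # i == 0: value <= lowest edge; i == len(breaks): value > highest edge
    if i == 0 or i == len(breaks):
        return None
    return (breaks[i - 1], breaks[i])

scores = [1111, 65, 88, -1111, 92]
breaks = [0, 50, 60, 70, 80, 90, 100]
bins = [right_closed_bin(s, breaks) for s in scores]
print(bins)
```

Values at or below the lowest edge and values above the highest edge both map to None, which is the behavior to reproduce in Polars.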
But with polars.Series.cut, it seems we're forced to include the (-inf, ...] and (..., inf] bins:
import polars as pl
df = pl.DataFrame({'score': scores})
df = df.with_columns(bin=pl.col('score').cut(breaks))
# shape: (5, 2)
# ┌───────┬────────────┐
# │ score ┆ bin        │
# │ ---   ┆ ---        │
# │ i64   ┆ cat        │
# ╞═══════╪════════════╡
# │ 1111  ┆ (100, inf] │  <- not null in polars
# │ 65    ┆ (60, 70]   │
# │ 88    ┆ (80, 90]   │
# │ -1111 ┆ (-inf, 0]  │  <- not null in polars
# │ 92    ┆ (90, 100]  │
# └───────┴────────────┘
How can we replicate the pandas.cut bins in Polars?
What about using is_between as a filter?
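# Keep the bin only where the score lies in (breaks[0], breaks[-1]];
# with no otherwise(), all remaining rows become null.
df.with_columns(
    bin=pl.when(pl.col('score').is_between(breaks[0], breaks[-1], closed='right'))
    .then(pl.col('score').cut(breaks))
)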
Or, better, pre-filter: null out the out-of-range scores first and then cut the result, which avoids computing a category only to discard it:
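# Out-of-range scores become null first; cut() is then applied to the
# result of when/then, and null inputs stay null.
df.with_columns(
    bin=pl.when(pl.col('score').is_between(breaks[0], breaks[-1], closed='right'))
    .then(pl.col('score'))
    .cut(breaks)
)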
Output:
┌───────┬───────────┐
│ score ┆ bin       │
│ ---   ┆ ---       │
│ i64   ┆ cat       │
╞═══════╪═══════════╡
│ 1111  ┆ null      │
│ 65    ┆ (60, 70]  │
│ 88    ┆ (80, 90]  │
│ -1111 ┆ null      │
│ 92    ┆ (90, 100] │
└───────┴───────────┘