
Python Polars: Encoding Continuous Variables from Breakpoints in Another DataFrame


The breakpoints data is the following:

breakpoints = pl.DataFrame(
    {
        "features": ["feature_0", "feature_0", "feature_1"],
        "breakpoints": [0.1, 0.5, 1],
        "n_possible_bins": [3, 3, 2],
    }
)
print(breakpoints)
out:
shape: (3, 3)
┌───────────┬─────────────┬─────────────────┐
│ features  ┆ breakpoints ┆ n_possible_bins │
│ ---       ┆ ---         ┆ ---             │
│ str       ┆ f64         ┆ i64             │
╞═══════════╪═════════════╪═════════════════╡
│ feature_0 ┆ 0.1         ┆ 3               │
│ feature_0 ┆ 0.5         ┆ 3               │
│ feature_1 ┆ 1.0         ┆ 2               │
└───────────┴─────────────┴─────────────────┘

The df has two continuous variables that we wish to encode according to the breakpoints DataFrame:

df = pl.DataFrame(
    {"feature_0": [0.05, 0.2, 0.6, 0.8], "feature_1": [0.5, 1.5, 1.0, 1.1]}
)
print(df)
out:
shape: (4, 2)
┌───────────┬───────────┐
│ feature_0 ┆ feature_1 │
│ ---       ┆ ---       │
│ f64       ┆ f64       │
╞═══════════╪═══════════╡
│ 0.05      ┆ 0.5       │
│ 0.2       ┆ 1.5       │
│ 0.6       ┆ 1.0       │
│ 0.8       ┆ 1.1       │
└───────────┴───────────┘

After the encoding we should have the resulting DataFrame encoded_df:

encoded_df = pl.DataFrame({"feature_0": [0, 1, 2, 2], "feature_1": [0, 1, 0, 1]})

print(encoded_df)
out:
shape: (4, 2)
┌───────────┬───────────┐
│ feature_0 ┆ feature_1 │
│ ---       ┆ ---       │
│ i64       ┆ i64       │
╞═══════════╪═══════════╡
│ 0         ┆ 0         │
│ 1         ┆ 1         │
│ 2         ┆ 0         │
│ 2         ┆ 1         │
└───────────┴───────────┘
  1. We can assume that every feature in encoded_df is also present in breakpoints.
  2. Labels should be an array: np.array([str(i) for i in range(n_possible_bins)]), assuming n_possible_bins is a positive integer. n_possible_bins may differ across features.
  3. All encoding follows left_closed=False, i.e. the bins are right-closed intervals of the form (breakpoint, next breakpoint].
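
To make the right-closed rule in item 3 concrete, here is a minimal plain-Python sketch (not Polars). It assumes the bin index is simply the number of breakpoints strictly below the value, which is what bisect_left returns on a sorted list. The helper name encode is hypothetical and only used for illustration.

```python
from bisect import bisect_left


def encode(values, breaks):
    """Hypothetical helper: bin index under (breakpoint, next breakpoint] bins."""
    breaks = sorted(breaks)
    # bisect_left counts the breakpoints strictly less than v, so a value
    # equal to a breakpoint stays in the lower (right-closed) bin.
    return [bisect_left(breaks, v) for v in values]


feature_0 = encode([0.05, 0.2, 0.6, 0.8], [0.1, 0.5])
feature_1 = encode([0.5, 1.5, 1.0, 1.1], [1.0])
```

With the data from the question this reproduces the two columns of encoded_df; note that 1.0 in feature_1 lands in bin 0 because the bin (-inf, 1.0] includes its right edge.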

I know that polars.Expr.cut() takes its breaks parameter as a Sequence[float], but how do I pass in the breakpoints and labels from the breakpoints DataFrame effectively?


Solution

  • Given that breakpoints will most likely be a very small DataFrame, I think the simplest and most efficient solution is something like:

    import polars as pl
    
    breakpoints = pl.DataFrame(
        {
            "features": ["feature_0", "feature_0", "feature_1"],
            "breakpoints": [0.1, 0.5, 1],
            "n_possible_bins": [3, 3, 2],
        }
    )
    
    df = pl.DataFrame(
        {"feature_0": [0.05, 0.2, 0.6, 0.8], "feature_1": [0.5, 1.5, 1.0, 1.1]}
    )
    
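    # Aggregate the breakpoints by feature into one sorted list per feature.
    # maintain_order=True keeps the features in their order of first appearance,
    # so the column order of the result is deterministic.
    feature_breaks = breakpoints.group_by("features", maintain_order=True).agg(
        pl.col("breakpoints").sort().alias("breaks")
    )
    
    # For each feature, call `Expr.cut` with the respective `breaks`.
    # `cut` needs one more label than there are breaks.
    result = df.select(
        pl.col(feat).cut(breaks, labels=[str(x) for x in range(len(breaks) + 1)])
        for feat, breaks in feature_breaks.iter_rows()
    )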
    

    Output:

    >>> feature_breaks
    
    shape: (2, 2)
    ┌───────────┬────────────┐
    │ features  ┆ breaks     │
    │ ---       ┆ ---        │
    │ str       ┆ list[f64]  │
    ╞═══════════╪════════════╡
    │ feature_0 ┆ [0.1, 0.5] │
    │ feature_1 ┆ [1.0]      │
    └───────────┴────────────┘
    
    >>> result
    
    shape: (4, 2)
    ┌───────────┬───────────┐
    │ feature_0 ┆ feature_1 │
    │ ---       ┆ ---       │
    │ cat       ┆ cat       │
    ╞═══════════╪═══════════╡
    │ 0         ┆ 0         │
    │ 1         ┆ 1         │
    │ 2         ┆ 0         │
    │ 2         ┆ 1         │
    └───────────┴───────────┘