python python-polars

Polars arithmetic operation using a boolean mask and different sized array


I tried hard with ChatGPT but couldn't make this work, and it couldn't solve it either. How am I supposed to assign or increment a set of values using a boolean mask? This is what I was doing with pandas:

validation_predictions = model.predict(validation_dataset)
df.loc[validation_mask, 'prediction'] += validation_predictions / len(seeds)

validation_mask has N True values and validation_predictions is a NumPy array of size N, so an assignment like this works fine. However, I couldn't achieve the same thing with polars.

I tried a when/then/otherwise chain, but it throws an error because the size of validation_predictions doesn't match the size of the entire dataframe.

df = df.with_columns(
    pl.when(validation_mask)
    .then(pl.col('prediction') + (validation_predictions / len(seeds)))
    .otherwise(pl.col('prediction'))
    .alias('prediction')
)
# ShapeError: cannot evaluate two Series of different lengths (6 and 5)

For reproducibility purposes:

import polars as pl
import numpy as np

seeds = [42, 1337, 0]

df = pl.DataFrame({
    "some_column": [10, 20, 30, 40, 50, 60],
    "prediction": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
})

validation_mask = df["some_column"] > 15
validation_predictions = np.array([0.5, 1.5, 2.5, 3.5, 4.5])

Expected result:

shape: (6, 2)
┌─────────────┬────────────┐
│ some_column ┆ prediction │
│ ---         ┆ ---        │
│ i64         ┆ f64        │
╞═════════════╪════════════╡
│ 10          ┆ 1.0        │
│ 20          ┆ 2.166667   │
│ 30          ┆ 3.5        │
│ 40          ┆ 4.833333   │
│ 50          ┆ 6.166667   │
│ 60          ┆ 7.5        │
└─────────────┴────────────┘

Solution

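  • The validation_predictions array is shorter than the DataFrame, so it cannot be combined with a full-length column directly. I assume that this array will always have one value per row that satisfies the condition, i.e. its length equals len(df.filter(condition)).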

    import polars as pl
    import numpy as np
    
    seeds = [42, 1337, 0]
    
    df = pl.DataFrame(
        {
            "some_column": [10, 20, 30, 40, 50, 60],
            "prediction": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        }
    )
    
    validation_mask = df["some_column"] > 15
    validation_predictions = np.array([0.5, 1.5, 2.5, 3.5, 4.5])
    
    assert sum(validation_mask) == len(validation_predictions)
    
    # pad the validation predictions with NaN to the full length of df
    new_arr = np.full(len(df), np.nan)
    new_arr[validation_mask.to_numpy()] = validation_predictions
    
    df = df.with_columns(
        pl.when(validation_mask)
        .then(pl.col("prediction") + (new_arr / len(seeds)))
        .otherwise(pl.col("prediction"))
        .alias("prediction")
    )
    

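    With this method, the validation predictions are padded with NaN to the length of the DataFrame. The NaN entries never reach the result, because otherwise() keeps the original prediction on every row where the mask is False.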