I tried hard to get this working with ChatGPT, but neither of us could solve it. How do I assign to, or increment, a subset of values selected by a boolean mask? This is what I was doing with pandas:
validation_predictions = model.predict(validation_dataset)
df.loc[validation_mask, 'prediction'] += validation_predictions / len(seeds)
validation_mask contains N True values and validation_predictions is a NumPy array of size N, so an assignment like this works fine. However, I couldn't achieve the same thing with polars.
I tried a when/then/otherwise chain, but it throws an error because the size of validation_predictions doesn't match the size of the entire dataframe.
df = df.with_columns(
    pl.when(validation_mask)
    .then(pl.col('prediction') + (validation_predictions / len(seeds)))
    .otherwise(pl.col('prediction'))
    .alias('prediction')
)
# ShapeError: cannot evaluate two Series of different lengths (6 and 5)
For reproducibility purposes:
import polars as pl
import numpy as np
seeds = [42, 1337, 0]
df = pl.DataFrame({
    "some_column": [10, 20, 30, 40, 50, 60],
    "prediction": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
})
validation_mask = df["some_column"] > 15
validation_predictions = np.array([0.5, 1.5, 2.5, 3.5, 4.5])
Expected result:
shape: (6, 2)
┌─────────────┬────────────┐
│ some_column ┆ prediction │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════════════╪════════════╡
│ 10 ┆ 1.0 │
│ 20 ┆ 2.166667 │
│ 30 ┆ 3.5 │
│ 40 ┆ 4.833333 │
│ 50 ┆ 6.166667 │
│ 60 ┆ 7.5 │
└─────────────┴────────────┘
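As a cross-check, the values in the table above are what a plain NumPy masked increment produces. This is only a minimal sketch of the target behaviour, not a polars solution; the names prediction and mask are illustrative and do not appear in the original code.

```python
import numpy as np

seeds = [42, 1337, 0]
prediction = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
# Same condition as validation_mask: five of the six rows are selected.
mask = np.array([10, 20, 30, 40, 50, 60]) > 15
validation_predictions = np.array([0.5, 1.5, 2.5, 3.5, 4.5])

# In-place masked increment: only the rows where mask is True change.
prediction[mask] += validation_predictions / len(seeds)
print(prediction.round(6))
```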
The validation_predictions array is too short to be combined with the full column. I assume it will always have the same length as the filtered frame, df.filter(validation_mask), so it can be expanded to the full length first:
import polars as pl
import numpy as np
seeds = [42, 1337, 0]
df = pl.DataFrame(
    {
        "some_column": [10, 20, 30, 40, 50, 60],
        "prediction": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    }
)
validation_mask = df["some_column"] > 15
validation_predictions = np.array([0.5, 1.5, 2.5, 3.5, 4.5])
# the mask must select exactly one row per prediction
assert validation_mask.sum() == len(validation_predictions)
# extend the validation predictions to the full frame length, padding with NaN
new_arr = np.full(len(df), np.nan)
new_arr[validation_mask.to_numpy()] = validation_predictions
df = df.with_columns(
    pl.when(validation_mask)
    .then(pl.col("prediction") + (new_arr / len(seeds)))
    .otherwise(pl.col("prediction"))
    .alias("prediction")
)
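With this method, validation_predictions is extended to the full length of the dataframe, with NaN in the rows the mask does not select. Those NaN values never reach the result, because otherwise() keeps the original prediction for those rows.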