I have a DataFrame with several groups. Each row has a weight and a list of values, and the lists can be of arbitrary length. For example:
import polars as pl

df = pl.DataFrame(
    {
        "Group": ["Group1", "Group2", "Group3"],
        "Weight": [100.0, 200.0, 300.0],
        "Vals": [[0.5, 0.5, 0.8], [0.5, 0.5, 0.8], [0.7, 0.9]],
    }
)
┌────────┬────────┬─────────────────┐
│ Group ┆ Weight ┆ Vals │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ list[f64] │
╞════════╪════════╪═════════════════╡
│ Group1 ┆ 100.0 ┆ [0.5, 0.5, 0.8] │
│ Group2 ┆ 200.0 ┆ [0.5, 0.5, 0.8] │
│ Group3 ┆ 300.0 ┆ [0.7, 0.9] │
└────────┴────────┴─────────────────┘
My goal is to calculate a 'Weighted' column, which would be the product of each item in the Vals list and the value in the Weight column:
┌────────┬────────┬─────────────────┬─────────────────┐
│ Group ┆ Weight ┆ Vals ┆ Weighted │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ list[f64] ┆ list[i64] │
╞════════╪════════╪═════════════════╪═════════════════╡
│ Group1 ┆ 100.0 ┆ [0.5, 0.5, 0.8] ┆ [50, 50, 80] │
│ Group2 ┆ 200.0 ┆ [0.5, 0.5, 0.8] ┆ [100, 100, 160] │
│ Group3 ┆ 300.0 ┆ [0.7, 0.9] ┆ [210, 270] │
└────────┴────────┴─────────────────┴─────────────────┘
I've tried a few different things:
df.with_columns(
    pl.col("Vals").list.eval(pl.element() * 3).alias("Weight1"),  # Multiplying with a literal works
    pl.col("Vals").list.eval(pl.element() * pl.col("Weight")).alias("Weight2"),  # Does not work
    pl.col("Vals").list.eval(pl.element() * pl.col("Unknown")).alias("Weight3"),  # Unknown columns give the same value
    pl.col("Vals").list.eval(pl.col("Vals") * pl.col("Weight")).alias("Weight4"),  # Same effect
    # pl.col("Vals") * 3 -> gives an error
)
┌────────┬────────┬────────────┬────────────┬──────────────┬──────────────┬────────────────────┐
│ Group ┆ Weight ┆ Vals ┆ Weight1 ┆ Weight2 ┆ Weight3 ┆ Weight4 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ list[f64] ┆ list[f64] ┆ list[f64] ┆ list[f64] ┆ list[f64] │
╞════════╪════════╪════════════╪════════════╪══════════════╪══════════════╪════════════════════╡
│ Group1 ┆ 100.0 ┆ [0.5, 0.5, ┆ [1.5, 1.5, ┆ [0.25, 0.25, ┆ [0.25, 0.25, ┆ [0.25, 0.25, 0.64] │
│ ┆ ┆ 0.8] ┆ 2.4] ┆ 0.64] ┆ 0.64] ┆ │
│ Group2 ┆ 200.0 ┆ [0.5, 0.5, ┆ [1.5, 1.5, ┆ [0.25, 0.25, ┆ [0.25, 0.25, ┆ [0.25, 0.25, 0.64] │
│ ┆ ┆ 0.8] ┆ 2.4] ┆ 0.64] ┆ 0.64] ┆ │
│ Group3 ┆ 300.0 ┆ [0.7, 0.9] ┆ [2.1, 2.7] ┆ [0.49, 0.81] ┆ [0.49, 0.81] ┆ [0.49, 0.81] │
└────────┴────────┴────────────┴────────────┴──────────────┴──────────────┴────────────────────┘
Unless I'm misunderstanding something, it seems you can't access columns outside of the list from within list.eval: in Weight2 to Weight4, each element is simply multiplied by itself (0.5 becomes 0.25, 0.8 becomes 0.64) rather than by the weight. Perhaps there is a way to use a list comprehension within the statement, but that doesn't seem like a neat solution.
What would be the recommended approach here? Any help would be appreciated!
As of the latest version of Polars, this is the correct syntax: explode the list column, multiply row by row, then collect the values back into lists per group.
df = pl.DataFrame(
    {
        "Group": ["Group1", "Group2", "Group3"],
        "Weight": [100.0, 200.0, 300.0],
        "Vals": [[0.5, 0.5, 0.8], [0.5, 0.5, 0.8], [0.7, 0.9]],
    }
)

(
    df
    .explode("Vals")
    .with_columns(Weighted=pl.col("Weight") * pl.col("Vals"))
    .group_by("Group")
    .agg(
        pl.col("Weight").first(),
        pl.col("Vals"),
        pl.col("Weighted"),
    )
)
shape: (3, 4)
┌────────┬────────┬─────────────────┬───────────────────────┐
│ Group ┆ Weight ┆ Vals ┆ Weighted │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ list[f64] ┆ list[f64] │
╞════════╪════════╪═════════════════╪═══════════════════════╡
│ Group3 ┆ 300.0 ┆ [0.7, 0.9] ┆ [210.0, 270.0] │
│ Group1 ┆ 100.0 ┆ [0.5, 0.5, 0.8] ┆ [50.0, 50.0, 80.0] │
│ Group2 ┆ 200.0 ┆ [0.5, 0.5, 0.8] ┆ [100.0, 100.0, 160.0] │
└────────┴────────┴─────────────────┴───────────────────────┘