I have a data frame of the type
pl.DataFrame({"id":[1,2,2,2,2,3,3,3],"value":[5,6,1,2,3,30,10,20]})
┌─────┬───────┐
│ id ┆ value │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═══════╡
│ 1 ┆ 5 │
│ 2 ┆ 6 │
│ 2 ┆ 1 │
│ 2 ┆ 2 │
│ 2 ┆ 3 │
│ 3 ┆ 30 │
│ 3 ┆ 10 │
│ 3 ┆ 20 │
└─────┴───────┘
In case of multiple rows with the same id the value of the first row is always the sum of the value's of the remaining rows. My goal is the remove these sums from the data frame, i.e. I want to obtain
┌─────┬───────┐
│ id ┆ value │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═══════╡
│ 1 ┆ 5 │
│ 2 ┆ 1 │
│ 2 ┆ 2 │
│ 2 ┆ 3 │
│ 3 ┆ 10 │
│ 3 ┆ 20 │
└─────┴───────┘
Is there any efficient way to do this in polars? I am aware of group_by but I don't know how to do conditional logic based on the number of elements in a list.
You can use filter with window functions.
The two cases you want to keep are that (the count of ids is greater than one and that it isn't the first row) or that the count of ids is just 1.
That looks like this:
df.filter(
(
(pl.count().over('id')>1) &
(pl.col('value')!=pl.col('value').first().over('id'))
) |
(pl.count().over('id')==1)
)
shape: (6, 2)
┌─────┬───────┐
│ id ┆ value │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═══════╡
│ 1 ┆ 5 │
│ 2 ┆ 1 │
│ 2 ┆ 2 │
│ 2 ┆ 3 │
│ 3 ┆ 10 │
│ 3 ┆ 20 │
└─────┴───────┘