Given a polars dataframe, I want to extract all duplicated rows while also applying an additional filter condition, for example:
import polars as pl

df = pl.DataFrame({
    "name": ["Alice", "Bob", "Alice", "David", "Eve", "Bob", "Frank"],
    "city": ["NY", "LA", "NY", "SF", "LA", "LA", "NY"],
    "age": [25, 30, 25, 35, 28, 30, 40],
})
# Trying this:
df.filter((df.is_duplicated()) & (pl.col("city") == "NY")) # error
However, this results in an error:
SchemaError: cannot unpack series of type `object` into `bool`
This suggests that df.is_duplicated() returns a Series of type object, but in reality it returns a Boolean Series.
Surprisingly, reordering the predicates by placing the expression first makes it work (but why?):
df.filter((pl.col("city") == "NY") & (df.is_duplicated())) # works!
This correctly outputs:
shape: (2, 3)
┌───────┬──────┬─────┐
│ name  ┆ city ┆ age │
│ ---   ┆ ---  ┆ --- │
│ str   ┆ str  ┆ i64 │
╞═══════╪══════╪═════╡
│ Alice ┆ NY   ┆ 25  │
│ Alice ┆ NY   ┆ 25  │
└───────┴──────┴─────┘
I understand that the optimal approach when filtering for duplicates based on a subset of columns is to use pl.struct, like:
df.filter((pl.struct(df.columns).is_duplicated()) & (pl.col("city") == "NY")) # works
Which works fine with the additional filter condition.
However, I'm intentionally not using pl.struct because my real dataframe has 40 columns, and I want to check for duplicated rows based on all the columns except three, so I did the following:
df.filter(df.drop("col1", "col2", "col3").is_duplicated())
Which works fine and is much more convenient than writing all 37 columns in a pl.struct. However, this breaks when adding an additional filter condition to the right, but not to the left:
df.filter(
    (df.drop("col1", "col2", "col3").is_duplicated()) & (pl.col("col5") == "something")
)  # breaks!
df.filter(
    (pl.col("col5") == "something") & (df.drop("col1", "col2", "col3").is_duplicated())
)  # works!
Why does the ordering of predicates (Series & Expression vs Expression & Series) matter inside .filter() in this case?
Is this intended behavior in Polars, or a bug?
The error is not specific to .filter(), and I don't think it's a bug.
Expressions accept a Series on the right-hand side (RHS), and the result is an expression.
pl.lit(True) & pl.Series([1, 2])
# <Expr ['[(true) & (Series)]'] at 0x134D05F90>
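But the other way round doesn't work: with a Series on the left, Python dispatches to the Series operator first, which apparently treats the expression as an opaque Python object, and it errors.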
pl.Series([1, 2]) & pl.lit(True)
# ComputeError: cannot cast 'Object' type
As for using a struct, you can wrap pl.exclude() in a struct:
pl.struct(pl.exclude("col1", "col2", "col3")).is_duplicated()