Tags: python, python-3.x, dataframe, python-polars, polars

How to properly extract all duplicated rows with a condition in a Polars DataFrame?


Given a Polars DataFrame, I want to extract all duplicated rows while also applying an additional filter condition. For example:

import polars as pl

df = pl.DataFrame({
    "name": ["Alice", "Bob", "Alice", "David", "Eve", "Bob", "Frank"],
    "city": ["NY", "LA", "NY", "SF", "LA", "LA", "NY"],
    "age": [25, 30, 25, 35, 28, 30, 40]
})

# Trying this:
df.filter((df.is_duplicated()) & (pl.col("city") == "NY"))  # error

However, this results in an error:

SchemaError: cannot unpack series of type object into bool

This suggests that df.is_duplicated() returns a Series of type Object, but in reality it returns a Boolean Series.

Surprisingly, reordering the predicates by placing the expression first makes it work (but why?):
df.filter((pl.col("city") == "NY") & (df.is_duplicated())) # works! correctly outputs:

shape: (2, 3)
┌───────┬──────┬─────┐
│ name  ┆ city ┆ age │
│ ---   ┆ ---  ┆ --- │
│ str   ┆ str  ┆ i64 │
╞═══════╪══════╪═════╡
│ Alice ┆ NY   ┆ 25  │
│ Alice ┆ NY   ┆ 25  │
└───────┴──────┴─────┘

I understand that the optimal approach when filtering for duplicates based on a subset of columns is to use pl.struct, like:
df.filter((pl.struct(df.columns).is_duplicated()) & (pl.col("city") == "NY")) # works
This works fine with the additional filter condition.

However, I'm intentionally not using pl.struct because my real dataframe has 40 columns, and I want to check for duplicated rows based on all the columns except three, so I did the following:
df.filter(df.drop("col1", "col2", "col3").is_duplicated())

This works fine and is much more convenient than writing all 37 columns in a pl.struct. However, it breaks when an additional filter condition is added on the right, but not on the left:

df.filter(
    (df.drop("col1", "col2", "col3").is_duplicated()) & (pl.col("col5") == "something")
    )  # breaks!

df.filter(
    (pl.col("col5") == "something") & (df.drop("col1", "col2", "col3").is_duplicated())
    )  # works!

Why does the ordering of predicates (Series & Expression vs Expression & Series) matter inside .filter() in this case? Is this intended behavior in Polars, or a bug?


Solution

  • The error is not specific to .filter(), and I don't think it's a bug.

    An expression accepts a Series as the right-hand operand, and the result is again an expression.

    pl.lit(True) & pl.Series([1, 2])
    # <Expr ['[(true) & (Series)]'] at 0x134D05F90>
    

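    The other way round doesn't make sense, and errors. Python dispatches & to the left operand first, so the Series handles the operation and apparently tries to treat the expression as data.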

    pl.Series([1, 2]) & pl.lit(True)
    # ComputeError: cannot cast 'Object' type
    

    As for using a struct, you don't need to list all 37 columns: you can wrap pl.exclude() in pl.struct().

    pl.struct(pl.exclude("col1", "col2", "col3")).is_duplicated()