pythondataframeduplicatespython-polars

How to filter duplicates based on multiple columns in Polars?


I was earlier able to filter duplicates based on multiple columns using df.filter(pl.col(['A','C']).is_duplicated()) but after the latest version update this is not working.

import polars as pl


df = pl.DataFrame(
    {
        "A": [1,4,4,7,7,10,10,13,16],
        "B": [2,5,5,8,18,11,11,14,17],
        "C": [3,6,6,9,9,12,12,15,18]        
    }
)

Both attempts generate an error:

df.filter(pl.col(['A','C']).is_duplicated())
# ComputeError: The predicate passed to 'LazyFrame.filter' expanded to multiple expression
df.filter(
    df.select(pl.col(['A','C']).is_duplicated())
)
# TypeError: invalid predicate for `filter`

Solution

  • This behavior was noted as ambiguous in 0.16.10 and would return this error:

    exceptions.ComputeError: The predicate passed to 'LazyFrame.filter' expanded to multiple expressions: 
    
            col("A").is_duplicated(),
            col("C").is_duplicated(),
    This is ambiguous. Try to combine the predicates with the 'all' or `any' expression.
    

    However 0.19.0 removed the deprecated behavior of all/any replaced by all_horizontal and any_horizontal. To get the same behavior as the pre-0.16.10, use df.filter(pl.all_horizontal(pl.col(['A','C']).is_duplicated()))

    I've modified the input slightly to reflect the differences between any_horizontal and all_horizontal

    import polars as pl
    
    df = pl.DataFrame(
        {
            "A": [1,3,4,7,7,10,10,13,16],
            "B": [2,5,5,8,18,11,11,14,17],
            "C": [3,6,6,9,9,12,12,15,18]        
        }
    )
    
    # print("legacy run in 0.16.9: ", df.filter(pl.col(['A','C']).is_duplicated()))
    print("all_horizontal: ", df.filter(pl.all_horizontal(pl.col(['A','C']).is_duplicated())))
    print("any_horizontal: ", df.filter(pl.any_horizontal(pl.col(['A','C']).is_duplicated())))
    
    legacy run in 0.16.9:  shape: (4, 3)
    ┌─────┬─────┬─────┐
    │ A   ┆ B   ┆ C   │
    │ --- ┆ --- ┆ --- │
    │ i64 ┆ i64 ┆ i64 │
    ╞═════╪═════╪═════╡
    │ 7   ┆ 8   ┆ 9   │
    │ 7   ┆ 18  ┆ 9   │
    │ 10  ┆ 11  ┆ 12  │
    │ 10  ┆ 11  ┆ 12  │
    └─────┴─────┴─────┘
    
    all_horizontal:  shape: (4, 3)
    ┌─────┬─────┬─────┐
    │ A   ┆ B   ┆ C   │
    │ --- ┆ --- ┆ --- │
    │ i64 ┆ i64 ┆ i64 │
    ╞═════╪═════╪═════╡
    │ 7   ┆ 8   ┆ 9   │
    │ 7   ┆ 18  ┆ 9   │
    │ 10  ┆ 11  ┆ 12  │
    │ 10  ┆ 11  ┆ 12  │
    └─────┴─────┴─────┘
    
    any_horizontal:  shape: (6, 3)
    ┌─────┬─────┬─────┐
    │ A   ┆ B   ┆ C   │
    │ --- ┆ --- ┆ --- │
    │ i64 ┆ i64 ┆ i64 │
    ╞═════╪═════╪═════╡
    │ 3   ┆ 5   ┆ 6   │
    │ 4   ┆ 5   ┆ 6   │
    │ 7   ┆ 8   ┆ 9   │
    │ 7   ┆ 18  ┆ 9   │
    │ 10  ┆ 11  ┆ 12  │
    │ 10  ┆ 11  ┆ 12  │
    └─────┴─────┴─────┘