pythondataframepython-polarsisin

Pandas isin function in polars


Once in a while I get to the point where I need to run the following line:

DF['is_flagged'] = DF['id'].isin(DF2[DF2['flag']==1]['id'])

Lately I started using polars, and I wonder how to convert it easily to polars.

For example:

df1 = pl.DataFrame({
'Animal_id': [1, 2, 3, 4, 5, 6, 7],     
'age': [4, 6, 3, 8, 3, 8, 9] })

df2 = pl.DataFrame({     
'Animal_id': [1, 2, 3, 4, 5, 6, 7],     
'Animal_type': ['cat', 'dog', 'cat', 'cat', 'dog', 'dog', 'cat'] })

Expected output:

shape: (7, 3)
┌───────────┬─────┬────────┐
│ animal_id ┆ age ┆ is_dog │
│ ---       ┆ --- ┆ ---    │
│ i64       ┆ i64 ┆ i64    │
╞═══════════╪═════╪════════╡
│ 1         ┆ 4   ┆ 0      │
│ 2         ┆ 6   ┆ 1      │
│ 3         ┆ 3   ┆ 0      │
│ 4         ┆ 8   ┆ 0      │
│ 5         ┆ 3   ┆ 1      │
│ 6         ┆ 8   ┆ 1      │
│ 7         ┆ 9   ┆ 0      │
└───────────┴─────┴────────┘

Without using flag and then join

I tried to use the is_in() function but this didn’t worked.


Solution

  • So to be honest I am not quite sure from your question, your pandas snippet and your example what your desired solution is, but here are my three takes.

    import polars as pl
    
    df1 = pl.DataFrame(
        {"Animal_id": [1, 2, 3, 4, 5, 6, 7], "age": [4, 6, 3, 8, 3, 8, 9]}
    ).lazy()
    
    df2 = pl.DataFrame(
        {
            "Animal_id": [1, 2, 3, 4, 5, 6, 7],
            "Animal_type": ["cat", "dog", "cat", "cat", "dog", "dog", "cat"],
        }
    ).lazy()
    

    1 Solution

    So this one is a small adaptation of the solution of @ignoring_gravity. So the assumption in his solution is that the DataFrames have the same length and the Animal_id matches in both tables. If that was your goal I want to give you another solution because by subsetting (["is_dog"]) you lose the possibility to use the lazy api.

    df1.with_context(
        df2.select(pl.col("Animal_type").is_in(["dog"]).cast(int).alias("is_dog"))
    ).select(["Animal_id", "age", "is_dog"]).collect()
    

    2 Solution

    So in case you want something more similar to your pandas snippet and because you wrote you don't want to have a join.

    df1.with_context(
        df2.filter(pl.col("Animal_type") == "dog").select(
            pl.col("Animal_id").alias("valid_ids")
        )
    ).with_columns(
        [pl.col("Animal_id").is_in(pl.col("valid_ids")).cast(int).alias("is_dog")]
    ).collect()
    

    3 Solution

    So this would be the solution with a join. In my opinion the best solution regarding your example and example output, but maybe there are other reasons that speak against a join, which aren't apparent from your example.

    df1.join(
        df2.select(
            ["Animal_id", pl.col("Animal_type").is_in(["dog"]).cast(int).alias("is_dog")]
        ),
        on=["Animal_id"],
    ).collect()