Once in a while I get to the point where I need to run the following line:
DF['is_flagged'] = DF['id'].isin(DF2[DF2['flag']==1]['id'])
Lately I started using polars, and I wonder how to convert it easily to polars.
For example:
df1 = pl.DataFrame({
'Animal_id': [1, 2, 3, 4, 5, 6, 7],
'age': [4, 6, 3, 8, 3, 8, 9] })
df2 = pl.DataFrame({
'Animal_id': [1, 2, 3, 4, 5, 6, 7],
'Animal_type': ['cat', 'dog', 'cat', 'cat', 'dog', 'dog', 'cat'] })
Expected output:
shape: (7, 3)
┌───────────┬─────┬────────┐
│ animal_id ┆ age ┆ is_dog │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═══════════╪═════╪════════╡
│ 1 ┆ 4 ┆ 0 │
│ 2 ┆ 6 ┆ 1 │
│ 3 ┆ 3 ┆ 0 │
│ 4 ┆ 8 ┆ 0 │
│ 5 ┆ 3 ┆ 1 │
│ 6 ┆ 8 ┆ 1 │
│ 7 ┆ 9 ┆ 0 │
└───────────┴─────┴────────┘
Without using flag and then join
I tried to use the is_in()
function but this didn’t worked.
So to be honest I am not quite sure from your question, your pandas snippet and your example what your desired solution is, but here are my three takes.
import polars as pl
df1 = pl.DataFrame(
{"Animal_id": [1, 2, 3, 4, 5, 6, 7], "age": [4, 6, 3, 8, 3, 8, 9]}
).lazy()
df2 = pl.DataFrame(
{
"Animal_id": [1, 2, 3, 4, 5, 6, 7],
"Animal_type": ["cat", "dog", "cat", "cat", "dog", "dog", "cat"],
}
).lazy()
So this one is a small adaptation of the solution of @ignoring_gravity. So the assumption in his solution is that the DataFrames have the same length and the Animal_id matches in both tables. If that was your goal I want to give you another solution because by subsetting (["is_dog"]
) you lose the possibility to use the lazy api.
df1.with_context(
df2.select(pl.col("Animal_type").is_in(["dog"]).cast(int).alias("is_dog"))
).select(["Animal_id", "age", "is_dog"]).collect()
So in case you want something more similar to your pandas snippet and because you wrote you don't want to have a join.
df1.with_context(
df2.filter(pl.col("Animal_type") == "dog").select(
pl.col("Animal_id").alias("valid_ids")
)
).with_columns(
[pl.col("Animal_id").is_in(pl.col("valid_ids")).cast(int).alias("is_dog")]
).collect()
So this would be the solution with a join. In my opinion the best solution regarding your example and example output, but maybe there are other reasons that speak against a join, which aren't apparent from your example.
df1.join(
df2.select(
["Animal_id", pl.col("Animal_type").is_in(["dog"]).cast(int).alias("is_dog")]
),
on=["Animal_id"],
).collect()