I have a Polars dataframe like
df = pl.DataFrame({
"tags": ['{"ref":"@1", "area": "livingroom", "type": "elec"}', '{"ref":"@2", "area": "kitchen"}', '{"ref":"@3", "type": "elec"}'],
"name": ["a", "b", "c"],
})
┌────────────────────────────────────────────────────┬──────┐
│ tags ┆ name │
│ --- ┆ --- │
│ str ┆ str │
╞════════════════════════════════════════════════════╪══════╡
│ {"ref":"@1", "area": "livingroom", "type": "elec"} ┆ a │
│ {"ref":"@2", "area": "kitchen"} ┆ b │
│ {"ref":"@3", "type": "elec"} ┆ c │
└────────────────────────────────────────────────────┴──────┘
What I would to do is create a filter function that filters dataframe based on the tags column. Particularly, I would like to only be left with rows where the tags
column has an area
key and a type
key that has a value "elec"
.
How can I achieve this (ideally using the native expressions API)?
pl.Expr.str.json_path_match
can be used to extract the first match of the JSON string with the a suitable path expression.
(
df
.filter(
pl.col("tags").str.json_path_match("$.area").is_not_null(),
pl.col("tags").str.json_path_match("$.type") == "elec",
)
)
shape: (1, 2)
┌────────────────────────────────────────────────────┬──────┐
│ tags ┆ name │
│ --- ┆ --- │
│ str ┆ str │
╞════════════════════════════════════════════════════╪══════╡
│ {"ref":"@1", "area": "livingroom", "type": "elec"} ┆ a │
└────────────────────────────────────────────────────┴──────┘
More generally, pl.Expr.str.json_decode
ca be used to obtain a struct column with the information of the JSON. This struct can be unnested and used for any downstream filtering operation.
(
df
.with_columns(
pl.col("tags").str.json_decode()
)
.unnest("tags")
)
shape: (3, 4)
┌─────┬────────────┬──────┬──────┐
│ ref ┆ area ┆ type ┆ name │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str │
╞═════╪════════════╪══════╪══════╡
│ @1 ┆ livingroom ┆ elec ┆ a │
│ @2 ┆ kitchen ┆ null ┆ b │
│ @3 ┆ null ┆ elec ┆ c │
└─────┴────────────┴──────┴──────┘