pythonpython-polars

Filter a polars dataframe based on JSON in string column


I have a Polars dataframe like

df = pl.DataFrame({
    "tags": ['{"ref":"@1", "area": "livingroom", "type": "elec"}', '{"ref":"@2", "area": "kitchen"}', '{"ref":"@3", "type": "elec"}'],
     "name": ["a", "b", "c"],
})    
┌────────────────────────────────────────────────────┬──────┐
│ tags                                               ┆ name │
│ ---                                                ┆ ---  │
│ str                                                ┆ str  │
╞════════════════════════════════════════════════════╪══════╡
│ {"ref":"@1", "area": "livingroom", "type": "elec"} ┆ a    │
│ {"ref":"@2", "area": "kitchen"}                    ┆ b    │
│ {"ref":"@3", "type": "elec"}                       ┆ c    │
└────────────────────────────────────────────────────┴──────┘

What I would to do is create a filter function that filters dataframe based on the tags column. Particularly, I would like to only be left with rows where the tags column has an area key and a type key that has a value "elec".

How can I achieve this (ideally using the native expressions API)?


Solution

  • pl.Expr.str.json_path_match can be used to extract the first match of the JSON string with the a suitable path expression.

    (
        df
        .filter(
            pl.col("tags").str.json_path_match("$.area").is_not_null(),
            pl.col("tags").str.json_path_match("$.type") == "elec",
        )
    )
    
    shape: (1, 2)
    ┌────────────────────────────────────────────────────┬──────┐
    │ tags                                               ┆ name │
    │ ---                                                ┆ ---  │
    │ str                                                ┆ str  │
    ╞════════════════════════════════════════════════════╪══════╡
    │ {"ref":"@1", "area": "livingroom", "type": "elec"} ┆ a    │
    └────────────────────────────────────────────────────┴──────┘
    

    More generally, pl.Expr.str.json_decode ca be used to obtain a struct column with the information of the JSON. This struct can be unnested and used for any downstream filtering operation.

    (
        df
        .with_columns(
            pl.col("tags").str.json_decode()
        )
        .unnest("tags")
    )
    
    shape: (3, 4)
    ┌─────┬────────────┬──────┬──────┐
    │ ref ┆ area       ┆ type ┆ name │
    │ --- ┆ ---        ┆ ---  ┆ ---  │
    │ str ┆ str        ┆ str  ┆ str  │
    ╞═════╪════════════╪══════╪══════╡
    │ @1  ┆ livingroom ┆ elec ┆ a    │
    │ @2  ┆ kitchen    ┆ null ┆ b    │
    │ @3  ┆ null       ┆ elec ┆ c    │
    └─────┴────────────┴──────┴──────┘