dataframe python-polars jsonpath

polars: `json_path_match` on `pl.element()` in `list.eval()` context


I'm trying to perform a JSON path match for each element (string) in a list. I'm observing the following behavior:

import polars as pl
data = [
    '{"text":"asdf","entityList":[{"mybool":true,"id":1},{"mybool":true,"id":2},{"mybool":false,"id":3}]}',
    '{"text":"asdf","entityList":[{"mybool":false,"id":1},{"mybool":true,"id":2},{"mybool":false,"id":3}]}',
]
 
df = pl.DataFrame({"data": [data]})
print(df)
# shape: (1, 1)
# ┌─────────────────────────────────┐
# │ data                            │
# │ ---                             │
# │ list[str]                       │
# ╞═════════════════════════════════╡
# │ ["{"text":"asdf","entityList":… │
# └─────────────────────────────────┘
 
expr1 = pl.col("data").list.eval(pl.element().str.json_path_match("$.entityList[*].id"))
print(df.select(expr1))
# shape: (1, 1)
# ┌────────────┐
# │ data       │
# │ ---        │
# │ list[str]  │
# ╞════════════╡
# │ ["1", "1"] │
# └────────────┘
 
expr2 = pl.col("data").list.eval(pl.element().str.json_path_match("$.entityList[*].id").flatten())
print(df.select(expr2))
# shape: (1, 1)
# ┌────────────┐
# │ data       │
# │ ---        │
# │ list[str]  │
# ╞════════════╡
# │ ["1", "1"] │
# └────────────┘

My understanding of JSON path is that $.entityList[*].id should extract the id of every element in entityList, therefore I'd expect the following result:

shape: (1, 1)
┌────────────────────────┐
│ data                   │
│ ---                    │
│ list[list[i64]]        │
╞════════════════════════╡
│ [[1, 2, 3], [1, 2, 3]] │
└────────────────────────┘

Am I misunderstanding how json_path_match operates on list elements, or could this be a bug in how the nested lists are created?


Solution

  • As outlined in the comments, pl.Expr.str.json_path_match currently only extracts the first match.

    Still, the expected result can be obtained by decoding the entire JSON string and making suitable selections in the decoded struct.

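    In the following, the three pl.DataFrame.with_columns blocks decode each JSON string into a struct, select the entityList field, and extract the id of every entity, in that order.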

    (
        df
        .with_columns(
            pl.col("data").list.eval(
                pl.element().str.json_decode()
            )
        )
        .with_columns(
            pl.col("data").list.eval(
                pl.element().struct.field("entityList")
            )
        )
        .with_columns(
            pl.col("data").list.eval(
                pl.element().list.eval(
                    pl.element().struct.field("id")
                )
            )
        )
    )
    
    shape: (1, 1)
    ┌────────────────────────┐
    │ data                   │
    │ ---                    │
    │ list[list[i64]]        │
    ╞════════════════════════╡
    │ [[1, 2, 3], [1, 2, 3]] │
    └────────────────────────┘