I'm trying to perform a JSON path match for each element (string) in a list. I'm observing the following behavior:
import polars as pl
data = [
'{"text":"asdf","entityList":[{"mybool":true,"id":1},{"mybool":true,"id":2},{"mybool":false,"id":3}]}',
'{"text":"asdf","entityList":[{"mybool":false,"id":1},{"mybool":true,"id":2},{"mybool":false,"id":3}]}',
]
df = pl.DataFrame({"data": [data]})
print(df)
# shape: (1, 1)
# ┌─────────────────────────────────┐
# │ data │
# │ --- │
# │ list[str] │
# ╞═════════════════════════════════╡
# │ ["{"text":"asdf","entityList":… │
# └─────────────────────────────────┘
expr1 = pl.col("data").list.eval(pl.element().str.json_path_match("$.entityList[*].id"))
print(df.select(expr1))
# shape: (1, 1)
# ┌────────────┐
# │ data │
# │ --- │
# │ list[str] │
# ╞════════════╡
# │ ["1", "1"] │
# └────────────┘
expr2 = pl.col("data").list.eval(pl.element().str.json_path_match("$.entityList[*].id").flatten())
print(df.select(expr2))
# shape: (1, 1)
# ┌────────────┐
# │ data │
# │ --- │
# │ list[str] │
# ╞════════════╡
# │ ["1", "1"] │
# └────────────┘
My understanding of JSON path is that $.entityList[*].id should extract the id of every element in entityList, so I'd expect the following result:
shape: (1, 1)
┌────────────────────────┐
│ data │
│ --- │
│ list[list[i64]] │
╞════════════════════════╡
│ [[1, 2, 3], [1, 2, 3]] │
└────────────────────────┘
Am I misunderstanding how json_path_match operates on list elements, or could this be a bug in how the nested lists are created?
As outlined in the comments, pl.Expr.str.json_path_match currently only extracts the first match.
Still, the expected result can be obtained by decoding the entire JSON string and making suitable selections in the decoded struct.
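To make the decode-then-select idea concrete independently of Polars, here is a minimal plain-Python sketch using only the standard library json module. The variable names ids and entity are illustrative, and the data list is copied from the question.

```python
import json

data = [
    '{"text":"asdf","entityList":[{"mybool":true,"id":1},{"mybool":true,"id":2},{"mybool":false,"id":3}]}',
    '{"text":"asdf","entityList":[{"mybool":false,"id":1},{"mybool":true,"id":2},{"mybool":false,"id":3}]}',
]

# Decode each JSON string, then collect the "id" of every entry in "entityList".
ids = [
    [entity["id"] for entity in json.loads(s)["entityList"]]
    for s in data
]
print(ids)  # [[1, 2, 3], [1, 2, 3]]
```

This is the nested list the question expects, which is the same shape the Polars approach should produce.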
In the following, the three pl.DataFrame.with_columns blocks
1. decode each JSON string (using pl.Expr.str.json_decode),
2. select the entityList field in each struct (using pl.Expr.struct.field),
3. select the id field in each struct in each of the inner lists.
(
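    df
    # 1. decode each JSON string in the list into a struct
    .with_columns(
        pl.col("data").list.eval(
            pl.element().str.json_decode()
        )
    )
    # 2. keep only the entityList field of each struct
    .with_columns(
        pl.col("data").list.eval(
            pl.element().struct.field("entityList")
        )
    )
    # 3. keep only the id field of each struct in each inner list
    .with_columns(
        pl.col("data").list.eval(
            pl.element().list.eval(
                pl.element().struct.field("id")
            )
        )
    )
)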
shape: (1, 1)
┌────────────────────────┐
│ data │
│ --- │
│ list[list[i64]] │
╞════════════════════════╡
│ [[1, 2, 3], [1, 2, 3]] │
└────────────────────────┘