I have a DataFrame containing string representations of JSON:
import polars as pl

df = pl.DataFrame({
    "json": [
        '{"x":[0,1,2,3], "y":[10,20,30,40]}',
        '{"x":[0,1,2,3], "y":[10,20,30,40]}',
        '{"x":[0,1,2,3], "y":[10,20,30,40]}'
    ]
})
shape: (3, 1)
┌───────────────────────────────────┐
│ json │
│ --- │
│ str │
╞═══════════════════════════════════╡
│ {"x":[0,1,2,3], "y":[10,20,30,40… │
│ {"x":[0,1,2,3], "y":[10,20,30,40… │
│ {"x":[0,1,2,3], "y":[10,20,30,40… │
└───────────────────────────────────┘
Now I would like to calculate the average of y where x > 0 and x < 3, for each row.
This is my current working solution:
First evaluate each string into a dict, build a DataFrame from it, and filter that by x.
import ast

df = df.with_columns(
    pl.col('json').map_elements(
        lambda s: pl.DataFrame(ast.literal_eval(s))
            .filter((pl.col('x') > 0) & (pl.col('x') < 3))['y']
            .mean()
    )
)
shape: (3, 1)
┌──────┐
│ json │
│ --- │
│ f64 │
╞══════╡
│ 25.0 │
│ 25.0 │
│ 25.0 │
└──────┘
This works fine, but for large datasets the row-wise map_elements call slows the process down significantly.
Is there a more elegant and faster way of doing it?
To parse JSON strings in Polars you can use .str.json_decode() (i.e. the equivalent of json.loads).
df.with_columns(pl.col("json").str.json_decode())
shape: (3, 1)
┌──────────────────────────────┐
│ json │
│ --- │
│ struct[2] │
╞══════════════════════════════╡
│ {[0, 1, … 3],[10, 20, … 40]} │
│ {[0, 1, … 3],[10, 20, … 40]} │
│ {[0, 1, … 3],[10, 20, … 40]} │
└──────────────────────────────┘
The "JSON object" becomes a Polars struct which you can .unnest
into separate columns.
As the lists are of the same length, you can .explode()
them both at the same time.
(df.with_columns(pl.col("json").str.json_decode())
.unnest("json")
.explode("x", "y")
)
shape: (12, 2)
┌─────┬─────┐
│ x ┆ y │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 0 ┆ 10 │
│ 1 ┆ 20 │
│ 2 ┆ 30 │
│ 3 ┆ 40 │
│ … ┆ … │
│ 0 ┆ 10 │
│ 1 ┆ 20 │
│ 2 ┆ 30 │
│ 3 ┆ 40 │
└─────┴─────┘
If we add a row count before exploding, we can use it in a .group_by
to rebuild each "row" after filtering.
(df.with_row_count()
   .with_columns(pl.col("json").str.json_decode())
   .unnest("json")
   .explode("x", "y")
   .filter(pl.col("x").is_between(1, 2))
   .group_by("row_nr", maintain_order=True)  # group_by does not guarantee order otherwise
   .agg(pl.col("y").mean())
)
shape: (3, 2)
┌────────┬──────┐
│ row_nr ┆ y │
│ --- ┆ --- │
│ u32 ┆ f64 │
╞════════╪══════╡
│ 0 ┆ 25.0 │
│ 1 ┆ 25.0 │
│ 2 ┆ 25.0 │
└────────┴──────┘
As well as unnesting/exploding into columns/rows, Polars also has the .list
and .struct
namespaces.
(df.with_columns(parsed = pl.col("json").str.json_decode())
.with_columns(mean =
pl.col("parsed").struct["y"].list.gather(
pl.col("parsed").struct["x"].list.eval(
pl.element().is_between(1, 2).arg_true()
)
)
.list.mean()
)
)
shape: (3, 3)
┌───────────────────────────────────┬──────────────────────────────┬──────┐
│ json ┆ parsed ┆ mean │
│ --- ┆ --- ┆ --- │
│ str ┆ struct[2] ┆ f64 │
╞═══════════════════════════════════╪══════════════════════════════╪══════╡
│ {"x":[0,1,2,3], "y":[10,20,30,40… ┆ {[0, 1, … 3],[10, 20, … 40]} ┆ 25.0 │
│ {"x":[0,1,2,3], "y":[10,20,30,40… ┆ {[0, 1, … 3],[10, 20, … 40]} ┆ 25.0 │
│ {"x":[0,1,2,3], "y":[10,20,30,40… ┆ {[0, 1, … 3],[10, 20, … 40]} ┆ 25.0 │
└───────────────────────────────────┴──────────────────────────────┴──────┘