pythondataframepython-polars

In python polars filter and aggregate dict of lists


I have got a dataframe with string representation of json:

df = pl.DataFrame({ 
        "json": [
            '{"x":[0,1,2,3], "y":[10,20,30,40]}',
            '{"x":[0,1,2,3], "y":[10,20,30,40]}',
            '{"x":[0,1,2,3], "y":[10,20,30,40]}'
        ] 
    })
shape: (3, 1)
┌───────────────────────────────────┐
│ json                              │
│ ---                               │
│ str                               │
╞═══════════════════════════════════╡
│ {"x":[0,1,2,3], "y":[10,20,30,40… │
│ {"x":[0,1,2,3], "y":[10,20,30,40… │
│ {"x":[0,1,2,3], "y":[10,20,30,40… │
└───────────────────────────────────┘

Now I would like to calculate the average for y where x > 0 and x < 3 for each row.


This is my current working solution:

First evaluate the string -> dict and then create a dataframe, which is filtered by x.

# import ast
df = df.with_columns(
    pl.col('json').map_elements(lambda x: pl.DataFrame(ast.literal_eval(x)).filter((pl.col('x') < 3) & (pl.col('x') > 0))['y'].mean())
)
shape: (3, 1)
┌──────┐
│ json │
│ ---  │
│ f64  │
╞══════╡
│ 25.0 │
│ 25.0 │
│ 25.0 │
└──────┘

This works fine, but for large datasets the apply functions is slowing down the process significantly.

Is there a more elegant and faster way of doing it?


Solution

  • json.loads()

    To parse JSON strings in Polars you can use .str.json_decode() (i.e. the equivalent of json.loads)

    df.with_columns(pl.col("json").str.json_decode())
    
    shape: (3, 1)
    ┌──────────────────────────────┐
    │ json                         │
    │ ---                          │
    │ struct[2]                    │
    ╞══════════════════════════════╡
    │ {[0, 1, … 3],[10, 20, … 40]} │
    │ {[0, 1, … 3],[10, 20, … 40]} │
    │ {[0, 1, … 3],[10, 20, … 40]} │
    └──────────────────────────────┘
    

    unnest()

    The "JSON object" becomes a Polars struct which you can .unnest into separate columns.

    As the lists are of the same length, you can .explode() them both at the same time.

    (df.with_columns(pl.col("json").str.json_decode())
       .unnest("json")
       .explode("x", "y")
    )
    
    shape: (12, 2)
    ┌─────┬─────┐
    │ x   ┆ y   │
    │ --- ┆ --- │
    │ i64 ┆ i64 │
    ╞═════╪═════╡
    │ 0   ┆ 10  │
    │ 1   ┆ 20  │
    │ 2   ┆ 30  │
    │ 3   ┆ 40  │
    │ …   ┆ …   │
    │ 0   ┆ 10  │
    │ 1   ┆ 20  │
    │ 2   ┆ 30  │
    │ 3   ┆ 40  │
    └─────┴─────┘
    

    If we add a row count before exploding, we can use it in a .group_by to rebuild each "row" after filtering.

    (df.with_row_count()
       .with_columns(pl.col("json").str.json_decode())
       .unnest("json")
       .explode("x", "y")
       .filter(pl.col("x").is_between(1, 2))
       .group_by("row_nr")
       .agg(pl.col("y").mean())
    )
    
    shape: (3, 2)
    ┌────────┬──────┐
    │ row_nr ┆ y    │
    │ ---    ┆ ---  │
    │ u32    ┆ f64  │
    ╞════════╪══════╡
    │ 0      ┆ 25.0 │
    │ 1      ┆ 25.0 │
    │ 2      ┆ 25.0 │
    └────────┴──────┘
    

    List/Struct Namespaces

    As well as unnesting/exploding into columns/rows, Polars also has the .list and .struct namespaces.

    (df.with_columns(parsed = pl.col("json").str.json_decode())
       .with_columns(mean = 
          pl.col("parsed").struct["y"].list.gather(
             pl.col("parsed").struct["x"].list.eval(
                pl.element().is_between(1, 2).arg_true()
             )
          )
          .list.mean()
       )
    )
    
    shape: (3, 3)
    ┌───────────────────────────────────┬──────────────────────────────┬──────┐
    │ json                              ┆ parsed                       ┆ mean │
    │ ---                               ┆ ---                          ┆ ---  │
    │ str                               ┆ struct[2]                    ┆ f64  │
    ╞═══════════════════════════════════╪══════════════════════════════╪══════╡
    │ {"x":[0,1,2,3], "y":[10,20,30,40… ┆ {[0, 1, … 3],[10, 20, … 40]} ┆ 25.0 │
    │ {"x":[0,1,2,3], "y":[10,20,30,40… ┆ {[0, 1, … 3],[10, 20, … 40]} ┆ 25.0 │
    │ {"x":[0,1,2,3], "y":[10,20,30,40… ┆ {[0, 1, … 3],[10, 20, … 40]} ┆ 25.0 │
    └───────────────────────────────────┴──────────────────────────────┴──────┘