pythonpython-polars

Turn polars dataframe into nested dictionary


I have a polars dataframe like such:

print(
    pl.DataFrame(
        {
            "file": ["A", "A", "A", "B", "B", "B"],
            "user": ["u1", "u2", "u3", "u1", "u2", "u3"],
            "data1": [1, 2, 3, 4, 5, 6],
            "data2": [7, 8, 9, 10, 11, 12],
            "data3": [13, 14, 15, 16, 17, 18],
        }
    )
)



shape: (6, 5)
┌──────┬──────┬───────┬───────┬───────┐
│ file ┆ user ┆ data1 ┆ data2 ┆ data3 │
│ ---  ┆ ---  ┆ ---   ┆ ---   ┆ ---   │
│ str  ┆ str  ┆ i64   ┆ i64   ┆ i64   │
╞══════╪══════╪═══════╪═══════╪═══════╡
│ A    ┆ u1   ┆ 1     ┆ 7     ┆ 13    │
│ A    ┆ u2   ┆ 2     ┆ 8     ┆ 14    │
│ A    ┆ u3   ┆ 3     ┆ 9     ┆ 15    │
│ B    ┆ u1   ┆ 4     ┆ 10    ┆ 16    │
│ B    ┆ u2   ┆ 5     ┆ 11    ┆ 17    │
│ B    ┆ u3   ┆ 6     ┆ 12    ┆ 18    │
└──────┴──────┴───────┴───────┴───────┘

And I would like to turn it into a bested dictionary like this

{
    'file': {
        'A': {
            'user': {
                'u1': {'data1': 1, 'data2': 7, 'data3': 13},
                'u2': {'data1': 2, 'data2': 8, 'data3': 14},
                'u3': {'data1': 3, 'data2': 9, 'data3': 15}
            }
        },
        'B': {
            'user': {
                'u1': {'data1': 4, 'data2': 10, 'data3': 16},
                'u2': {'data1': 5, 'data2': 11, 'data3': 17},
                'u3': {'data1': 6, 'data2': 12, 'data3': 18}
            }
        }
    }
}

I tried my luck with grouping and selecting structs, but I'm unable to do the conversion using only polars functions. What's the right approach here?

EDIT:

I tried two things:

  1. Using rows_by_key:
df.rows_by_key(key=["file", "user"], unique=True, named=True)

but that gives the keys as tuples, i.e.,

{('A', 'u1'): {'data1': 1, 'data2': 7, 'data3': 13},
 ('A', 'u2'): {'data1': 2, 'data2': 8, 'data3': 14},
 ('A', 'u3'): {'data1': 3, 'data2': 9, 'data3': 15},
 ('B', 'u1'): {'data1': 4, 'data2': 10, 'data3': 16},
 ('B', 'u2'): {'data1': 5, 'data2': 11, 'data3': 17},
 ('B', 'u3'): {'data1': 6, 'data2': 12, 'data3': 18}}
  1. I also tried to manually loop over the groups:
d = {}
for file, df_file in (df).group_by("file", maintain_order=True):
    d_user = {}
    for user, df_file_user in df_file.group_by("user", maintain_order=True):
        d_user[user[0]] = df_file_user.drop("file", "user").to_dicts()[0]
    d[file[0]] = {"user": d_user}
print({"file": d})

This yields the right output:

{
    'file': {
        'A': {
            'user': {
                'u1': {'data1': 1, 'data2': 7, 'data3': 13},
                'u2': {'data1': 2, 'data2': 8, 'data3': 14},
                'u3': {'data1': 3, 'data2': 9, 'data3': 15}
            }
        },
        'B': {
            'user': {
                'u1': {'data1': 4, 'data2': 10, 'data3': 16},
                'u2': {'data1': 5, 'data2': 11, 'data3': 17},
                'u3': {'data1': 6, 'data2': 12, 'data3': 18}
            }
        }
    }
}

But this feels cumbersome and I wonder whether there's a more direct way of doing this.


Solution

  • Starting with what you already had:

    import polars as pl
    
    df = pl.DataFrame(
        {
            "file": ["A", "A", "A", "B", "B", "B"],
            "user": ["u1", "u2", "u3", "u1", "u2", "u3"],
            "data1": [1, 2, 3, 4, 5, 6],
            "data2": [7, 8, 9, 10, 11, 12],
            "data3": [13, 14, 15, 16, 17, 18],
        }
    )
    
    rows = df.rows_by_key(key=["file", "user"], unique=True, named=True)
    
    result = {
        "file": {
            file: {
                user: values
                for (f, user), values in rows.items()
                if f == file
            }
            for file in dict.fromkeys(f for f, _ in rows)
        }
    }
    

    This avoids the nested loop you have, but still loops twice in the comprehension, after having constructed rows like you did.

    It would perhaps be preferable to do it all in one go like this instead:

    result = {
        "file": {
            file: {
                row["user"]: {k: row[k] for k in row if k not in ("file", "user")}
                for row in group.to_dicts()
            }
            for (file,), group in df.partition_by(["file"], as_dict=True).items()
        }
    }
    
    

    The result (for both):

    from pprint import pprint
    pprint(result)
    

    Output:

    {'file': {'A': {'u1': {'data1': 1, 'data2': 7, 'data3': 13},
                    'u2': {'data1': 2, 'data2': 8, 'data3': 14},
                    'u3': {'data1': 3, 'data2': 9, 'data3': 15}},
              'B': {'u1': {'data1': 4, 'data2': 10, 'data3': 16},
                    'u2': {'data1': 5, 'data2': 11, 'data3': 17},
                    'u3': {'data1': 6, 'data2': 12, 'data3': 18}}}}
    

    Perhaps a more appropriate question would be: why do you need it in the format you proposed? There may be a better way to achieve your ultimate goal from the original data, if you have no control over how the original data is structured in the first place.