I have a polars
dataframe like such:
print(
pl.DataFrame(
{
"file": ["A", "A", "A", "B", "B", "B"],
"user": ["u1", "u2", "u3", "u1", "u2", "u3"],
"data1": [1, 2, 3, 4, 5, 6],
"data2": [7, 8, 9, 10, 11, 12],
"data3": [13, 14, 15, 16, 17, 18],
}
)
)
shape: (6, 5)
┌──────┬──────┬───────┬───────┬───────┐
│ file ┆ user ┆ data1 ┆ data2 ┆ data3 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ i64 ┆ i64 │
╞══════╪══════╪═══════╪═══════╪═══════╡
│ A ┆ u1 ┆ 1 ┆ 7 ┆ 13 │
│ A ┆ u2 ┆ 2 ┆ 8 ┆ 14 │
│ A ┆ u3 ┆ 3 ┆ 9 ┆ 15 │
│ B ┆ u1 ┆ 4 ┆ 10 ┆ 16 │
│ B ┆ u2 ┆ 5 ┆ 11 ┆ 17 │
│ B ┆ u3 ┆ 6 ┆ 12 ┆ 18 │
└──────┴──────┴───────┴───────┴───────┘
And I would like to turn it into a bested dictionary like this
{
'file': {
'A': {
'user': {
'u1': {'data1': 1, 'data2': 7, 'data3': 13},
'u2': {'data1': 2, 'data2': 8, 'data3': 14},
'u3': {'data1': 3, 'data2': 9, 'data3': 15}
}
},
'B': {
'user': {
'u1': {'data1': 4, 'data2': 10, 'data3': 16},
'u2': {'data1': 5, 'data2': 11, 'data3': 17},
'u3': {'data1': 6, 'data2': 12, 'data3': 18}
}
}
}
}
I tried my luck with grouping and selecting structs, but I'm unable to do the conversion using only polars
functions. What's the right approach here?
EDIT:
I tried two things:
rows_by_key
:df.rows_by_key(key=["file", "user"], unique=True, named=True)
but that gives the keys as tuples, i.e.,
{('A', 'u1'): {'data1': 1, 'data2': 7, 'data3': 13},
('A', 'u2'): {'data1': 2, 'data2': 8, 'data3': 14},
('A', 'u3'): {'data1': 3, 'data2': 9, 'data3': 15},
('B', 'u1'): {'data1': 4, 'data2': 10, 'data3': 16},
('B', 'u2'): {'data1': 5, 'data2': 11, 'data3': 17},
('B', 'u3'): {'data1': 6, 'data2': 12, 'data3': 18}}
d = {}
for file, df_file in (df).group_by("file", maintain_order=True):
d_user = {}
for user, df_file_user in df_file.group_by("user", maintain_order=True):
d_user[user[0]] = df_file_user.drop("file", "user").to_dicts()[0]
d[file[0]] = {"user": d_user}
print({"file": d})
This yields the right output:
{
'file': {
'A': {
'user': {
'u1': {'data1': 1, 'data2': 7, 'data3': 13},
'u2': {'data1': 2, 'data2': 8, 'data3': 14},
'u3': {'data1': 3, 'data2': 9, 'data3': 15}
}
},
'B': {
'user': {
'u1': {'data1': 4, 'data2': 10, 'data3': 16},
'u2': {'data1': 5, 'data2': 11, 'data3': 17},
'u3': {'data1': 6, 'data2': 12, 'data3': 18}
}
}
}
}
But this feels cumbersome and I wonder whether there's a more direct way of doing this.
Starting with what you already had:
import polars as pl
df = pl.DataFrame(
{
"file": ["A", "A", "A", "B", "B", "B"],
"user": ["u1", "u2", "u3", "u1", "u2", "u3"],
"data1": [1, 2, 3, 4, 5, 6],
"data2": [7, 8, 9, 10, 11, 12],
"data3": [13, 14, 15, 16, 17, 18],
}
)
rows = df.rows_by_key(key=["file", "user"], unique=True, named=True)
result = {
"file": {
file: {
user: values
for (f, user), values in rows.items()
if f == file
}
for file in dict.fromkeys(f for f, _ in rows)
}
}
This avoids the nested loop you have, but still loops twice in the comprehension, after having constructed rows
like you did.
It would perhaps be preferable to do it all in one go like this instead:
result = {
"file": {
file: {
row["user"]: {k: row[k] for k in row if k not in ("file", "user")}
for row in group.to_dicts()
}
for (file,), group in df.partition_by(["file"], as_dict=True).items()
}
}
The result (for both):
from pprint import pprint
pprint(result)
Output:
{'file': {'A': {'u1': {'data1': 1, 'data2': 7, 'data3': 13},
'u2': {'data1': 2, 'data2': 8, 'data3': 14},
'u3': {'data1': 3, 'data2': 9, 'data3': 15}},
'B': {'u1': {'data1': 4, 'data2': 10, 'data3': 16},
'u2': {'data1': 5, 'data2': 11, 'data3': 17},
'u3': {'data1': 6, 'data2': 12, 'data3': 18}}}}
Perhaps a more appropriate question would be: why do you need it in the format you proposed? There may be a better way to achieve your ultimate goal from the original data, if you have no control over how the original data is structured in the first place.