I need to convert each element in a polars df into the following structure:
{
    "value": "A",
    "lineItemName": "value",
    "dimensions": [
        {
            "itemCode": 1,
            "dimensionName": "Clients"
        }
    ]
}
where value corresponds to the value of that element, lineItemName to the column name, itemCode to the value held in the key column in the row of that element, and dimensionName is a given literal.
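As a plain-Python reference for the target shape (not a polars solution), the mapping for a single cell can be sketched roughly as below. The helper name build_line_item and its parameter names are hypothetical and exist only for illustration.

```python
import json


def build_line_item(value, column_name, key, dimension_name="Clients"):
    """Return the target structure for one cell (hypothetical helper)."""
    return {
        "value": value,               # the element itself
        "lineItemName": column_name,  # the name of the column it came from
        "dimensions": [
            # always exactly one entry: the key of the row plus a literal
            {"itemCode": key, "dimensionName": dimension_name}
        ],
    }


item = build_line_item("A", "value", 1)
print(json.dumps(item, indent=2))
```

The goal of the question is to produce this same nesting as a polars struct column, without building it row by row in Python.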
For example:
import polars as pl
df = pl.DataFrame({"key": [1, 2, 3, 4, 5], "value": ["A", "B", "C", "D", "E"]})
Should result in:
shape: (5, 1)
╭─────────────────────────╮
│ value                   │
│ ---                     │
│ struct[3]               │
╞═════════════════════════╡
│ {"A","value",[{1,"D"}]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"B","value",[{2,"D"}]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"C","value",[{3,"D"}]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"D","value",[{4,"D"}]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"E","value",[{5,"D"}]} │
╰─────────────────────────╯
My current implementation:
df = df.with_columns(
    pl.struct(
        pl.col(col).alias("value"),
        pl.lit(col).alias("lineItemName"),
        pl.concat_list(
            pl.struct(pl.col("key").alias("itemCode"), pl.lit("D").alias("dimensionName"))
        ).alias("dimensions"),
    ).alias(col)
    for col in df.columns
    if col != "key"
).drop("key")
My issue is with the pl.concat_list() expression. The list holding the dimension struct is, in my case, guaranteed to always hold exactly one element. That is why I am looking for a way to avoid the significant (and in my case unnecessary) performance hit of pl.concat_list().
Ideally, I'd be able to just:
pl.lit(
    [pl.struct(pl.col("key").alias("itemCode"), pl.lit("D").alias("dimensionName"))]
).alias("dimensions")
but, for the time being, this raises TypeError: not yet implemented: Nested object types.
I have tried variations of the above, but I cannot seem to avoid running into the nested expression at some point. Is there any way I can cleanly instantiate this single-element list or, better yet, an Array?
You could try .reshape((-1, 1)) like this:
df.select(
    pl.struct(
        pl.col(col).alias("value"),
        pl.lit(col).alias("lineItemName"),
        pl.struct(pl.col("key").alias("itemCode"), pl.lit("D").alias("dimensionName"))
        .reshape((-1, 1))
        .alias("dimensions"),
    ).alias(col)
    for col in df.columns
    if col != "key"
)
If you want to keep it as a list (for whatever reason), you can chain it as .reshape((-1, 1)).arr.to_list()
There shouldn't be any appreciable overhead for an Array, because it only needs to store metadata about the width; the underlying data doesn't need to change, move, or be copied.