python, python-polars

How to instantiate a single element Array/List in Polars expressions efficiently?


I need to convert each element in a Polars DataFrame into the following structure:

{
    "value": "A",
    "lineItemName": "value",
    "dimensions": [
        {
            "itemCode": 1,
            "dimensionName": "Clients"
        }
    ]
}

where value is the value of that element, lineItemName is the name of its column, itemCode is the value of the key column in the same row, and dimensionName is a given literal.

For example,

import polars as pl

df = pl.DataFrame({"key": [1, 2, 3, 4, 5], "value": ["A", "B", "C", "D", "E"]})

should result in:

shape: (5, 1)
╭─────────────────────────╮
│ value                   │
│ ---                     │
│ struct[3]               │
╞═════════════════════════╡
│ {"A","value",[{1,"D"}]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"B","value",[{2,"D"}]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"C","value",[{3,"D"}]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"D","value",[{4,"D"}]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"E","value",[{5,"D"}]} │
╰─────────────────────────╯
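
To pin down the target structure independently of Polars, here is a minimal plain-Python sketch of the same mapping. It is only a reference for the expected output, not a performant solution; the helper name `to_records` and its parameters are hypothetical and chosen for illustration.

```python
def to_records(columns, key_column="key", dimension_name="D"):
    """Build the nested structure for every non-key column (hypothetical helper)."""
    keys = columns[key_column]
    result = {}
    for name, values in columns.items():
        if name == key_column:
            continue  # the key column only feeds itemCode, it is not converted itself
        result[name] = [
            {
                "value": value,
                "lineItemName": name,
                # always a single-element list holding the dimension struct
                "dimensions": [{"itemCode": key, "dimensionName": dimension_name}],
            }
            for key, value in zip(keys, values)
        ]
    return result


data = {"key": [1, 2, 3, 4, 5], "value": ["A", "B", "C", "D", "E"]}
records = to_records(data)
print(records["value"][0])
```

Each entry of `records["value"]` corresponds to one row of the struct column shown above.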


My current implementation:

df = df.with_columns(
    pl.struct(
        pl.col(col).alias("value"),
        pl.lit(col).alias("lineItemName"),
        pl.concat_list(
            pl.struct(pl.col("key").alias("itemCode"), pl.lit("D").alias("dimensionName"))
        ).alias("dimensions"),
    ).alias(col)
    for col in df.columns
    if col != "key"
).drop("key")

My issue is with the pl.concat_list() expression. In my case, the list holding the dimension struct is guaranteed to hold exactly one element. That is why I am looking for a way to avoid the significant (and in my case unnecessary) performance hit of pl.concat_list().

Ideally, I'd be able to just:

pl.lit(
    [pl.struct(pl.col("key").alias("itemCode"), pl.lit("D").alias("dimensionName"))]
).alias("dimensions")

but this currently raises TypeError: not yet implemented: Nested object types.

I have tried variations of the above, but I cannot seem to avoid running into the nested expression at some point. Is there any way I can cleanly instantiate this single element list or better yet Array?


Solution

  • You could try .reshape((1, 1)), like this:

    df.select(
        pl.struct(
            pl.col(col).alias("value"),
            pl.lit(col).alias("lineItemName"),
            pl.struct(pl.col("key").alias("itemCode"), pl.lit("D").alias("dimensionName"))
            .reshape((1, 1)).alias("dimensions"),
        ).alias(col)
        for col in df.columns
        if col != "key"
    )
    

    If you want to keep it as a list (for whatever reason), you can chain it as .reshape((1, 1)).arr.to_list()

    There shouldn't be any appreciable overhead for an Array, because Polars only needs to store the width as metadata; the underlying data doesn't need to change, move, or be copied.