python python-polars

Polars fails to create a new dataframe using with_columns() when creating new columns which contain a struct column


I'm new to polars and encountering a confusing error.

I'm trying to take several array columns and zip them into struct columns. When I try to do this with with_columns I encounter the error:

ValueError: can only call `.item()` if the dataframe is of shape (1, 1), or if explicit row/col values are provided; frame has shape (4, 2)

Here is code to reproduce this problem:

import polars as pl

df = pl.DataFrame(
    {
        "a": [[1, 2, 3, 4],[1, 2, 3, 4],[1, 2, 3, 4],[1, 2, 3, 4]],
        "b": [[1, 2, 3, 5],[1, 2, 3, 5],[1, 2, 3, 5],[1, 2, 3, 5]],
        "c": [[1, 2, 3, 4],[1, 2, 3, 4],[1, 2, 3, 4],[1, 2, 3, 4]],
        "d": ['a', 'b', 'c', 'd']
    }
)
df.with_columns([
    (df.explode('a', 'b')
    .select(
        "a",
        "b",
        "d",
        pl.struct('a', 'b').alias("test_1"))
    .group_by("d")
    .agg("test_1")),
    (df.explode('b', 'c')
    .select(
        "c",
        "b",
        "d",
        pl.struct('b', 'c').alias("test_2"))
    .group_by("d")
    .agg("test_2")),   
]
)

With a single struct column (and no list in the method call) this works just as expected and yields the output:


a            b            c            d    test_1
list[i64]    list[i64]    list[i64]    str  list[struct[2]]
[1, 2, … 4]  [1, 2, … 5]  [1, 2, … 4]  "d"  [{1,1}, {2,2}, … {4,5}]
[1, 2, … 4]  [1, 2, … 5]  [1, 2, … 4]  "b"  [{1,1}, {2,2}, … {4,5}]
[1, 2, … 4]  [1, 2, … 5]  [1, 2, … 4]  "c"  [{1,1}, {2,2}, … {4,5}]
[1, 2, … 4]  [1, 2, … 5]  [1, 2, … 4]  "a"  [{1,1}, {2,2}, … {4,5}]

However, even putting this single operation into a list in the method call creates this error:

df.with_columns([
    (df.explode('a', 'b')
    .select(
        "a",
        "b",
        "d",
        pl.struct('a', 'b').alias("test_1"))
    .group_by("d")
    .agg("test_1")),]
)

I'm sure this is some sort of simple error, but I can't find any information on its cause or solution.


Solution

  • Compute test_1 and test_2 as separate DataFrames.
    Use join to combine test_1 and test_2 with the original DataFrame on the "d" column.
    Avoid passing complete DataFrames to with_columns(); it expects expressions or Series.

    
    import polars as pl
    
    df = pl.DataFrame(
        {
            "a": [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]],
            "b": [[1, 2, 3, 5], [1, 2, 3, 5], [1, 2, 3, 5], [1, 2, 3, 5]],
            "c": [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]],
            "d": ['a', 'b', 'c', 'd']
        }
    )
    
    test_1 = (
        df.explode("a", "b")
        .select(
            "a",
            "b",
            "d",
            pl.struct("a", "b").alias("test_1")
        )
        .group_by("d")
        .agg(pl.col("test_1"))
    )
    
    test_2 = (
        df.explode("b", "c")
        .select(
            "c",
            "b",
            "d",
            pl.struct("b", "c").alias("test_2")
        )
        .group_by("d")
        .agg(pl.col("test_2"))
    )
    
    result = df.join(test_1, on="d").join(test_2, on="d")
    
    print(result)
    

    The resulting DataFrame was shown in an attached image.