Tags: python, transpose, python-polars

Transposing within a Polars df leads to "TypeError: not yet implemented: Nested object types"


I have this data:

┌────────────┬──────────────────────────────────────┐
│ col1       ┆ col2                                 │
│ ---        ┆ ---                                  │
│ list[str]  ┆ list[list[str]]                      │
╞════════════╪══════════════════════════════════════╡
│ ["a"]      ┆ [["a"]]                              │
│ ["b", "c"] ┆ [["b", "c"], ["b", "c"], ["b", "c"]] │
│ ["d"]      ┆ [["d"]]                              │
└────────────┴──────────────────────────────────────┘

I want all the b's in one list and all the c's in another in row 2, but as you can see, the b-to-b and c-to-c associations are not maintained there. With pandas I used:

import pandas as pd

pddf = pd.DataFrame({"col1": [["a"], ["b", "c"], ["d"]],
                     "col2": [[["a"]], [["b", "c"], ["b", "c"], ["b", "c"]], [["d"]]]})

pddf["col2"] = pddf["col2"].apply(lambda listed: pd.DataFrame(listed).transpose().values.tolist())

print(pddf)

#      col1                    col2
# 0     [a]                   [[a]]
# 1  [b, c]  [[b, b, b], [c, c, c]]
# 2     [d]                   [[d]]

This is the desired result. I am trying to do the same with polars by replacing pddf.transpose().values.tolist() with pldf.transpose().to_numpy().tolist(), but I always get a TypeError: not yet implemented: Nested object types. Are there any workarounds? Here is the complete polars code:

import polars as pl

pldf = pl.DataFrame({"col1": [["a"], ["b", "c"], ["d"]],
                     "col2": [[["a"]], [["b", "c"], ["b", "c"], ["b", "c"]], [["d"]]]})

pldf = pldf.with_columns(pl.col("col2").map_elements(lambda listed: pl.DataFrame(listed).transpose().to_numpy().tolist()))

print(pldf)

# TypeError: not yet implemented: Nested object types
# Hint: Try setting `strict=False` to allow passing data with mixed types.

Where would I need to apply the mentioned strict=False?

On a simpler df, pddf.transpose().values.tolist() and pldf.transpose().to_numpy().tolist() produce the same result:

import pandas as pd
import polars as pl

pd.DataFrame(
    {"col1": ["a", "b", "c"],
     "col2": ["d", "e", "f"]}
     ).transpose().values.tolist() == pl.DataFrame(
                                          {"col1": ["a", "b", "c"],
                                           "col2": ["d", "e", "f"]}
                                           ).transpose().to_numpy().tolist()

# True

Please keep as close as possible to this code; I know using .apply() or .map_elements() isn't ideal, but this is part of a far larger project and I don't want to break anything else :).

(EDIT: I simplified the code a little since the second lambda wasn't really necessary for the question.)


Solution

  • As you've mentioned performance in the comments, this may also be of interest.

    Building upon @Herick's answer: when you add an index and explode, you are guaranteed to have sorted data.

    With sorted data, you can add a "row number per group" using rle(), which can be significantly faster.

    (df.with_row_index("row")
       .select("row", "col2")
       .explode("col2")
       .with_row_index()
       .explode("col2")
       .with_columns(
          pl.int_ranges(pl.col("index").rle().struct.field("len"))
            .flatten()
            .alias("list_idx")
       )
    )
    
    shape: (8, 4)
    ┌───────┬─────┬──────┬──────────┐
    │ index ┆ row ┆ col2 ┆ list_idx │
    │ ---   ┆ --- ┆ ---  ┆ ---      │
    │ u32   ┆ u32 ┆ str  ┆ i64      │
    ╞═══════╪═════╪══════╪══════════╡
    │ 0     ┆ 0   ┆ a    ┆ 0        │
    │ 1     ┆ 1   ┆ b    ┆ 0        │
    │ 1     ┆ 1   ┆ c    ┆ 1        │
    │ 2     ┆ 1   ┆ b    ┆ 0        │
    │ 2     ┆ 1   ┆ c    ┆ 1        │
    │ 3     ┆ 1   ┆ b    ┆ 0        │
    │ 3     ┆ 1   ┆ c    ┆ 1        │
    │ 4     ┆ 2   ┆ d    ┆ 0        │
    └───────┴─────┴──────┴──────────┘
    

    .map_batches() can be used to wrap the frame exploding/group_by logic into an expression context.

    df = pl.DataFrame({
        "col1": [["a"], ["b", "c"], ["d"]],
        "col2": [[["a"]], [["b", "c"], ["b", "c"], ["b", "c"]], [["d"]]],
        "col3": [[[1]], [[2, 3], [4, 5], [6, 7]], [[8]]]
    })
    
    df.with_columns(
       pl.col("col2", "col3").map_batches(lambda s:
          s.to_frame()
           .with_row_index("row")
           .explode(pl.last())
           .with_row_index()
           .explode(pl.last())
           .with_columns(
              pl.int_ranges(pl.col("index").rle().struct.field("len"))
                .flatten()
                .alias("index")
           )
           .group_by("index", "row", maintain_order=True) # order required for "horizontal concat"
           .agg(pl.last())
           .group_by("row", maintain_order=True)          # order required for "horizontal concat"
           .agg(pl.last())
           .select(pl.last())
           .to_series() # must return a series
       )
    )
    
    shape: (3, 3)
    ┌────────────┬─────────────────────────────────┬────────────────────────┐
    │ col1       ┆ col2                            ┆ col3                   │
    │ ---        ┆ ---                             ┆ ---                    │
    │ list[str]  ┆ list[list[str]]                 ┆ list[list[i64]]        │
    ╞════════════╪═════════════════════════════════╪════════════════════════╡
    │ ["a"]      ┆ [["a"]]                         ┆ [[1]]                  │
    │ ["b", "c"] ┆ [["b", "b", "b"], ["c", "c", "… ┆ [[2, 4, 6], [3, 5, 7]] │
    │ ["d"]      ┆ [["d"]]                         ┆ [[8]]                  │
    └────────────┴─────────────────────────────────┴────────────────────────┘
    

    As a basic comparison, if I sample 5_000_000 rows, I get:

    name          time
    map_elements  61.19s
    map_batches   7.14s