python, python-3.x, python-polars, pydantic

Pydantic objects as elements in a polars dataframe get automatically converted to dicts (structs)?


What puzzles me is that when running

import polars as pl
import pydantic

class Cat(pydantic.BaseModel):
    name: str
    age: int

cats = [Cat(name="a", age=1), Cat(name="b", age=2)]
df = pl.DataFrame({"cats": cats})
df = df.with_columns(pl.lit(0).alias("acq_num"))

def wrap(batch):
    return Cat(name="c", age=3)  # returns a single Cat, not a Series

df = df.group_by("acq_num").agg(pl.col("cats").map_batches(wrap, return_dtype=pl.Struct).alias("cats"))

type(df["cats"][0][0])
# dict

the resulting entries are dicts, even though the function "wrap" returns a Cat. Does polars automatically convert it to a dict by calling pydantic's model_dump?

Changing to

df = df.group_by("acq_num").agg(pl.col("cats").map_batches(wrap, return_dtype=pl.Object).alias("cats"))

results in the error:

SchemaError: expected output type 'Object("object", None)', got 'Struct([Field { name: "name", dtype: String }, Field { name: "age", dtype: Int64 }])'; set `return_dtype` to the proper datatype

I am confused by this conversion. How can I prevent it?


Solution

  • Polars already infers the dtype at pl.DataFrame construction:

    cats = [Cat(name="a", age=1), Cat(name="b", age=2)]
    df = pl.DataFrame({"cats": cats})
    
    df
    
    shape: (2, 1)
    ┌───────────┐
    │ cats      │
    │ ---       │
    │ struct[2] │
    ╞═══════════╡
    │ {"a",1}   │
    │ {"b",2}   │
    └───────────┘
    
    df.schema
    
    Schema([('cats', Struct({'name': String, 'age': Int64}))])
    

    Rather than calling model_dump, Polars looks at the dict storing the object's attributes and converts that into a struct:

    Cat(name="a", age=1).__dict__
    
    {'name': 'a', 'age': 1}
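
    You can check this directly: override model_dump in a subclass and see whether Polars ever calls it during construction (a minimal sketch; LoudCat is just an illustrative name):

    class LoudCat(Cat):
        def model_dump(self, **kwargs):
            print("model_dump called")
            return super().model_dump(**kwargs)

    pl.DataFrame({"cats": [LoudCat(name="a", age=1)]})
    # prints nothing, consistent with Polars reading the
    # attribute dict rather than calling model_dump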
    

    To avoid this behaviour, pass pl.Object in the schema:

    df = pl.DataFrame({"cats": cats}, schema={'cats': pl.Object})
    
    df
    
    shape: (2, 1)
    ┌────────────────┐
    │ cats           │
    │ ---            │
    │ object         │
    ╞════════════════╡
    │ name='a' age=1 │
    │ name='b' age=2 │
    └────────────────┘
    
    type(df.item(0, 'cats'))
    __main__.Cat
    

    Now, with map_batches, the expected output is a pl.Series (or a np.ndarray, which Polars converts). So here, too, you need to specify the dtype when constructing the Series:

    # import numpy as np
    
    def wrap_batch(batch):
        return pl.Series([Cat(name="c", age=3)], dtype=pl.Object)
        # return np.array([Cat(name="c", age=3)], dtype=object) will also work
    
    df = df.group_by("acq_num").agg(
        pl.col("cats").map_batches(wrap_batch, return_dtype=pl.Object).alias("cats")
        )
    

    (Note that the above also works without return_dtype=pl.Object, "but [this] is considered a bug in the user's query".)

    type(df.explode('cats').item(0, 'cats'))
    __main__.Cat
    
    # you could add `returns_scalar=True` in this trivial example to avoid `explode`
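
    A minimal sketch of that variant, rewriting the aggregation above (wrap_scalar is an illustrative name; the assumption is that returns_scalar=True treats the length-1 result as a single value, so the aggregation no longer wraps it in a list):

    def wrap_scalar(batch):
        return pl.Series([Cat(name="c", age=3)], dtype=pl.Object)

    df = df.group_by("acq_num").agg(
        pl.col("cats")
        .map_batches(wrap_scalar, return_dtype=pl.Object, returns_scalar=True)
        .alias("cats")
        )

    type(df.item(0, 'cats'))
    __main__.Cat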
    

    With map_elements, you only need to specify the dtype via return_dtype in this case:

    def wrap_element(element):
        return Cat(name="c", age=3)
    
    df = df.group_by("acq_num").agg(
        pl.col("cats").map_elements(wrap_element, return_dtype=pl.Object)
        .alias("cats")
        )
    
    type(df.item(0, 'cats'))
    __main__.Cat