What puzzles me is that when running
import polars as pl
import pydantic

class Cat(pydantic.BaseModel):
    name: str
    age: int

cats = [Cat(name="a", age=1), Cat(name="b", age=2)]
df = pl.DataFrame({"cats": cats})
df = df.with_columns(pl.lit(0).alias("acq_num"))

def wrap(batch):
    return Cat(name="c", age=3)

df = df.group_by("acq_num").agg(pl.col("cats").map_batches(wrap, return_dtype=pl.Struct).alias("cats"))
type(df["cats"][0][0])
# dict
the resulting entries are dicts, even though the function "wrap" returns a Cat. Does Polars automatically convert it to a dict by calling pydantic's `model_dump`?
Changing to
df = df.group_by("acq_num").agg(pl.col("cats").map_batches(wrap, return_dtype=pl.Object).alias("cats"))
results in the error:
SchemaError: expected output type 'Object("object", None)', got 'Struct([Field { name: "name", dtype: String }, Field { name: "age", dtype: Int64 }])'; set `return_dtype` to the proper datatype
I am confused by this conversion happening. How can I prevent it?
Polars already infers the dtype at `pl.DataFrame` construction:
cats = [Cat(name="a", age=1), Cat(name="b", age=2)]
df = pl.DataFrame({"cats": cats})
df
shape: (2, 1)
┌───────────┐
│ cats      │
│ ---       │
│ struct[2] │
╞═══════════╡
│ {"a",1}   │
│ {"b",2}   │
└───────────┘
df.schema
Schema([('cats', Struct({'name': String, 'age': Int64}))])
Rather than accessing `model_dump`, it looks at the `dict` storing the object's attributes and converts that into a struct:
Cat(name="a", age=1).__dict__
{'name': 'a', 'age': 1}
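One way to check this (a sketch; `LoudCat` and its serializer are hypothetical, purely for this test): give the model a field serializer so that `model_dump()` and `__dict__` disagree, then see which value lands in the struct. Assuming pydantic v2's `field_serializer`:

import polars as pl
import pydantic

class LoudCat(pydantic.BaseModel):  # hypothetical model, only for this check
    name: str
    age: int

    # makes model_dump() differ from the raw attribute value
    @pydantic.field_serializer("name")
    def _upper(self, v: str) -> str:
        return v.upper()

LoudCat(name="a", age=1).model_dump()
# {'name': 'A', 'age': 1}
pl.DataFrame({"cats": [LoudCat(name="a", age=1)]})["cats"].to_list()
# [{'name': 'a', 'age': 1}]  <- the raw attribute value, so model_dump was not used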
To avoid this behaviour, set `pl.Object` for `schema`:
df = pl.DataFrame({"cats": cats}, schema={'cats': pl.Object})
df
shape: (2, 1)
┌────────────────┐
│ cats           │
│ ---            │
│ object         │
╞════════════════╡
│ name='a' age=1 │
│ name='b' age=2 │
└────────────────┘
type(df.item(0, 'cats'))
__main__.Cat
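Alternatively, if you keep the Struct dtype (so native Polars operations remain available), you can rebuild the models from the dicts on the way out. A minimal sketch, assuming pydantic v2's `model_validate`:

df_struct = pl.DataFrame({"cats": cats})  # Struct dtype, as inferred earlier
# each struct row comes back as a plain dict, which pydantic can re-validate
recovered = [Cat.model_validate(d) for d in df_struct["cats"].to_list()]
type(recovered[0])
# __main__.Cat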
Now, with `map_batches`, the expected output is a `pl.Series` (or a `np.array`, which it converts). So, here too, you need to specify the dtype at construction:
# import numpy as np

def wrap_batch(batch):
    return pl.Series([Cat(name="c", age=3)], dtype=pl.Object)
    # return np.array([Cat(name="c", age=3)], dtype=object) will also work

df = df.group_by("acq_num").agg(
    pl.col("cats").map_batches(wrap_batch, return_dtype=pl.Object).alias("cats")
)
(Note that the above also works without `return_dtype=pl.Object`, "but [this] is considered a bug in the user's query".)
type(df.explode('cats').item(0, 'cats'))
__main__.Cat
# you could add `returns_scalar=True` in this trivial example to avoid `explode`
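For reference, the `returns_scalar=True` variant mentioned in the comment could look like this (a sketch; the flag treats the one-element result of `wrap_batch` as a scalar, so the aggregation does not wrap it in a list):

df2 = df.group_by("acq_num").agg(
    pl.col("cats")
    .map_batches(wrap_batch, return_dtype=pl.Object, returns_scalar=True)
    .alias("cats")
)
type(df2.item(0, "cats"))
# __main__.Cat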
With `map_elements`, you would only need to worry about the dtype in this case via `return_dtype`:
def wrap_element(element):
    return Cat(name="c", age=3)

df = df.group_by("acq_num").agg(
    pl.col("cats").map_elements(wrap_element, return_dtype=pl.Object)
    .alias("cats")
)
type(df.item(0, 'cats'))
__main__.Cat