Note: you can install the specific (pre-release) pandera version using
pip install --pre 'pandera[polars]'
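To confirm which versions are actually installed (a quick sanity check; both packages expose __version__):

import pandera
import polars

print(pandera.__version__, polars.__version__)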
We are trying a simple validation example using polars. We can't understand the problem or why it originates, but pandera raises a polars.exceptions.ComputeError when any of the validations fail and there is a null in the data.
For example, in the code below, the dummy data contains an extract_date column with a None. It runs fine if all case_id values are int-convertible strings, but throws the exception if any case_id is not int-convertible.
Here is the code:
import pandera.polars as pa
import polars as pl
from datetime import date
import json


class CaseSchema(pa.DataFrameModel):
    case_id: int = pa.Field(nullable=False, unique=True, coerce=True)
    gdwh_portfolio_id: str = pa.Field(nullable=False, unique=True, coerce=True)
    extract_date: date = pa.Field(nullable=True, coerce=True)

    class Config:
        drop_invalid_rows = True


invalid_lf = pl.DataFrame({
    # "case_id": ["1", "2", "3"],
    "case_id": ["1", "2", "abc"],
    "gdwh_portfolio_id": ["d", "e", "f"],
    "extract_date": [date(2024, 1, 1), date(2024, 1, 2), None],
})

try:
    CaseSchema.validate(invalid_lf, lazy=True)
except pa.errors.SchemaErrors as e:
    print(json.dumps(e.message, indent=4))
It gives: conversion from `struct[3]` to `str` failed in column 'failure_case' for 1 out of 1 values: [{"abc","f",null}]
If you uncomment "case_id": ["1", "2", "3"] and comment out "case_id": ["1", "2", "abc"], it runs fine.
Not sure why it panics when there are nulls. If there are no nulls in the data it works fine.
The trace we get is:
> Traceback (most recent call last):
> File "<frozen runpy>", line 198, in _run_module_as_main
> File "<frozen runpy>", line 88, in _run_code
> File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/erehoba-acc-payments-req/code/Users/ourrehman/dna-payments-and-accounts/data_validation/test.py", line 22, in <module>
> CaseSchema.validate(invalid_lf, lazy=True)
> File "/anaconda/envs/pandera-polars/lib/python3.11/site-packages/pandera/api/dataframe/model.py", line 289, in validate
> cls.to_schema().validate(
> File "/anaconda/envs/pandera-polars/lib/python3.11/site-packages/pandera/api/polars/container.py", line 58, in validate
> output = self.get_backend(check_obj).validate(
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> File "/anaconda/envs/pandera-polars/lib/python3.11/site-packages/pandera/backends/polars/container.py", line 65, in validate
> check_obj = parser(check_obj, *args)
> ^^^^^^^^^^^^^^^^^^^^^^^^
> File "/anaconda/envs/pandera-polars/lib/python3.11/site-packages/pandera/backends/polars/container.py", line 398, in coerce_dtype
> check_obj = self._coerce_dtype_helper(check_obj, schema)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> File "/anaconda/envs/pandera-polars/lib/python3.11/site-packages/pandera/backends/polars/container.py", line 486, in _coerce_dtype_helper
> raise SchemaErrors(
> ^^^^^^^^^^^^^
> File "/anaconda/envs/pandera-polars/lib/python3.11/site-packages/pandera/errors.py", line 183, in __init__
> ).failure_cases_metadata(schema.name, schema_errors)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> File "/anaconda/envs/pandera-polars/lib/python3.11/site-packages/pandera/backends/polars/base.py", line 173, in failure_cases_metadata
> ).cast(
> ^^^^^
> File "/anaconda/envs/pandera-polars/lib/python3.11/site-packages/polars/dataframe/frame.py", line 6624, in cast
> return self.lazy().cast(dtypes, strict=strict).collect(_eager=True)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> File "/anaconda/envs/pandera-polars/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 1810, in collect
> return wrap_df(ldf.collect())
> ^^^^^^^^^^^^^
> polars.exceptions.ComputeError: conversion from `struct[3]` to `str` failed in column 'failure_case' for 1 out of 1 values: [{"abc","f",null}]
It should work with columns that contain nulls and are set to nullable=True. From the trace, the error seems to come from pandera's failure-case reporting: the struct of collected failure cases ({"abc","f",null}) fails to cast to str once it contains a null.
pandera: 0.19.0b3, polars: 0.20.23, python: 3.11
It was a bug in 0.19.0b3. I created an issue: https://github.com/unionai-oss/pandera/issues/1607
The PR will fix the issue, but a 0.19.0 release is available now as well.
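If you are stuck on the affected 0.19.0b3 build, one possible workaround (a sketch only, not verified against that exact version) is to pre-coerce case_id in polars before validating, reusing CaseSchema and invalid_lf from the snippet above; with strict=False, non-convertible strings become nulls, so pandera's own coercion step has nothing to report:

# Coerce case_id up front: "abc" becomes null instead of a coercion failure case
pre_coerced = invalid_lf.with_columns(
    pl.col("case_id").cast(pl.Int64, strict=False)
)

try:
    CaseSchema.validate(pre_coerced, lazy=True)
except pa.errors.SchemaErrors as e:
    # The null case_id is then flagged by the nullable=False field instead
    print(json.dumps(e.message, indent=4))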