python pandas pandera

pandera.errors.BackendNotFoundError with Pandas Dataframe


pandera: 0.18.3
pandas: 2.2.2
python: 3.9/3.11

Hi,

I am unable to set up pandera for a pandas DataFrame. Validation fails with:

File "/anaconda/envs/data_quality_env/lib/python3.9/site-packages/pandera/api/base/schema.py",
line 96, in get_backend
        raise BackendNotFoundError(
    pandera.errors.BackendNotFoundError: Backend not found for backend, class: (<class 'data_validation.schemas.case.CaseSchema'>,
<class 'pandas.core.frame.DataFrame'>). Looked up the following base
classes: (<class 'pandas.core.frame.DataFrame'>, <class 'pandas.core.generic.NDFrame'>, <class 'pandas.core.base.PandasObject'>, <class 'pandas.core.accessor.DirNamesMixin'>, <class 'pandas.core.indexing.IndexingMixin'>, <class 'pandas.core.arraylike.OpsMixin'>, <class 'object'>)

My folder structure is:

project/
    data_validation/
        schemas/
            case.py
        validation/
            validations.py
    pipeline.py

case.py:

import pandas as pd
import pandera as pa

class CaseSchema(pa.DataFrameSchema):
    case_id = pa.Column(pa.Int)

validations.py:

import pandas as pd
from data_validation.schemas.case import CaseSchema

def validate_case_data(df: pd.DataFrame) -> pd.DataFrame:
    """Validate a DataFrame against the CaseSchema."""
    schema = CaseSchema()
    return schema.validate(df)

pipeline.py:

import pandas as pd
from data_validation.validation.validations import validate_case_data

def validate_df(df: pd.DataFrame) -> pd.DataFrame:
    """Process data, validating it against the CaseSchema."""
    validated_df = validate_case_data(df)
    return validated_df

df = pd.DataFrame({
    "case_id": [1, 2, 3]
})

processed_df = validate_df(df)

Solution

  • This can be solved by overriding the get_backend classmethod in CaseSchema so that it returns the pandas DataFrameSchemaBackend directly:

    import pandas as pd
    import pandera as pa
    from pandera.backends.pandas.container import DataFrameSchemaBackend
    
    class CaseSchema(pa.DataFrameSchema):
        case_id = pa.Column(pa.Int)
    
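        @classmethod
        def get_backend(cls, check_obj=None, check_type=None):
            # Resolve the type of the object being validated.
            if check_obj is not None:
                check_obj_cls = type(check_obj)
            elif check_type is not None:
                check_obj_cls = check_type
            else:
                raise ValueError("Must pass in one of `check_obj` or `check_type`.")
    
            # Register the default backends, then return the DataFrame backend
            # directly instead of looking it up by schema class.
            cls.register_default_backends(check_obj_cls)
            return DataFrameSchemaBackend()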
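
Why the override helps: the traceback suggests the backend is looked up by the pair (schema class, data class), and that only the data class's base classes are searched, not the schema's. The following is a simplified pure-Python model of that lookup, not pandera's actual code. ToySchema, ToyBackend, ToyFrame, SubSchema and PatchedSchema are hypothetical names used only for illustration.

```python
import inspect


class BackendNotFoundError(Exception):
    """Stand-in for pandera.errors.BackendNotFoundError."""


class ToySchema:
    # Registry keyed on (schema class, data class), shared by all subclasses.
    BACKEND_REGISTRY = {}

    @classmethod
    def register_backend(cls, data_cls, backend):
        cls.BACKEND_REGISTRY[(cls, data_cls)] = backend

    @classmethod
    def get_backend(cls, check_obj):
        # Only the data object's MRO is walked; the schema class must match exactly.
        for data_cls in inspect.getmro(type(check_obj)):
            try:
                return cls.BACKEND_REGISTRY[(cls, data_cls)]()
            except KeyError:
                pass
        raise BackendNotFoundError(
            f"Backend not found for {(cls, type(check_obj))}"
        )


class ToyBackend:
    """Hypothetical backend; a real one would run the checks."""


class ToyFrame:
    """Hypothetical stand-in for pandas.DataFrame."""


# Assumption: registration happens for the base schema class only.
ToySchema.register_backend(ToyFrame, ToyBackend)


class SubSchema(ToySchema):
    """Plain subclass: (SubSchema, ToyFrame) was never registered."""


class PatchedSchema(ToySchema):
    @classmethod
    def get_backend(cls, check_obj):
        # Same idea as the override above: skip the lookup, return the backend.
        return ToyBackend()


def lookup_fails(schema_cls):
    """Return True if the backend lookup raises BackendNotFoundError."""
    try:
        schema_cls.get_backend(ToyFrame())
    except BackendNotFoundError:
        return True
    return False


base_ok = not lookup_fails(ToySchema)        # registered pair is found
sub_fails = lookup_fails(SubSchema)          # subclass key is missing
patched_ok = not lookup_fails(PatchedSchema)  # override bypasses the registry
```

In this model the base class resolves its backend, the plain subclass does not, and the subclass with the override does, which matches the behaviour described in the question. Separately, as far as I can tell, pa.DataFrameSchema takes its columns through the constructor, e.g. pa.DataFrameSchema({"case_id": pa.Column(int)}), so a class-level case_id = pa.Column(...) on a DataFrameSchema subclass may not be checked at all. The class-attribute style belongs to pa.DataFrameModel, so it is worth confirming that an invalid DataFrame is actually rejected after applying the override.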