
Initializing Class Attributes With Pydantic and Pandera


I'm new to Pydantic and Pandera, and need some help with class instantiation and initialization.

I have the following code in one file, sim.py:

import pandera as pa
from pandera.typing import DataFrame, Series
from pydantic import BaseModel
from datetime import datetime

class ScheduleDF(pa.SchemaModel):
    person_id: Series[int] = pa.Field(ge=0, coerce=True)
    shift_id: Series[int] = pa.Field(ge=0, coerce=True)
    start_time: Series[datetime]
    end_time: Series[datetime]

class Schedule(BaseModel):
    schedule_df: DataFrame[ScheduleDF]
    events_df: DataFrame[EventsDF]

    @pa.check_types
    def initialize_from_df(self, schedule_df: DataFrame[ScheduleDF]):
        self.schedule_df = schedule_df

and the following code in another file, sim_test.py:

from datetime import datetime
from pandera.typing import DataFrame
from sim import ScheduleDF, Schedule

def test_schedule():
    y = 2022
    m = 9
    d = 1

    schedule_df = DataFrame[ScheduleDF](
        {'person_id': [1, 2], 'shift_id': [10, 20],
         'start_time': [datetime(y, m, d, 0, 0, 1), datetime(y, m, d, 0, 0, 5)],
         'end_time': [datetime(y, m, d, 0, 0, 3), datetime(y, m, d, 0, 0, 6)]
        }
    )
    sample_schedule = Schedule()
    sample_schedule.initialize_from_df(schedule_df)

test_schedule()

When I run sim_test.py, I get the following error:

pydantic.error_wrappers.ValidationError: 2 validation errors for Schedule
schedule_df
field required (type=value_error.missing)
events_df
field required (type=value_error.missing)

I can see why the events_df is missing - I don't initialize it inside of test_schedule(). However, it seems that I've initialized the schedule_df.

I tried adding @classmethod above the @pa.check_types decorator for initialize_from_df() and changing self to cls in that function, as suggested here and here, but I still got the same errors. This looks like a Pydantic issue rather than a Pandera issue.

I would appreciate some help figuring out what's going on and how I can rectify it. Thanks!


Solution

  • In your Schedule class, you defined two mandatory fields, schedule_df and events_df.

    class Schedule(BaseModel):
        schedule_df: DataFrame[ScheduleDF]
        events_df: DataFrame[EventsDF]
    

    So, when you try to instantiate it with sample_schedule = Schedule(), it will necessarily fail as you don't provide any value for those two mandatory fields.

    Basically, you have two ways forward:

    1. Pass a validated DataFrame for each field when instantiating, like this (note: you did not include the definition of the EventsDF class, so I'm just making one up):
    schedule_df = DataFrame[ScheduleDF](
        {
            'person_id': [1, 2],
            'shift_id': [10, 20],
            'start_time': [datetime(y, m, d, 0, 0, 1), datetime(y, m, d, 0, 0, 5)],
            'end_time': [datetime(y, m, d, 0, 0, 3), datetime(y, m, d, 0, 0, 6)]
        }
    )
    
    events_df = DataFrame[EventsDF](
        {
            'event_id': [1],
            'start_time': [datetime(y, m, d, 0, 0, 3)],
            'end_time': [datetime(y, m, d, 0, 0, 4)]
        }
    )
    
    sample_schedule = Schedule(schedule_df=schedule_df, events_df=events_df)
    
    2. Or, if you really want to assign values to schedule_df and events_df afterwards, make them optional in the class definition:
    from typing import Optional
    
    
    class Schedule(BaseModel):
        # An explicit None default keeps the fields optional on both Pydantic v1 and v2
        schedule_df: Optional[DataFrame[ScheduleDF]] = None
        events_df: Optional[DataFrame[EventsDF]] = None
    

    With that, calling sample_schedule = Schedule() will work, and sample_schedule will basically contain schedule_df=None events_df=None.
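
To see the difference between the two options in isolation, here is a minimal sketch that leaves Pandera out entirely. The class names RequiredSchedule and OptionalSchedule, the shift_ids field, and the missing_fields helper are made up for illustration and are not part of your code. It assumes only that Pydantic is installed, and it should behave the same way with your DataFrame fields:

```python
from typing import List, Optional

from pydantic import BaseModel, ValidationError


class RequiredSchedule(BaseModel):
    # Hypothetical field with no default: Pydantic treats it as required
    shift_ids: List[int]


class OptionalSchedule(BaseModel):
    # Hypothetical field with an explicit None default: safe to omit
    shift_ids: Optional[List[int]] = None


def missing_fields(model_cls):
    """Return the field names Pydantic rejects when no arguments are given."""
    try:
        model_cls()
    except ValidationError as exc:
        return [err["loc"][0] for err in exc.errors()]
    return []


required_missing = missing_fields(RequiredSchedule)
optional_missing = missing_fields(OptionalSchedule)

# With optional fields, the instance can be created first and filled in later
late = OptionalSchedule()
late.shift_ids = [10, 20]

print(required_missing, optional_missing, late.shift_ids)
```

Keep in mind that assigning to an attribute afterwards is not validated by Pydantic by default, so passing the values to the constructor (option 1) is usually the safer choice if you want the checks to run.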