I'm new to Pydantic and Pandera, and need some help with class instantiation and initialization.
I have the following code in one file, sim.py:
import pandera as pa
from pandera.typing import DataFrame, Series
from pydantic import BaseModel
from datetime import datetime

class ScheduleDF(pa.SchemaModel):
    person_id: Series[int] = pa.Field(ge=0, coerce=True)
    shift_id: Series[int] = pa.Field(ge=0, coerce=True)
    start_time: Series[datetime]
    end_time: Series[datetime]

class Schedule(BaseModel):
    schedule_df: DataFrame[ScheduleDF]
    events_df: DataFrame[EventsDF]

    @pa.check_types
    def initialize_from_df(self, schedule_df: DataFrame[ScheduleDF]):
        self.schedule_df = schedule_df
and the following code in another file, sim_test.py:
from datetime import datetime
from pandera.typing import DataFrame
from sim import ScheduleDF, Schedule

def test_schedule():
    y = 2022
    m = 9
    d = 1
    schedule_df = DataFrame[ScheduleDF](
        {
            'person_id': [1, 2],
            'shift_id': [10, 20],
            'start_time': [datetime(y, m, d, 0, 0, 1), datetime(y, m, d, 0, 0, 5)],
            'end_time': [datetime(y, m, d, 0, 0, 3), datetime(y, m, d, 0, 0, 6)]
        }
    )
    sample_schedule = Schedule()
    sample_schedule.initialize_from_df(schedule_df)

test_schedule()
When I run sim_test.py, I get the following error:
pydantic.error_wrappers.ValidationError: 2 validation errors for Schedule
schedule_df
  field required (type=value_error.missing)
events_df
  field required (type=value_error.missing)
I can see why the events_df is missing - I don't initialize it inside of test_schedule(). However, it seems that I've initialized the schedule_df.
I tried adding @classmethod above the @pa.check_types decorator for initialize_from_df() and changing self in that function to cls as suggested here and here, but it gave me those same errors still. It seems to be a Pydantic issue and not a Pandera issue.
I would appreciate some help figuring out what's going on and how I can rectify it. Thanks!
In your Schedule class, you defined two mandatory fields, schedule_df and events_df.
class Schedule(BaseModel):
    schedule_df: DataFrame[ScheduleDF]
    events_df: DataFrame[EventsDF]
So, when you try to instantiate it with sample_schedule = Schedule(), it will necessarily fail as you don't provide any value for those two mandatory fields.
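To see that failure in isolation, here is a minimal sketch that uses plain lists as stand-ins for the Pandera dataframes, so it only needs pydantic installed. The class name RequiredSchedule and the list-typed fields are made up for illustration and are not part of your code:

```python
from typing import List

from pydantic import BaseModel, ValidationError


class RequiredSchedule(BaseModel):
    # Hypothetical stand-ins for DataFrame[ScheduleDF] and DataFrame[EventsDF].
    # Neither field has a default, so both are required.
    schedule_df: List[int]
    events_df: List[int]


try:
    RequiredSchedule()  # no values passed for the required fields
    missing_fields = []
except ValidationError as exc:
    # Each error dict carries the field path in "loc".
    missing_fields = sorted(err["loc"][0] for err in exc.errors())

print(missing_fields)
```

Both field names end up in missing_fields, which mirrors the two "field required" errors in your traceback: validation happens at construction time, before initialize_from_df() is ever called.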
Basically, you have two ways forward:
1. Provide both a ScheduleDF and an EventsDF dataframe when instantiating, like this (note: you forgot to give us the definition of the EventsDF class, so I'm just making one up):

schedule_df = DataFrame[ScheduleDF](
    {
        'person_id': [1, 2],
        'shift_id': [10, 20],
        'start_time': [datetime(y, m, d, 0, 0, 1), datetime(y, m, d, 0, 0, 5)],
        'end_time': [datetime(y, m, d, 0, 0, 3), datetime(y, m, d, 0, 0, 6)]
    }
)
events_df = DataFrame[EventsDF](
    {
        'event_id': [1],
        'start_time': [datetime(y, m, d, 0, 0, 3)],
        'end_time': [datetime(y, m, d, 0, 0, 4)]
    }
)
sample_schedule = Schedule(schedule_df=schedule_df, events_df=events_df)

2. If you want to set schedule_df and events_df afterwards, make them optional in the class definition:

from typing import Optional

class Schedule(BaseModel):
    # Explicit None defaults: Pydantic v2 treats Optional fields without a default as required
    schedule_df: Optional[DataFrame[ScheduleDF]] = None
    events_df: Optional[DataFrame[EventsDF]] = None
With that, calling sample_schedule = Schedule() will work, and sample_schedule will basically contain schedule_df=None events_df=None.
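The second option can be sketched end to end as follows. Again, plain lists stand in for the dataframes so that the snippet only depends on pydantic. OptionalSchedule and the list types are assumptions for illustration, and the @pa.check_types decorator is left out:

```python
from typing import List, Optional

from pydantic import BaseModel


class OptionalSchedule(BaseModel):
    # Hypothetical stand-ins for Optional[DataFrame[ScheduleDF]] and
    # Optional[DataFrame[EventsDF]]; the explicit None default makes
    # each field optional at construction time.
    schedule_df: Optional[List[int]] = None
    events_df: Optional[List[int]] = None

    def initialize_from_df(self, schedule_df: List[int]) -> None:
        # Assigning to a declared field on an existing instance is allowed.
        self.schedule_df = schedule_df


sample_schedule = OptionalSchedule()  # succeeds: both fields start as None
sample_schedule.initialize_from_df([10, 20])
print(sample_schedule)
```

Keep in mind that a plain attribute assignment like this is not re-validated by Pydantic unless you turn on validate_assignment in the model configuration, so in your real class the Pandera check on initialize_from_df() is what guards the dataframe.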