I have received a dataframe from an institute and the column names have some special characters which are not allowed in Python variable naming. I would like to use the DataFrameModel
and NOT the DataFrameSchema
in pandera to create a schema to validate the dataset. The problem is that, I can not use these column names with special characters as python variables.
Here is a minimal working example: Let us suppose I want to use the DataFrameSchema
class. In that case, I will just do this:
from pandera import DataFrameSchema, Column
import pandera as pa
schema = DataFrameSchema(
columns={
"Time.Phase": Column(
dtype=float,
nullable=False,
unique=False,
coerce=True,
required=True,
checks=pa.Check.greater_than_or_equal_to(min_value=0.1),
description="Time measurement in seconds within the selected phase."
),
"Phase._dynamic": Column(
dtype=float,
nullable=False,
unique=False,
coerce=True,
required=True,
checks=pa.Check.greater_than_or_equal_to(min_value=0.5),
description="Measurement of phase dynamics."
)
}
)
valid_data = pd.DataFrame.from_records([
{"Time.Phase": 0.1, "Phase._dynamic": 0.5},
{"Time.Phase": 0.2, "Phase._dynamic": 0.75}
])
invalid_data = pd.DataFrame.from_records([
{"Time.Phase": 0.1, "Phase._dynamic": -0.5},
{"Time.Phase": 0.2, "Phase._dynamic": 0.75}
])
schema.validate(valid_data, lazy=True)
try:
schema.validate(invalid_data, lazy=True)
except pa.errors.SchemaErrors as exc:
display(exc.failure_cases)
If I do something similar using the DataFrameModel
class, it is supposed to be like this:
import pandera as pa
from pandera.typing import Series
class Schema(pa.DataFrameModel):
Time.Phase: Series[float] = pa.Field(
nullable=False,
unique=False,
coerce=True,
description="Time measurement in seconds within the selected phase."
)
Phase._dynamic: Series[float] = pa.Field(
nullable=False,
unique=False,
coerce=True,
description="Measurement of phase dynamics."
)
However, Time.Phase
and Phase._dynamic
are NOT valid variables in Python, hence can not be used as columns names and running the code gives NameError
.
Here is what I tried. I first created the Python file institute_data_columns.py
with the code below as content.
class InstituteDataColumns:
time_phase_1 = "Time.Phase"
phase_dynamic = "Phase._dynamic"
Next, I created another Python file, institute_data_schema.py
with the following content:
from institute_data_columns import InstituteDataColumns
from pandera.typing import Series
import pandera as pa
class Schema(pa.DataFrameModel):
InstituteDataColumns.time_phase_1: Series[float] = pa.Field(ge=0.1,
nullable=False,
coerce=True,
description="Time measurement in seconds within the selected phase."
)
InstituteDataColumns.phase_dynamic: Series[float] = pa.Field(
ge=0.5,
nullable=False,
coerce=True,
description="Measurement of phase dynamics."
)
Schema.validate(valid_data, lazy=True)
try:
Schema.validate(invalid_data, lazy=True)
except pa.errors.SchemaErrors as exc:
display(exc.failure_cases)
Note that, in try
-except
-block, where I validate both valid_data
and invalid_data
both works, which is not supposed to be so. Moreover, I can not retrieve the column names defined in the Schema
class. That is, Schema._collect_fields()
returns {}
, however not so for schema.columns
. Is there a way I will be able to retrieve the column names in the Schema
class?
I know that, I can use pandas
to rename the column names using this:
import pandas as pd
pd.DataFrame(valid_data).rename(columns={
"Time.Phase": "time_phase",
"Phase._dynamic": "phase_dynamic"
})
But this that is not allowed either.
I found the answer on this pandera documentation page. All I needed to do was to use the alias
keyword available in pandera.Field
and assign it to the column name containing unsupported characters. The complete code is below:
from pandera.typing import Series
import pandera as pa
class InstituteDataColumns:
time_phase_1 = "Time.Phase"
phase_dynamic = "Phase._dynamic"
class Schema(pa.DataFrameModel):
time_phase: Series[float] = pa.Field(
ge=0.1,
alias=InstituteDataColumns.time_phase_1,
nullable=False,
coerce=True,
description="Time measurement in seconds within the selected phase."
)
phase_dynamic: Series[float] = pa.Field(
ge=0.5,
alias=InstituteDataColumns.phase_dynamic,
nullable=False,
coerce=True,
description="Measurement of phase dynamics."
)
Schema.validate(valid_data, lazy=True)
try:
Schema.validate(invalid_data, lazy=True)
except pa.errors.SchemaErrors as exc:
display(exc.failure_cases)
Now Schema._collect_fields()
should work now.