pythonpandasdataframedatabase-schemapandera

A pandera DataFrame Schema with special characters in column names


I have received a dataframe from an institute and the column names have some special characters which are not allowed in Python variable naming. I would like to use the DataFrameModel and NOT the DataFrameSchema in pandera to create a schema to validate the dataset. The problem is that, I can not use these column names with special characters as python variables.

Here is a minimal working example: Let us suppose I want to use the DataFrameSchema class. In that case, I will just do this:

from pandera import DataFrameSchema, Column
import pandera as pa

schema = DataFrameSchema(
    columns={
        "Time.Phase": Column(
            dtype=float,
            nullable=False,
            unique=False,
            coerce=True,
            required=True, 
            checks=pa.Check.greater_than_or_equal_to(min_value=0.1),
            description="Time measurement in seconds within the selected phase."
        ),
        "Phase._dynamic": Column(
            dtype=float,
            nullable=False,
            unique=False,
            coerce=True,
            required=True, 
            checks=pa.Check.greater_than_or_equal_to(min_value=0.5),
            description="Measurement of phase dynamics."
        )
    }
)

valid_data = pd.DataFrame.from_records([
    {"Time.Phase": 0.1, "Phase._dynamic": 0.5},
    {"Time.Phase": 0.2, "Phase._dynamic": 0.75}
])

invalid_data = pd.DataFrame.from_records([
    {"Time.Phase": 0.1, "Phase._dynamic": -0.5},
    {"Time.Phase": 0.2, "Phase._dynamic": 0.75}
])

schema.validate(valid_data, lazy=True)

try:
    schema.validate(invalid_data, lazy=True)
except pa.errors.SchemaErrors as exc:
    display(exc.failure_cases)

If I do something similar using the DataFrameModel class, it is supposed to be like this:

import pandera as pa
from pandera.typing import Series

class Schema(pa.DataFrameModel):
    Time.Phase: Series[float] = pa.Field(
        nullable=False, 
        unique=False,
        coerce=True,
        description="Time measurement in seconds within the selected phase."
    )
    Phase._dynamic: Series[float] = pa.Field(
        nullable=False, 
        unique=False,
        coerce=True,
        description="Measurement of phase dynamics."
    )

However, Time.Phase and Phase._dynamic are NOT valid variables in Python, hence can not be used as columns names and running the code gives NameError.

Here is what I tried. I first created the Python file institute_data_columns.py with the code below as content.

class InstituteDataColumns:
    time_phase_1 = "Time.Phase"
    phase_dynamic = "Phase._dynamic"

Next, I created another Python file, institute_data_schema.py with the following content:

from institute_data_columns import InstituteDataColumns
from pandera.typing import Series
import pandera as pa

class Schema(pa.DataFrameModel):
    InstituteDataColumns.time_phase_1: Series[float] = pa.Field(ge=0.1,
        nullable=False,
        coerce=True, 
        description="Time measurement in seconds within the selected phase."
    )
    InstituteDataColumns.phase_dynamic: Series[float] = pa.Field(
        ge=0.5,
        nullable=False,
        coerce=True,
        description="Measurement of phase dynamics."
    )

Schema.validate(valid_data, lazy=True)

try:
    Schema.validate(invalid_data, lazy=True)
except pa.errors.SchemaErrors as exc:
    display(exc.failure_cases)

Note that, in try-except-block, where I validate both valid_data and invalid_data both works, which is not supposed to be so. Moreover, I can not retrieve the column names defined in the Schema class. That is, Schema._collect_fields() returns {}, however not so for schema.columns. Is there a way I will be able to retrieve the column names in the Schema class?

I know that, I can use pandas to rename the column names using this:

import pandas as pd
pd.DataFrame(valid_data).rename(columns={
    "Time.Phase": "time_phase", 
    "Phase._dynamic": "phase_dynamic"
})

But this that is not allowed either.


Solution

  • I found the answer on this pandera documentation page. All I needed to do was to use the alias keyword available in pandera.Field and assign it to the column name containing unsupported characters. The complete code is below:

    from pandera.typing import Series
    import pandera as pa
    
    class InstituteDataColumns:
        time_phase_1 = "Time.Phase"
        phase_dynamic = "Phase._dynamic"
    
    class Schema(pa.DataFrameModel):
        time_phase: Series[float] = pa.Field(
            ge=0.1, 
            alias=InstituteDataColumns.time_phase_1,
            nullable=False,
            coerce=True, 
            description="Time measurement in seconds within the selected phase."
        )
        phase_dynamic: Series[float] = pa.Field(
            ge=0.5,
            alias=InstituteDataColumns.phase_dynamic,
            nullable=False,
            coerce=True,
            description="Measurement of phase dynamics."
        )
    
    Schema.validate(valid_data, lazy=True)
    
    try:
        Schema.validate(invalid_data, lazy=True)
    except pa.errors.SchemaErrors as exc:
        display(exc.failure_cases)
    

    Now Schema._collect_fields() should work now.