pandaspandera

How do I validate a value in a dataframe which is dependent on other value in that specific row?


Suppose I have a .csv which follows this format:

Name, Salary, Department, Mandatory

Rob, 5500, Aviation, Yes

Bob, 1000, Facilities, No

Tom, 6000, IT, Yes

After exporting this to pandas/modin, I'd like to perform row-differentiated checks, where:

  1. People named Rob working in aviation cannot earn less than 5000

  2. People named Bob working in facilities cannot earn less than 1000

  3. Whoever works in facilities has to report their salary, while people working in aviation or IT can choose to leave their salary unreported.

  4. If any check is violated, we store this in a dataframe and pass forward this case to the human resources department for further investigation.

How would you validate this .csv using Pandera?


Solution

  • Depending on which API you're using, you can check out the wide checks for the object-based API or dataframe checks for the class-based API.

    Note: the code snippets below aren't tested, but should be going in the right direction

    Class-based API:

    import pandera as pa
    from pandera.typing as Series
    
    class Schema(pa.SchemaModel):
        Name: Series[str]
        Salary: Series[int]
        Department: Series[str]
        Mandatory: Series[str]
        
        @pa.dataframe_check
        def rob_aviation_check(cls, df) -> Series[bool]:
            return df.loc[df["Name"] == "Rob" & df["Department"] == "Aviation", "Salary"] >= 5000
    

    Object-based API:

    schema = DataFrameSchema(
        columns={
            "Name": pa.Column(str),
            "Salary": Pa.Column(int),
            ...
        }
        checks=[
            pa.Check(lambda df: df.loc[df["Name"] == "Rob" & df["Department"] == "Aviation", "Salary"] >= 5000)
        ]
    )