Suppose I have a .csv which follows this format:
Name, Salary, Department, Mandatory
Rob, 5500, Aviation, Yes
Bob, 1000, Facilities, No
Tom, 6000, IT, Yes
After exporting this to pandas/modin, I'd like to perform row-differentiated checks, where:
People named Rob working in aviation cannot earn less than 5000
People named Bob working in facilities cannot earn less than 1000
Whoever works in facilities has to report their salary, while people working in aviation or IT can choose to leave their salary unreported.
If any check is violated, we store this in a dataframe and pass forward this case to the human resources department for further investigation.
How would you validate this .csv using Pandera?
Depending on which API you're using, you can check out the wide checks for the object-based API or dataframe checks for the class-based API.
Note: the code snippets below aren't tested, but should be going in the right direction
Class-based API:
import pandera as pa
from pandera.typing as Series
class Schema(pa.SchemaModel):
Name: Series[str]
Salary: Series[int]
Department: Series[str]
Mandatory: Series[str]
@pa.dataframe_check
def rob_aviation_check(cls, df) -> Series[bool]:
return df.loc[df["Name"] == "Rob" & df["Department"] == "Aviation", "Salary"] >= 5000
Object-based API:
schema = DataFrameSchema(
columns={
"Name": pa.Column(str),
"Salary": Pa.Column(int),
...
}
checks=[
pa.Check(lambda df: df.loc[df["Name"] == "Rob" & df["Department"] == "Aviation", "Salary"] >= 5000)
]
)