python pandas pandera

How to validate a dataframe in pandera using multiple columns


I have the following dataframe and need to validate it to check whether any rows have both the Name and Tag columns null at the same time. I tried the following, but the indexes where it fails are 0 and 2.

    import pandas as pd
    import pandera as pa

    data = [['Alex', 10, 't1'], ['Bob', 12, None], ['Clarke', 13, 't3'], [None, 14, 't3'], [None, 15, None]]
    df = pd.DataFrame(data, columns=['Name', 'Age', 'Tag'])
    schema = pa.DataFrameSchema(
        checks=pa.Check(lambda df: ~(pd.notnull(df["Name"]) & pd.notnull(df["Tag"])))
    )

    try:
        # lazy=True collects all failures and raises SchemaErrors
        schema.validate(df, lazy=True)
    except pa.errors.SchemaErrors as err:
        print("Schema errors and failure cases:")
        print(err.failure_cases)

I want the above code to return index 4. How should I write the check for the pandera schema?


Solution

  • As per the docs on Handling null values,

    By default, pandera drops null values before passing the objects to validate into the check function. For Series objects null elements are dropped (this also applies to columns), and for DataFrame objects, rows with any null value are dropped.

    If you want to check the properties of a pandas data structure while preserving null values, specify Check(..., ignore_na=False) when defining a check.

    So add ignore_na=False to the check, and test for both columns being null:

    schema = pa.DataFrameSchema(
        checks=pa.Check(
            lambda df: ~(df["Name"].isnull() & df["Tag"].isnull()),
            ignore_na=False,
        )
    )