[python] [pandas] [dataframe] [validation]

Best way to validate an incoming dataframe (via TSV) row by row


I'm loading a large data frame (15 columns x 30000 rows) from a tab-separated file using pandas.read_csv(). A sample data snippet is shown in the table below (thanks to @Lukas for showing how to include a sample):

    ID     Name         Latitude  List   Country
    1      New York     40        a      United States (USA)
    2      London       -97       c/d/e  United Kingdom (UK)
    three  Rome         42        f/g    Italy (I)
    4      Los Angeles  34N       f/g    USA

After validation, it would yield something like this:

    ID  Name         Latitude  List     Country         Code
    1   New York     40        [a]      United States   USA
    2   London       NaN       [c,d,e]  United Kingdom  UK
    --- ROW REMOVED ---
    4   Los Angeles  NaN       [f,g]    NaN             NaN

I would like to validate every row and toss out rows that don't pass validation. For instance, I need to make sure that certain columns contain floats. I originally thought of using the dtype parameter of read_csv(), like this:

    df = pd.read_csv(filename, sep='\t', dtype={"cola": str, "colb": float, ...})

However, if read_csv() encounters a non-float entry in a float column, it throws an exception and stops processing.
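As a workaround, everything could be read as strings first, so the load itself cannot fail, with type coercion deferred to a later pass (a sketch only; colb stands in for any column that should be float):

    import pandas as pd

    # Read every column as a string; a bad value cannot abort the load.
    df = pd.read_csv(filename, sep='\t', dtype=str)
    # Coerce afterwards; invalid entries become NaN instead of raising.
    df['colb'] = pd.to_numeric(df['colb'], errors='coerce')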

Types of checks I need to do on a per-row basis (as illustrated by the sample above):

  • type checks, e.g. ID and Latitude must be numeric
  • range checks, e.g. Latitude must lie within [-90, 90]
  • splitting delimited strings (the List column) into Python lists
  • splitting the Country column into a name and the code in parentheses

Depending on the error, I may need to toss out the row or simply NaN the bad value. I also need to keep track of the failed validations and report them.

Currently I have implemented:

    df.apply(lambda row: row_processor(row, error_log), axis=1)

where row_processor() is the function that does all the checks and returns an updated row (if appropriate) along with an error_log entry.
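For concreteness, a hypothetical sketch of what row_processor() might look like (the Latitude check stands in for the full set of checks):

    def row_processor(row, error_log):
        # Hypothetical example check: Latitude must parse as a float.
        try:
            row['Latitude'] = float(row['Latitude'])
        except (TypeError, ValueError):
            # Record the row index, column, and offending value, then NaN it.
            error_log.append((row.name, 'Latitude', row['Latitude']))
            row['Latitude'] = float('nan')
        return row

    error_log = []
    df = df.apply(lambda row: row_processor(row, error_log), axis=1)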

Are there any flaws with that approach? Other ideas I've had are to brute-force a for loop across the rows, or to use df.apply() on one column at a time with a function specific to that column.

Is there data validation functionality with pandas that I'm missing?


Solution

  • Here is what I would do, using some made-up sample data. I do not consider 30k rows large; that is easily handled in memory on any machine.

    Thus, I would load the data into memory and then perform each check one after another. Since all checks need to pass, it is fine to remove rows based on each check separately.

    import pandas as pd

    data = {
        'ID': ['1', '2', 'three', '4', '5'],
        'Latitude': ['45.0', '-100.0', '85.5', 'abc', '60.0'],
        'Name': ['John Doe', 'Jane Smith', 'Alice', 'Bob Johnson', 'Charlie Brown'],
        'Tags': ['a/b/c', 'd/e', 'f', 'g/h/i/j', 'k/l']
    }
    df = pd.DataFrame(data)

    # perform checks:
    # 1. convert types (invalid entries become NaN instead of raising)
    df['ID'] = pd.to_numeric(df['ID'], errors='coerce')
    df['Latitude'] = pd.to_numeric(df['Latitude'], errors='coerce')

    # 2. check ranges (NaN latitudes fail between() and are dropped too)
    df = df[df['Latitude'].between(-90, 90)].copy()

    # 3. remove rows with NaN IDs (i.e. non-numeric IDs)
    df = df[df['ID'].notnull()]

    # 4. split a delimited column into a column of lists
    def split_tags(tag_str):
        return tag_str.split('/') if isinstance(tag_str, str) else []
    df['TagsList'] = df['Tags'].apply(split_tags)

    # further checks ...
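    Since the failed validations also need to be reported, a possible extension (a sketch on top of the code above, not something pandas provides out of the box) is to record the offending rows before a filter drops them; step 2 might, for example, become:

    # Log out-of-range (or unparseable) latitudes before dropping them.
    bad_lat = ~df['Latitude'].between(-90, 90)
    error_log = [('Latitude out of range', idx, val)
                 for idx, val in df.loc[bad_lat, 'Latitude'].items()]
    df = df[~bad_lat].copy()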
    

    Hope that helps as guidance.