I'm loading a large data frame (15 columns x 30000 rows) from a tab-separated file using pandas.read_csv(). A sample of the data is shown in the table below (thank you to @Lukas for showing how to present a sample):
ID | Name | Latitude | List | Country |
---|---|---|---|---|
1 | New York | 40 | a | United States (USA) |
2 | London | -97 | c/d/e | United Kingdom (UK) |
three | Rome | 42 | f/g | Italy (I) |
4 | Los Angeles | 34N | f/g | USA |
Would yield something like this after validation:
ID | Name | Latitude | List | Country | Code |
---|---|---|---|---|---|
1 | New York | 40 | [a] | United States | USA |
2 | London | NaN | [c,d,e] | United Kingdom | UK |
ROW REMOVED | |||||
4 | Los Angeles | NaN | [f,g] | NaN | NaN |
I would like to validate every row and toss out rows that don't pass validation. For instance, I need to make sure that certain columns contain floats. I originally thought of using the dtype parameter of read_csv() like this:
df = pd.read_csv(filename, sep='\t', dtype={"cola": str, "colb": float, ...})
However, if read_csv encounters a non-float entry in a float column, it throws an exception and stops processing.
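For illustration, a minimal reproduction of that failure with some made-up two-row data:
import io
import pandas as pd

bad = 'Name\tLatitude\nLondon\t-97\nLos Angeles\t34N\n'
try:
    pd.read_csv(io.StringIO(bad), sep='\t', dtype={'Latitude': float})
except ValueError as exc:
    print(exc)  # e.g. "could not convert string to float: '34N'"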
There are several types of checks I need to do on a per-row basis. Depending on the error, I may need to toss out the row or simply NaN the bad value. I also need to keep track of the failed validations and report them.
Currently I have implemented:
df.apply(lambda x: row_processor(x, error_log))
where row_processor() is the function that does all the checks, returns an updated row (if appropriate), and appends an error_log entry.
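For concreteness, a rough sketch of what that pattern might look like (the column name and log format here are just illustrative, and note that axis=1 is needed so apply works row by row rather than column by column):
import numpy as np
import pandas as pd

def row_processor(row, error_log):
    # hypothetical per-row check: coerce Latitude, NaN out and log anything invalid
    lat = pd.to_numeric(row['Latitude'], errors='coerce')
    if pd.isna(lat) or not (-90 <= lat <= 90):
        error_log.append({'index': row.name, 'column': 'Latitude', 'value': row['Latitude']})
        lat = np.nan
    row['Latitude'] = lat
    return row

df = pd.DataFrame({'Name': ['New York', 'London'], 'Latitude': ['40', '-97']})
error_log = []
df = df.apply(lambda row: row_processor(row, error_log), axis=1)
print(error_log)  # [{'index': 1, 'column': 'Latitude', 'value': '-97'}]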
Are there any flaws with that approach? Other ideas I've had are just to brute-force a for loop across the rows, or to use df.apply() on one column at a time with a function specific to that column.
Is there data validation functionality with pandas that I'm missing?
Given that there is no runnable example, here is what I would do with some made-up data. I do not consider 30k rows large; it can easily be handled in memory on any machine.
So I would load the data into memory and then perform each check one after another. Since all checks need to pass for a row to survive, it is fine to remove rows based on each check separately.
import pandas as pd

data = {
    'ID': ['1', '2', 'three', '4', '5'],
    'Latitude': ['45.0', '-100.0', '85.5', 'abc', '60.0'],
    'Name': ['John Doe', 'Jane Smith', 'Alice', 'Bob Johnson', 'Charlie Brown'],
    'Tags': ['a/b/c', 'd/e', 'f', 'g/h/i/j', 'k/l']
}
df = pd.DataFrame(data)
# perform checks:
# 1. convert types
df['ID'] = pd.to_numeric(df['ID'], errors='coerce')
df['Latitude'] = pd.to_numeric(df['Latitude'], errors='coerce')
# 2. check ranges
df = df[df['Latitude'].between(-90, 90)].copy()
# 3. remove NaN (i.e. non-numeric IDs)
df = df[df['ID'].notnull()]
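# (sketch) to report which rows failed the checks so far, one option is to
# compare against the raw input -- the names below are just illustrative:
raw = pd.DataFrame(data)
removed = raw.loc[~raw.index.isin(df.index)]
print(removed)  # rows dropped by the range and ID checks above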
# 4. split a delimited column into lists
def split_tags(tag_str):
    return tag_str.split('/') if isinstance(tag_str, str) else []

df['TagsList'] = df['Tags'].apply(split_tags)
# further checks ...
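# (sketch) the question's sample also has a Country column such as "United States (USA)";
# the abbreviation in parentheses could be pulled out with a regex, for example:
country = pd.Series(['United States (USA)', 'United Kingdom (UK)'])
code = country.str.extract(r'\((\w+)\)', expand=False)          # 'USA', 'UK'
name = country.str.replace(r'\s*\(\w+\)\s*$', '', regex=True)   # 'United States', 'United Kingdom'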
Hope that helps as guidance.