I'm loading a large data frame (15 columns x 30000 rows) from a tab-separated file using pandas.read_csv(). A sample of the data is shown in the table below (thank you to @Lukas for showing how to present a sample):
ID | Name | Latitude | List | Country |
---|---|---|---|---|
1 | New York | 40 | a | United States (USA) |
2 | London | -97 | c/d/e | United Kingdom (UK) |
three | Rome | 42 | f/g | Italy (I) |
4 | Los Angeles | 34N | f/g | USA |
Would yield something like this after validation:
ID | Name | Latitude | List | Country | Code |
---|---|---|---|---|---|
1 | New York | 40 | [a] | United States | USA |
2 | London | NaN | [c,d,e] | United Kingdom | UK |
ROW REMOVED | |||||
4 | Los Angeles | NaN | [f,g] | NaN | NaN |
I would like to validate every row and toss out rows that don't pass validation. For instance, I need to make sure that certain columns contain floats. I originally thought of using the dtype parameter of read_csv() like this:
df = pd.read_csv(filename, sep='\t', dtype={"cola": str, "colb": float, ...})
However, if read_csv encounters a non-float entry in a float column, it throws an exception and stops processing.
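For illustration, a minimal reproduction of that failure with some made-up two-row data:
import io
import pandas as pd

bad = 'Name\tLatitude\nLondon\t-97\nLos Angeles\t34N\n'
try:
    pd.read_csv(io.StringIO(bad), sep='\t', dtype={'Latitude': float})
except ValueError as exc:
    print(exc)  # e.g. "could not convert string to float: '34N'"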
There are several types of checks I need to do on a per-row basis. Depending on the error, I may need to toss out the row or simply NaN the bad value. I also need to keep track of the failed validations and report them.
Currently I have implemented:
df.apply(lambda x: row_processor(x, error_log))
where row_processor() is the function that does all the checks, returns an updated row (if appropriate), and appends an error_log entry.
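For concreteness, a rough sketch of what that pattern might look like (the column name and log format here are just illustrative, and note that axis=1 is needed so apply works row by row rather than column by column):
import numpy as np
import pandas as pd

def row_processor(row, error_log):
    # hypothetical per-row check: coerce Latitude, NaN out and log anything invalid
    lat = pd.to_numeric(row['Latitude'], errors='coerce')
    if pd.isna(lat) or not (-90 <= lat <= 90):
        error_log.append({'index': row.name, 'column': 'Latitude', 'value': row['Latitude']})
        lat = np.nan
    row['Latitude'] = lat
    return row

df = pd.DataFrame({'Name': ['New York', 'London'], 'Latitude': ['40', '-97']})
error_log = []
df = df.apply(lambda row: row_processor(row, error_log), axis=1)
print(error_log)  # [{'index': 1, 'column': 'Latitude', 'value': '-97'}]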
Are there any flaws with that approach? Other ideas I've had are just to brute-force a for loop across the rows, or to use df.apply() on one column at a time with a function specific to that column.
Is there data validation functionality with pandas that I'm missing?
Given that there is no runnable example, here is what I would do with some made-up data. I do not consider 30k rows large; it can easily be handled in memory on any machine.
So I would load the data into memory and then perform each check one after another. Since all checks need to pass for a row to survive, it is fine to remove rows based on each check separately.
import pandas as pd

data = {
    'ID': ['1', '2', 'three', '4', '5'],
    'Latitude': ['45.0', '-100.0', '85.5', 'abc', '60.0'],
    'Name': ['John Doe', 'Jane Smith', 'Alice', 'Bob Johnson', 'Charlie Brown'],
    'Tags': ['a/b/c', 'd/e', 'f', 'g/h/i/j', 'k/l']
}
df = pd.DataFrame(data)
# perform checks:
# 1. convert types
df['ID'] = pd.to_numeric(df['ID'], errors='coerce')
df['Latitude'] = pd.to_numeric(df['Latitude'], errors='coerce')
# 2. check ranges
df = df[df['Latitude'].between(-90, 90)].copy()
# 3. remove NaN (i.e. non-numeric IDs)
df = df[df['ID'].notnull()]
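# (sketch) to report which rows failed the checks so far, one option is to
# compare against the raw input -- the names below are just illustrative:
raw = pd.DataFrame(data)
removed = raw.loc[~raw.index.isin(df.index)]
print(removed)  # rows dropped by the range and ID checks above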
# 4. split a delimited column into lists
def split_tags(tag_str):
    return tag_str.split('/') if isinstance(tag_str, str) else []

df['TagsList'] = df['Tags'].apply(split_tags)
# further checks ...
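# (sketch) the question's sample also has a Country column such as "United States (USA)";
# the abbreviation in parentheses could be pulled out with a regex, for example:
country = pd.Series(['United States (USA)', 'United Kingdom (UK)'])
code = country.str.extract(r'\((\w+)\)', expand=False)          # 'USA', 'UK'
name = country.str.replace(r'\s*\(\w+\)\s*$', '', regex=True)   # 'United States', 'United Kingdom'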
Hope that helps as guidance.