[SOLVED] Pandas read_csv: Skip rows contains invalid data that can cause data

Pandas read_csv: Skip rows contains invalid data that can cause data_type parsing errors

The csv file can contain string values to certain integer columns and i want to ignore/handle via callback if that happens, tried using on_bad_lines='skip/warn' however it gets triggered only on wherever there are parsing issues due to delimiter/num of columns mismatch. Here is the code i am using

csv:

id,name,age
1,,"25"
2,"",30
jj,John,

import pandas as pd
schema = {'id':'int64','name':'str','age':'int'}
df = pd.read_csv('a.csv', dtype=schema,  on_bad_lines='warn', engine='python')

Solution

I assume you mean making NaN values by ignore. For that, you can first extract as a String, and then convert to a numeric with 'coerce' for the errors parameter in pandas which will turn those values to a NaN.

df = pd.read_csv('a.csv', dtype=str,  on_bad_lines='warn', engine='python')
df['id'] = pd.to_numeric(df['id'], errors='coerce')
df['age'] = pd.to_numeric(df['age'], errors='coerce')

If you don't want NaN values, you can choose one of the values for the errors parameter in the documentation: https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html

'coerce' converts those to a NaN, 'raise' (default) raises an exception, and 'ignore' just keeps the string as-is.

Since using 'ignore' is depreciated, an alternative is this:

def safe_to_numeric(val):
    try:
        return pd.to_numeric(val)
    except Exception:
        return val # or whatever you wanna do with that value (return NaN, return 0, etc)

and then

df['id'] = df['id'].apply(safe_to_numeric)
df['age'] = df['age'].apply(safe_to_numeric)

If you use the 'ignore' logic (custom or built-in), this is the output:

   id  name   age
0   1   NaN  25.0
1   2   NaN  30.0
2  jj  John   NaN

If you use the errors='coerce' , this is the output:

    id  name   age
0  1.0   NaN  25.0
1  2.0   NaN  30.0
2  NaN  John   NaN

And the errors='raise' exception:

ValueError: Unable to parse string "jj" at position 0