pythonpandas

Pandas read_csv: Skip rows contains invalid data that can cause data_type parsing errors


The csv file can contain string values to certain integer columns and i want to ignore/handle via callback if that happens, tried using on_bad_lines='skip/warn' however it gets triggered only on wherever there are parsing issues due to delimiter/num of columns mismatch. Here is the code i am using

csv:

id,name,age
1,,"25"
2,"",30
jj,John,

import pandas as pd
schema = {'id':'int64','name':'str','age':'int'}
df = pd.read_csv('a.csv', dtype=schema,  on_bad_lines='warn', engine='python')


Solution

  • I assume you mean making NaN values by ignore. For that, you can first extract as a String, and then convert to a numeric with 'coerce' for the errors parameter in pandas which will turn those values to a NaN.

    df = pd.read_csv('a.csv', dtype=str,  on_bad_lines='warn', engine='python')
    df['id'] = pd.to_numeric(df['id'], errors='coerce')
    df['age'] = pd.to_numeric(df['age'], errors='coerce')
    

    If you don't want NaN values, you can choose one of the values for the errors parameter in the documentation: https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html

    'coerce' converts those to a NaN, 'raise' (default) raises an exception, and 'ignore' just keeps the string as-is.

    Since using 'ignore' is depreciated, an alternative is this:

    def safe_to_numeric(val):
        try:
            return pd.to_numeric(val)
        except Exception:
            return val # or whatever you wanna do with that value (return NaN, return 0, etc)
    

    and then

    df['id'] = df['id'].apply(safe_to_numeric)
    df['age'] = df['age'].apply(safe_to_numeric)
    

    If you use the 'ignore' logic (custom or built-in), this is the output:

       id  name   age
    0   1   NaN  25.0
    1   2   NaN  30.0
    2  jj  John   NaN
    

    If you use the errors='coerce' , this is the output:

        id  name   age
    0  1.0   NaN  25.0
    1  2.0   NaN  30.0
    2  NaN  John   NaN
    

    And the errors='raise' exception:

    ValueError: Unable to parse string "jj" at position 0