My CSV file can contain string values in columns that should be integers, and I want to ignore those values or handle them via a callback when that happens. I tried using on_bad_lines='skip'/'warn', but that only gets triggered where there are parsing issues due to a delimiter or column-count mismatch. Here is the code I am using.
a.csv:
id,name,age
1,,"25"
2,"",30
jj,John,
import pandas as pd
schema = {'id':'int64','name':'str','age':'int'}
df = pd.read_csv('a.csv', dtype=schema, on_bad_lines='warn', engine='python')
I assume that by "ignore" you mean turning those values into NaN. For that, you can first read the columns as strings, and then convert them to numeric with errors='coerce', which turns unparseable values into NaN:
df = pd.read_csv('a.csv', dtype=str, on_bad_lines='warn', engine='python')
df['id'] = pd.to_numeric(df['id'], errors='coerce')
df['age'] = pd.to_numeric(df['age'], errors='coerce')
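If you would rather keep id and age as integer columns despite the missing values, one option (my addition, not part of the original question) is pandas' nullable Int64 dtype, which allows NaN alongside integers:

```python
import io
import pandas as pd

# Inline copy of the sample a.csv from the question
csv_data = 'id,name,age\n1,,"25"\n2,"",30\njj,John,\n'

df = pd.read_csv(io.StringIO(csv_data), dtype=str, engine='python')
# 'coerce' turns unparseable strings into NaN; astype('Int64')
# (capital I, the nullable extension dtype) keeps the valid
# entries as integers instead of floats
df['id'] = pd.to_numeric(df['id'], errors='coerce').astype('Int64')
df['age'] = pd.to_numeric(df['age'], errors='coerce').astype('Int64')
print(df)
```

With this, valid rows print as 1, 2, 25, 30 rather than 1.0, 2.0, 25.0, 30.0, and the bad "jj" cell shows as &lt;NA&gt;.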
If you don't want NaN values, you can choose one of the other values for the errors parameter, listed in the documentation: https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html
'coerce' converts unparseable values to NaN, 'raise' (the default) raises an exception, and 'ignore' just keeps the string as-is.
Since 'ignore' is deprecated, an alternative is this:
def safe_to_numeric(val):
    try:
        return pd.to_numeric(val)
    except Exception:
        return val  # or whatever you want to do with that value (return NaN, return 0, etc.)
and then
df['id'] = df['id'].apply(safe_to_numeric)
df['age'] = df['age'].apply(safe_to_numeric)
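Putting those pieces together as a runnable end-to-end sketch (reading the sample CSV from a string here, instead of from a.csv):

```python
import io
import pandas as pd

def safe_to_numeric(val):
    """Convert to a number if possible; otherwise keep the value as-is."""
    try:
        return pd.to_numeric(val)
    except (ValueError, TypeError):
        return val  # or return NaN, return 0, log it, etc.

# Inline copy of the sample a.csv from the question
csv_data = 'id,name,age\n1,,"25"\n2,"",30\njj,John,\n'

# Read everything as strings first, then apply the callback per column
df = pd.read_csv(io.StringIO(csv_data), dtype=str, engine='python')
df['id'] = df['id'].apply(safe_to_numeric)
df['age'] = df['age'].apply(safe_to_numeric)
print(df)
```

Note that because "jj" stays a string, the id column ends up with object dtype (a mix of numbers and strings), which is worth keeping in mind if you do arithmetic on it later.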
If you use the 'ignore' logic (custom or built-in), this is the output:
id name age
0 1 NaN 25.0
1 2 NaN 30.0
2 jj John NaN
If you use errors='coerce', this is the output:
id name age
0 1.0 NaN 25.0
1 2.0 NaN 30.0
2 NaN John NaN
And errors='raise' raises this exception:
ValueError: Unable to parse string "jj" at position 0
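Since you specifically asked about handling bad values via a callback: read_csv also has a converters parameter, which runs a function of your choice on every raw string value in a column at parse time, so you can decide right there what a bad value becomes. A sketch of that approach (int_or_none is a name I made up; note that converters takes precedence over dtype for the same column):

```python
import io
import pandas as pd

def int_or_none(raw):
    """Per-cell callback: raw is the unparsed string from the file.
    Bad or empty values become None instead of raising."""
    try:
        return int(raw)
    except ValueError:
        return None  # or log the value, substitute a default, etc.

# Inline copy of the sample a.csv from the question
csv_data = 'id,name,age\n1,,"25"\n2,"",30\njj,John,\n'

df = pd.read_csv(io.StringIO(csv_data),
                 converters={'id': int_or_none, 'age': int_or_none},
                 engine='python')
print(df)
```

This keeps the "handle it via callback" logic in one place instead of post-processing each column after the read.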