python pandas scikit-learn one-hot-encoding label-encoding

ValueError: Input contains NaN when trying to build a machine learning model


I am trying to build a prediction model, but I keep getting this error: raise ValueError("Input contains NaN") ValueError: Input contains NaN. I tried np.all(np.isfinite(dataframe)) and np.any(np.isnan(dataframe)) to check for missing values, but those just raise new errors. For example: TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule 'safe'.

Here is the code so far:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
import numpy as np

dataframe = pd.read_csv('file.csv', delimiter=',')

le = LabelEncoder()
dfle = dataframe

dfle2 = dfle.apply(lambda col: le.fit_transform(col.astype(str)), axis=0, result_type='expand')

newdf = dfle2[['column1', 'column2', 'column3', 'column4', 'column5', 'column6', 'column7']]

X = dataframe[['column1', 'column2', 'column4', 'column5', 'column6', 'column7']].values

y = dfle.column3

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ohe = OneHotEncoder()

ColumnTransformer([('encoder', OneHotEncoder(), [0])], remainder='passthrough')
# np.all(np.isfinite(dfle))
# np.any(np.isnan(dfle))
X = ohe.fit_transform(X).toarray()

Solution

  • There are several ways to deal with this error. First, you can fill the NaN values with 0 when loading the data:

    dataframe = pd.read_csv('file.csv', delimiter=',').fillna(0)

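    Alternatively, you can use scikit-learn's imputation techniques to fill the NaN values:

    https://scikit-learn.org/stable/modules/classes.html#module-sklearn.impute

    Several imputers are available there. KNNImputer is a good choice for numeric columns, but it does not accept string data, so text columns need a different strategy, such as SimpleImputer with strategy='most_frequent'.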
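The imputation approach could look roughly like the sketch below. It is only an illustration under some assumptions: the small DataFrame is a hypothetical stand-in for file.csv (the real column contents are not shown in the question), and n_neighbors=2 is chosen arbitrarily for the tiny sample. It also uses dataframe.isna() to locate missing values, since np.isnan and np.isfinite raise a TypeError on object (string) columns, which is probably what happened in the question.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical stand-in for file.csv: two numeric columns and one text column.
dataframe = pd.DataFrame({
    "column1": [1.0, 2.0, np.nan, 4.0],
    "column2": [10.0, 20.0, 30.0, 40.0],
    "column3": ["a", np.nan, "b", "a"],
})

# isna() works on every dtype, including object columns where np.isnan fails.
print(dataframe.isna().sum())

numeric_cols = dataframe.select_dtypes(include="number").columns
other_cols = dataframe.columns.difference(numeric_cols)

# KNNImputer only accepts numeric input, so restrict it to the numeric columns.
knn = KNNImputer(n_neighbors=2)  # n_neighbors=2 is an arbitrary choice here
dataframe[numeric_cols] = knn.fit_transform(dataframe[numeric_cols])

# Text columns: fill each gap with the most frequent value of that column.
most_frequent = SimpleImputer(strategy="most_frequent")
dataframe[other_cols] = most_frequent.fit_transform(dataframe[other_cols])

# No missing values should remain before encoding or model fitting.
assert not dataframe.isna().any().any()
```

Whichever approach you pick, apply it before building X, so that the array passed to the encoder and the model no longer contains NaN.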