python machine-learning scikit-learn

Getting "TypeError: ufunc 'isnan' not supported for the input types"


I am working on a machine learning project in a Jupyter Notebook to predict the prices of electric cars.

I run these cells:

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
cols = ['County', 'City', 'State', 'ZIP Code', 'Model Year', 'Make', 'Model', 'Electric Vehicle Type', 'Clean Alternative Fuel Vehicle (CAFV) Eligibility']
for col in cols:
    le.fit(t[col])
    x[col] = le.transform(x[col]) 
    print(le.classes_)

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.5, random_state = 0)

r2_score(y_test, lm.predict(x_test))


from sklearn.tree import DecisionTreeRegressor 
regressor = DecisionTreeRegressor(random_state = 0) 
regressor.fit(x_train, y_train)
r2_score(y_test, regressor.predict(x_test))


r2_score(y_train, regressor.predict(x_train))

uv = np.nanpercentile(df2['Base MSRP'], [99])[0]*2


df2['Base MSRP'][(df2['Base MSRP']>uv)] = uv


df2 = df2[df2['Model Year'] != 'N/']  # Filter out rows where 'Model Year' is 'N/'

for col in cols:
    df2[col] = df2[col].replace('N/', -1)
    le.fit(df2[col])
    df2[col] = le.transform(df2[col]) 
    print(le.classes_)

le = preprocessing.LabelEncoder()

cols = ['County', 'City', 'State', 'ZIP Code', 'Model Year', 'Make', 'Model', 'Electric Vehicle Type', 'Clean Alternative Fuel Vehicle (CAFV) Eligibility']

for col in cols:
    le.fit(t[col])
    df2[col] = le.transform(df2[col]) 
    print(le.classes_)

I get this error:

TypeError                                 Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_16424\1094749331.py in <module>
      1 for col in cols:
      2     le.fit(t[col])
----> 3     df2[col] = le.transform(df2[col])
      4     print(le.classes_)

~\.conda\envs\electricvehiclepriceprediction\lib\site-packages\sklearn\preprocessing\_label.py in transform(self, y)
    136             return np.array([])
    137 
--> 138         return _encode(y, uniques=self.classes_)
    139 
    140     def inverse_transform(self, y):

~\.conda\envs\electricvehiclepriceprediction\lib\site-packages\sklearn\utils\_encode.py in _encode(values, uniques, check_unknown)
    185     else:
    186         if check_unknown:
--> 187             diff = _check_unknown(values, uniques)
    188             if diff:
    189                 raise ValueError(f"y contains previously unseen labels: {str(diff)}")

~\.conda\envs\electricvehiclepriceprediction\lib\site-packages\sklearn\utils\_encode.py in _check_unknown(values, known_values, return_mask)
    259 
    260         # check for nans in the known_values
--> 261         if np.isnan(known_values).any():
    262             diff_is_nan = np.isnan(diff)
    263             if diff_is_nan.any():

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

What did I try?

I tried using the following code:

le = preprocessing.LabelEncoder()
cols = ['County', 'City', 'State', 'ZIP Code', 'Model Year', 'Make', 'Model', 'Electric Vehicle Type', 'Clean Alternative Fuel Vehicle (CAFV) Eligibility']
for col in cols:
    le.fit(t[col])
    df2[col] = le.transform(df2[col]) 
    print(le.classes_)

This code produces the error shown above.

To fix the issue, I tried imputing the missing value ("N/") instead of removing it by using this code:

for col in cols:
  le.fit(t[col].fillna('Missing'))  # Impute missing values with 'Missing'
  df2[col] = le.transform(df2[col].fillna('Missing'))
  print(le.classes_)

But I still get the same error.

Here is the link to my notebook: https://github.com/SteveAustin583/electric-vehicle-price-prediction-revengers/blob/main/revengers.ipynb

Here is the link to the dataset: https://www.kaggle.com/datasets/rithurajnambiar/electric-vehicle-data

How can I fix this issue?


Solution
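
Some context on where the error comes from. This is one reading of the traceback, not something confirmed against the notebook: the last frame calls np.isnan on the encoder's known classes, and np.isnan only accepts numeric input. If the encoder was fitted on string labels from one DataFrame while the column being transformed had already been converted to numbers in an earlier cell, that call fails. The sketch below reproduces only the NumPy part of the failure; the county names are made-up stand-ins for le.classes_.

```python
import numpy as np

# Hypothetical stand-in for le.classes_ after fitting on string labels
known_values = np.array(["King", "Pierce", "Snohomish"])

try:
    # np.isnan is defined for numeric dtypes only, so a string array is rejected
    np.isnan(known_values)
    raised = False
    message = ""
except TypeError as exc:
    raised = True
    message = str(exc)

print(raised)
print(message)
```

The message printed here starts the same way as the one in the question, which is why fitting and transforming on the same, not yet encoded DataFrame avoids it.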

  • Don't mix the DataFrames df/df2/t when working with the training set, and encode the training set only once. The code below does not raise any Python error:

    import pandas as pd
    from sklearn import preprocessing

    df_train = pd.read_csv('train.csv', header=0)
    le = preprocessing.LabelEncoder()
    cols = ['County', 'City', 'State', 'ZIP Code', 'Model Year', 'Make', 'Model',
            'Electric Vehicle Type', 'Clean Alternative Fuel Vehicle (CAFV) Eligibility']

    # Fit and transform each column on the same DataFrame, exactly once
    for col in cols:
        le.fit(df_train[col])
        df_train[col] = le.transform(df_train[col])
    

    However, the scikit-learn documentation says about LabelEncoder: "This transformer should be used to encode target values, i.e. y, and not the input X." Instead, you can use OrdinalEncoder, which gives the same codes as above (returned as floats). It also makes the code for the training set much shorter: once the encoder is created, you need only one line.

    oe = preprocessing.OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
    df_train[cols] = oe.fit_transform(df_train[cols])
    

    Note that you have to encode the test set as well, but with the encoder that was fitted on the training set, to avoid information leakage from the test set into the training set. In other words, call only oe.transform on the test set, never fit or fit_transform.

    I passed the two parameters handle_unknown and unknown_value when creating oe so that it writes -1 whenever it finds a category in the test set that it did not see during fitting. Without them, an unknown category raises an error.
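
The fit-on-train, transform-on-test pattern described above can be sketched as follows. The two small DataFrames and their values are invented for illustration and do not come from the dataset; only the OrdinalEncoder settings are taken from the answer.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

cols = ["Make", "State"]

# Made-up toy data standing in for the real training and test sets
df_train = pd.DataFrame({"Make": ["Tesla", "Nissan", "Tesla"],
                         "State": ["WA", "WA", "OR"]})
df_test = pd.DataFrame({"Make": ["Nissan", "Rivian"],   # "Rivian" is unseen in training
                        "State": ["OR", "WA"]})

oe = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Learn the categories from the training set only
train_encoded = oe.fit_transform(df_train[cols])

# Reuse the fitted encoder on the test set; unseen categories become -1
test_encoded = oe.transform(df_test[cols])

print(train_encoded)
print(test_encoded)
```

Categories are sorted per column, so here Nissan is 0 and Tesla is 1, OR is 0 and WA is 1, and the unseen "Rivian" is encoded as -1 instead of raising an error.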

    See the documentation for OrdinalEncoder at https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html

    PS: Before encoding, you should deal with the missing values (impute/delete/...).
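
As a rough sketch of that cleanup step, assuming "N/" is the placeholder the dataset uses for missing entries, as the question suggests: fillna only touches real NaN values, so a string placeholder such as "N/" has to be replaced explicitly. The toy DataFrame and the "Missing" label are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

cols = ["Model Year", "Make"]

# Made-up rows mixing a string placeholder ("N/") with real NaN values
df = pd.DataFrame({"Model Year": ["2019", "N/", np.nan],
                   "Make": ["Tesla", np.nan, "Nissan"]})

cleaned = (
    df[cols]
    .replace("N/", "Missing")   # string placeholder: fillna would leave it untouched
    .fillna("Missing")          # real NaN values
    .astype(str)                # one consistent type per column before encoding
)

print(cleaned)
```

After this step every column holds only strings, so an encoder fitted on it never has to compare strings with floats.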