pythonpandasmachine-learningscikit-learnsmote

TypeError: Encoders require their input to be uniformly strings or numbers. Got ['int', 'str']


I already referred the posts here, here and here. Don't mark it as duplicate.

I am working on a binary classification problem where my dataset has categorical and numerical columns.

However, some of the categorical columns has a mix of numeric and string values. Nontheless, they only indicate the category name.

For instance, I have a column called biz_category which has values like A,B,C,4,5 etc.

I guess the below error is thrown due to values like 4 and 5.

Therefore, I tried the belowm to convert them into category datatype. (but still it doesn't work)

cols=X_train.select_dtypes(exclude='int').columns.to_list()
X_train[cols]=X_train[cols].astype('category')

And my data info looks like below

<class 'pandas.core.frame.DataFrame'>
Int64Index: 683 entries, 21 to 965
Data columns (total 9 columns):
 #   Column                                           Non-Null Count  Dtype   
---  ------                                           --------------  -----   
 0   Feature_A                                        683 non-null    category
 1   Product Classification                           683 non-null    category
 2   Industry                                         683 non-null    category
 3   DIVISION                                         683 non-null    category
 4   biz_category                                     683 non-null    category
 5   Country                                          683 non-null    category
 6   Product segment                                  683 non-null    category
 7   SUBREGION                                        683 non-null    category
 8   Quantity 1st year                                683 non-null    int64   
dtypes: category(8), int64(1)

So, after dtype conversion, when I try the below SMOTENC, I get an error

print("Before OverSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train == 0)))
cat_index = [0,1,2,3,4,5,6,7]
# import SMOTE module from imblearn library
# pip install imblearn (if you don't have imblearn in your system)
from imblearn.over_sampling import SMOTE, SMOTENC
sm = SMOTENC(categorical_features=cat_index,random_state = 2,sampling_strategy = 'minority')
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

This results in an error as shown below

--------------------------------------------------------------------------- TypeError Traceback (most recent call last) ~\AppData\Roaming\Python\Python39\site-packages\sklearn\utils_encode.py in _unique_python(values, return_inverse) 134 --> 135 uniques = sorted(uniques_set) 136 uniques.extend(missing_values.to_list())

TypeError: '<' not supported between instances of 'str' and 'int'

During handling of the above exception, another exception occurred:

TypeError Traceback (most recent call last) C:\Users\SATHAP~1\AppData\Local\Temp/ipykernel_31168/1931674352.py in 6 from imblearn.over_sampling import SMOTE, SMOTENC 7 sm = SMOTENC(categorical_features=cat_index,random_state = 2,sampling_strategy = 'minority') ----> 8 X_train_res, y_train_res = sm.fit_resample(X_train, y_train) 9 10 print('After OverSampling, the shape of train_X: {}'.format(X_train_res.shape))

~\AppData\Roaming\Python\Python39\site-packages\imblearn\base.py in fit_resample(self, X, y) 81 ) 82 ---> 83 output = self.fit_resample(X, y) 84 85 y = (

~\AppData\Roaming\Python\Python39\site-packages\imblearn\over_sampling_smote\base.py in fit_resample(self, X, y) 511 512 # the input of the OneHotEncoder needs to be dense --> 513 X_ohe = self.ohe.fit_transform( 514 X_categorical.toarray() if sparse.issparse(X_categorical) else X_categorical 515 )

~\AppData\Roaming\Python\Python39\site-packages\sklearn\preprocessing_encoders.py in fit_transform(self, X, y) 486 """ 487 self._validate_keywords() --> 488 return super().fit_transform(X, y) 489 490 def transform(self, X):

~\AppData\Roaming\Python\Python39\site-packages\sklearn\base.py in fit_transform(self, X, y, **fit_params) 850 if y is None: 851 # fit method of arity 1 (unsupervised transformation) --> 852 return self.fit(X, **fit_params).transform(X) 853 else: 854 # fit method of arity 2 (supervised transformation)

~\AppData\Roaming\Python\Python39\site-packages\sklearn\preprocessing_encoders.py in fit(self, X, y) 459 """ 460 self._validate_keywords() --> 461 self.fit(X, handle_unknown=self.handle_unknown, force_all_finite="allow-nan") 462 self.drop_idx = self._compute_drop_idx() 463 return self

~\AppData\Roaming\Python\Python39\site-packages\sklearn\preprocessing_encoders.py in _fit(self, X, handle_unknown, force_all_finite) 92 Xi = X_list[i] 93 if self.categories == "auto": ---> 94 cats = _unique(Xi) 95 else: 96 cats = np.array(self.categories[i], dtype=Xi.dtype)

~\AppData\Roaming\Python\Python39\site-packages\sklearn\utils_encode.py in _unique(values, return_inverse) 29 """ 30 if values.dtype == object: ---> 31 return _unique_python(values, return_inverse=return_inverse) 32 # numerical 33 out = np.unique(values, return_inverse=return_inverse)

~\AppData\Roaming\Python\Python39\site-packages\sklearn\utils_encode.py in _unique_python(values, return_inverse) 138 except TypeError: 139 types = sorted(t.qualname for t in set(type(v) for v in values)) --> 140 raise TypeError( 141 "Encoders require their input to be uniformly " 142 f"strings or numbers. Got {types}"

TypeError: Encoders require their input to be uniformly strings or numbers. Got ['int', 'str']

Should I transform y_train into categorical as well? Currently, it is int64.

Help please


Solution

  • Cause of the problem

    SMOTE requires the values in each categorical/numerical column to have uniform datatype. Essentially you can not have mixed datatypes in any of the column in this case your biz_category column. Also merely casting the column to categorical type does not necessarily mean that the values in that column will have uniform datatype.

    Possible solution

    One possible solution to this problem is to re-encode the values in those columns which have mixed data types for example you could use lableencoder but I think in your case simply changing the dtype to string would also work.