I have already read the posts here, here and here, so please don't mark this as a duplicate.
I am working on a binary classification problem where my dataset has categorical and numerical columns.
However, some of the categorical columns have a mix of numeric and string values. Nonetheless, these values only indicate the category name.
For instance, I have a column called biz_category which has values like A, B, C, 4, 5, etc.
I guess the error below is thrown because of values like 4 and 5.
Therefore, I tried the code below to convert them to the category datatype, but it still doesn't work.
cols=X_train.select_dtypes(exclude='int').columns.to_list()
X_train[cols]=X_train[cols].astype('category')
And my data info looks like below
<class 'pandas.core.frame.DataFrame'>
Int64Index: 683 entries, 21 to 965
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Feature_A 683 non-null category
1 Product Classification 683 non-null category
2 Industry 683 non-null category
3 DIVISION 683 non-null category
4 biz_category 683 non-null category
5 Country 683 non-null category
6 Product segment 683 non-null category
7 SUBREGION 683 non-null category
8 Quantity 1st year 683 non-null int64
dtypes: category(8), int64(1)
So, after the dtype conversion, when I run SMOTENC as below, I get an error
print("Before OverSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train == 0)))
cat_index = [0,1,2,3,4,5,6,7]
# import SMOTE module from imblearn library
# pip install imblearn (if you don't have imblearn in your system)
from imblearn.over_sampling import SMOTE, SMOTENC
sm = SMOTENC(categorical_features=cat_index,random_state = 2,sampling_strategy = 'minority')
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)
This results in an error as shown below
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) ~\AppData\Roaming\Python\Python39\site-packages\sklearn\utils_encode.py in _unique_python(values, return_inverse) 134 --> 135 uniques = sorted(uniques_set) 136 uniques.extend(missing_values.to_list())
TypeError: '<' not supported between instances of 'str' and 'int'
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last) C:\Users\SATHAP~1\AppData\Local\Temp/ipykernel_31168/1931674352.py in 6 from imblearn.over_sampling import SMOTE, SMOTENC 7 sm = SMOTENC(categorical_features=cat_index,random_state = 2,sampling_strategy = 'minority') ----> 8 X_train_res, y_train_res = sm.fit_resample(X_train, y_train) 9 10 print('After OverSampling, the shape of train_X: {}'.format(X_train_res.shape))
~\AppData\Roaming\Python\Python39\site-packages\imblearn\base.py in fit_resample(self, X, y) 81 ) 82 ---> 83 output = self.fit_resample(X, y) 84 85 y = (
~\AppData\Roaming\Python\Python39\site-packages\imblearn\over_sampling_smote\base.py in fit_resample(self, X, y) 511 512 # the input of the OneHotEncoder needs to be dense --> 513 X_ohe = self.ohe.fit_transform( 514 X_categorical.toarray() if sparse.issparse(X_categorical) else X_categorical 515 )
~\AppData\Roaming\Python\Python39\site-packages\sklearn\preprocessing_encoders.py in fit_transform(self, X, y) 486 """ 487 self._validate_keywords() --> 488 return super().fit_transform(X, y) 489 490 def transform(self, X):
~\AppData\Roaming\Python\Python39\site-packages\sklearn\base.py in fit_transform(self, X, y, **fit_params) 850 if y is None: 851 # fit method of arity 1 (unsupervised transformation) --> 852 return self.fit(X, **fit_params).transform(X) 853 else: 854 # fit method of arity 2 (supervised transformation)
~\AppData\Roaming\Python\Python39\site-packages\sklearn\preprocessing_encoders.py in fit(self, X, y) 459 """ 460 self._validate_keywords() --> 461 self.fit(X, handle_unknown=self.handle_unknown, force_all_finite="allow-nan") 462 self.drop_idx = self._compute_drop_idx() 463 return self
~\AppData\Roaming\Python\Python39\site-packages\sklearn\preprocessing_encoders.py in _fit(self, X, handle_unknown, force_all_finite) 92 Xi = X_list[i] 93 if self.categories == "auto": ---> 94 cats = _unique(Xi) 95 else: 96 cats = np.array(self.categories[i], dtype=Xi.dtype)
~\AppData\Roaming\Python\Python39\site-packages\sklearn\utils_encode.py in _unique(values, return_inverse) 29 """ 30 if values.dtype == object: ---> 31 return _unique_python(values, return_inverse=return_inverse) 32 # numerical 33 out = np.unique(values, return_inverse=return_inverse)
~\AppData\Roaming\Python\Python39\site-packages\sklearn\utils_encode.py in _unique_python(values, return_inverse) 138 except TypeError: 139 types = sorted(t.qualname for t in set(type(v) for v in values)) --> 140 raise TypeError( 141 "Encoders require their input to be uniformly " 142 f"strings or numbers. Got {types}"
TypeError: Encoders require their input to be uniformly strings or numbers. Got ['int', 'str']
Should I transform y_train into categorical as well? Currently, it is int64.
Help please
SMOTENC one-hot encodes the categorical columns internally (see the OneHotEncoder call in your traceback), and that encoder requires the values in each column to have a uniform datatype. In other words, you cannot have mixed datatypes within a column, which is the case for your biz_category column. Also, merely casting the column to the category dtype does not make its values uniform: the categories keep their original Python types.
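To illustrate the point about the category dtype, here is a minimal sketch. The small Series is a hypothetical stand-in for your biz_category column (not your real data), and it only inspects the Python types of the values:

```python
import pandas as pd

# Hypothetical stand-in for the biz_category column (mixed str and int values)
s = pd.Series(["A", "B", "C", 4, 5], name="biz_category")

# Casting to 'category' only changes the container dtype;
# the individual values keep their original Python types.
as_cat = s.astype("category")
types_after_category = sorted({type(v).__name__ for v in as_cat})

# Casting to str converts every value, so the column becomes uniform.
as_str = s.astype(str)
types_after_str = sorted({type(v).__name__ for v in as_str})

print(types_after_category)
print(types_after_str)
```

The first list still contains both int and str, which matches the types named in your error message, while the second contains only str.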
One possible solution is to re-encode the values in the columns that have mixed datatypes, for example with LabelEncoder, but I think in your case simply changing the dtype of those columns to string would also work.
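A rough sketch of the string conversion, assuming a pandas DataFrame like your X_train. The helper name stringify_categoricals and the tiny X_demo frame are made up for illustration, and I have not run this against your data:

```python
import pandas as pd


def stringify_categoricals(df: pd.DataFrame, cat_cols: list) -> pd.DataFrame:
    """Return a copy of df in which the given columns hold only str values."""
    out = df.copy()
    out[cat_cols] = out[cat_cols].astype(str)
    return out


# Hypothetical miniature of X_train: one mixed categorical column, one numeric column
X_demo = pd.DataFrame({
    "biz_category": ["A", "B", "C", 4, 5],
    "Quantity 1st year": [10, 20, 30, 40, 50],
})

X_fixed = stringify_categoricals(X_demo, ["biz_category"])

# Every value in the categorical column is now a str
value_types = {type(v).__name__ for v in X_fixed["biz_category"]}
print(value_types)
```

In your case cat_cols would be the eight non-integer columns you already selected into cols. The converted frame can then be passed to sm.fit_resample as in your code, and cat_index stays the same because the column positions do not change.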