python-3.x machine-learning imbalanced-data imblearn smote

Using SMOTE-NC with categorical variables only

I am working with a dataframe that contains only categorical features. The following example reproduces the issue:

import pandas as pd

d = {'col1':['a','b','c','a','c','c','c','c','c','c'],
     'col2':['a1','b1','c1','a1','c1','c1','c1','c1','c1','c1'],
     'col3':[1,2,3,2,3,3,3,3,3,3]}
data = pd.DataFrame(d)

I split the data into train and test sets and take col3 as my target variable.

from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(data, test_size=0.2)
train_data = train_data.reset_index(drop=True)
test_data = test_data.reset_index(drop=True)

X_train = train_data.drop(['col3'], axis = 1)
X_test = test_data.drop(['col3'], axis = 1)
y_train = train_data["col3"]
y_test = test_data["col3"]

In X_train, col1 and col2 are my categorical features, at column indices 0 and 1, so I apply SMOTE-NC as follows:

from imblearn.over_sampling import SMOTENC
cat_indx = [0, 1]
sm = SMOTENC(categorical_features=cat_indx, random_state=0)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

for which I get the following error:

ValueError: SMOTE-NC is not designed to work only with categorical features. It requires some numerical features.

How can I tackle this issue, given that SMOTE-NC is meant to handle categorical variables? Note also that my target variable is multiclass rather than binary, although I do not think that causes any problem at this stage.

Solution

Note that the initials NC in the algorithm name stand for Nominal-Continuous; as the error message clearly states, the algorithm is not designed to work with categorical (nominal) features only.

To see why this is so, you have to dig a little into the original SMOTE paper; quoting from the relevant section (emphasis mine):

While our SMOTE approach currently does not handle data sets with all nominal features, it was generalized to handle mixed datasets of continuous and nominal features. We call this approach Synthetic Minority Over-sampling TEchnique-Nominal Continuous [SMOTE-NC]. We tested this approach on the Adult dataset from the UCI repository. The SMOTE-NC algorithm is described below.

Median computation: Compute the median of standard deviations of all continuous features for the minority class. If the nominal features differ between a sample and its potential nearest neighbors, then this median is included in the Euclidean distance computation. We use median to penalize the difference of nominal features by an amount that is related to the typical difference in continuous feature values.

Nearest neighbor computation: Compute the Euclidean distance between the feature vector for which k-nearest neighbors are being identified (minority class sample) and the other feature vectors (minority class samples) using the continuous feature space. For every differing nominal feature between the considered feature vector and its potential nearest-neighbor, include the median of the standard deviations previously computed, in the Euclidean distance computation.
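To make the two quoted steps concrete, here is a minimal pure-Python sketch of the distance they describe. It is not imblearn's implementation; the function names and the toy values are made up for illustration, and the use of the sample standard deviation is an assumption, since the quoted passage does not say which estimator is meant.

```python
import math
import statistics


def median_of_stds(continuous_rows):
    """Step 1: median of the per-feature standard deviations.

    `continuous_rows` holds only the continuous part of each
    minority-class sample, one tuple per sample.
    """
    columns = list(zip(*continuous_rows))  # transpose rows into columns
    # Assumption: sample standard deviation (the quote does not specify).
    return statistics.median(statistics.stdev(col) for col in columns)


def smote_nc_distance(a_cont, a_nom, b_cont, b_nom, med):
    """Step 2: Euclidean distance on the continuous features, with `med`
    entering the computation once for every nominal feature that differs."""
    squared = sum((x - y) ** 2 for x, y in zip(a_cont, b_cont))
    n_differing = sum(1 for x, y in zip(a_nom, b_nom) if x != y)
    squared += n_differing * med ** 2
    return math.sqrt(squared)


# Toy minority-class samples: two continuous features each.
rows = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]
med = median_of_stds(rows)

# Two samples that differ in two of their three nominal features.
dist = smote_nc_distance((1, 2, 3), ("A", "B", "C"),
                         (4, 6, 5), ("A", "D", "E"), med=1.0)
```

The point to notice is that `med` is derived entirely from the continuous columns, so the penalty for differing nominal values has no defined size without them.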

So, in order for the algorithm to work, it needs at least one continuous feature. That is not the case here, so the algorithm rather unsurprisingly fails during step 1 (median computation), since there are no continuous features from which to compute the median.
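As a rough illustration of that failure, the sketch below applies the median step to the two feature columns from the question using only the standard library. It does not reproduce imblearn's internal check (which raises the ValueError shown in the question); `continuous`, `stds`, and `outcome` are names chosen here for the example.

```python
import statistics
from numbers import Number

# Feature columns from the question; col3 is the target and is left out.
X = {
    "col1": ["a", "b", "c", "a", "c", "c", "c", "c", "c", "c"],
    "col2": ["a1", "b1", "c1", "a1", "c1", "c1", "c1", "c1", "c1", "c1"],
}

# Keep only the columns whose values are all numeric, i.e. continuous.
continuous = {
    name: values
    for name, values in X.items()
    if all(isinstance(v, Number) for v in values)
}

try:
    # Step 1: median of the standard deviations of the continuous features.
    stds = [statistics.stdev(values) for values in continuous.values()]
    median_std = statistics.median(stds)
    outcome = "defined"
except statistics.StatisticsError:
    # The median of an empty collection does not exist.
    outcome = "undefined"
```

Both columns are nominal, so `continuous` ends up empty and the median cannot be computed, which is the situation the library guards against with its error message.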