I'm trying to process a dataset with network attacks that has the following shape:
df.shape
(1074992, 42)
And the labels for the attacks and the normal behaviour have the following counts:
df['Label'].value_counts()
normal 812814
neptune 242149
satan 5019
ipsweep 3723
portsweep 3564
smurf 3007
nmap 1554
back 968
teardrop 918
warezclient 893
pod 206
guesspasswd 53
bufferoverflow 30
warezmaster 20
land 19
imap 12
rootkit 10
loadmodule 9
ftpwrite 8
multihop 7
phf 4
perl 3
spy 2
Name: Label, dtype: int64
Next I'm splitting the dataset into features and labels.
labels = df['Label']
features = df.loc[:, df.columns != 'Label'].astype('float64')
And then try to work on balancing my dataset.
print("Before UpSampling, counts of label Normal: {}".format(sum(labels == "normal")))
print("Before UpSampling, counts of label Attack: {} \n".format(sum(labels != "normal")))
Before UpSampling, counts of label Normal: 812814
Before UpSampling, counts of label Attack: 262178
So, as you can see, the number of attacks is disproportionately small compared to the number of normal behaviours.
I tried using SMOTE to bring the minority (Attack) classes up to the same count as the majority class (Normal).
from imblearn.over_sampling import SMOTE  # Synthetic Minority Over-sampling Technique
sm = SMOTE(k_neighbors=1, random_state=42)
features_res, labels_res = sm.fit_resample(features, labels)
features_res.shape ,labels_res.shape
((18694722, 41), (18694722,))
What I don't understand is why I'm getting 18694722 samples after applying SMOTE.
print("After UpSampling, counts of label Normal: {}".format(sum(labels_res == "normal")))
print("After UpSampling, counts of label Attack: {} \n".format(sum(labels_res != "normal")))
After UpSampling, counts of label Normal: 812814
After UpSampling, counts of label Attack: 17881908
In my case, would it be better to downsample the Normal class or to upsample the Attack classes? Any ideas on how to do this properly?
Thank you very much.
By default, the sampling_strategy of SMOTE is 'not majority':

'not majority': resample all classes but the majority class

So every one of your 22 attack classes gets upsampled to the size of the majority class. Since the majority class has 812814 samples and there are 23 classes in total, you end up with

812814 * 23 = 18694722

samples.
Try passing a dict with the desired number of samples for the minority classes. From the docs:
When dict, the keys correspond to the targeted classes. The values correspond to the desired number of samples for each targeted class.
Example
Adapted from the docs, in this example we upsample one of the minority classes to have the same number of samples as the majority class.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_classes=5,
                           class_sep=2,
                           weights=[0.15, 0.15, 0.1, 0.1, 0.5],
                           n_informative=4,
                           n_redundant=1,
                           flip_y=0,
                           n_features=20,
                           n_clusters_per_class=1,
                           n_samples=1000,
                           random_state=10)

# Upsample class 0 to the size of the majority class (4) and keep
# the other minority classes at their original sizes
sample_strategy = {4: 500, 0: 500, 1: 150, 2: 100, 3: 100}

sm = SMOTE(sampling_strategy=sample_strategy, random_state=0)
X_res, y_res = sm.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))
>>>
Resampled dataset shape Counter({4: 500, 0: 500, 1: 150, 3: 100, 2: 100})