python machine-learning classification oversampling smote

Before oversampling, label counts show 0


I am working on my dataset and am quite new to this. Below is the code:

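from sklearn.model_selection import train_test_split

# df is the DataFrame loaded earlier; 'Creditability' is the target column
class_col_name = 'Creditability'

feature_names = df.columns[df.columns != class_col_name]
# 70% training and 30% test
X_train, X_test, y_train, y_test = train_test_split(
    df.loc[:, feature_names], df[class_col_name], test_size=0.3, random_state=1)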
print("Number transactions X_train dataset: ", X_train.shape) 
print("Number transactions y_train dataset: ", y_train.shape) 
print("Number transactions X_test dataset: ", X_test.shape) 
print("Number transactions y_test dataset: ", y_test.shape) 

print("Before OverSampling, counts of label '1': {}".format(sum(y_train == 1))) 
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train == 0))) 

I am trying to apply oversampling to my dataset, but when I count the labels before oversampling, the output shows 0 for both classes, even though the shapes show that the dataset does contain data.

Below is the output:

Number transactions X_train dataset:  (700, 20)
Number transactions y_train dataset:  (700,)
Number transactions X_test dataset:  (300, 20)
Number transactions y_test dataset:  (300,)
Before OverSampling, counts of label '1': 0
Before OverSampling, counts of label '0': 0 

I am trying to understand why both counts are 0 and how to fix it.


Solution

  • You might want to confirm that the possible class labels are in fact 0 and 1. You could try

    print(y_train.unique())
    

    to check what the class labels are.

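    If y_train is a pandas Series whose labels are the integers 0 and 1, then I believe the two counts printed by the last two lines should add up to the size of y_train (700 here). If the labels are anything else, for example the strings '0' and '1' or a different coding such as 1 and 2, then neither comparison matches any row, which would explain why both counts are 0.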
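To illustrate how a label-type mismatch produces a count of 0, here is a minimal sketch using only the standard library. The list `sample_labels` is a hypothetical stand-in for `y_train` (the real labels in the question are unknown), and `describe_labels` is a helper name made up for this example. It assumes the labels were read in as strings, which is only one possible cause.

```python
from collections import Counter


def describe_labels(labels):
    """Count each distinct label together with the name of its Python type."""
    return Counter((type(v).__name__, v) for v in labels)


# Hypothetical stand-in for y_train: labels stored as strings, not integers.
sample_labels = ["1", "0", "1", "1", "0"]

# Comparing against the integer 1 matches nothing, so the count is 0.
count_int = sum(v == 1 for v in sample_labels)

# Comparing against the string "1" matches the rows as intended.
count_str = sum(v == "1" for v in sample_labels)

print(describe_labels(sample_labels))
print(count_int, count_str)
```

The helper accepts any iterable, so `describe_labels(y_train)` should also work on the Series from the question. With pandas, `y_train.value_counts()` and `y_train.dtype` give similar information. Once the actual labels are known, compare against those values (or convert the column to integers) before counting.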