Tags: scikit-learn, training-data, train-test-split, mlp

Can someone help explain why my MLP keeps getting a perfect classification report?


I am using sklearn.model_selection.train_test_split and sklearn.neural_network.MLPClassifier for human activity recognition. Below is my dataset in a pandas DataFrame:


a_x a_y a_z g_x g_y g_z activity
0   3.058150    5.524902    -7.415221   0.001280    -0.022299   -0.009420   sit
1   3.065333    5.524902    -7.422403   -0.003514   -0.023764   -0.007289   sit
2   3.065333    5.524902    -7.422403   -0.003514   -0.023764   -0.007289   sit
3   3.064734    5.534479    -7.406840   -0.016830   -0.025628   -0.003294   sit
4   3.074910    5.548246    -7.408038   -0.023488   -0.025495   -0.001963   sit
... ... ... ... ... ... ... ...
246886  8.102990    -1.226492   -4.559391   -0.511287   0.081455    0.109515    run
246887  8.120349    -1.218711   -4.595306   -0.516480   0.089179    0.110047    run
246888  8.126933    -1.209732   -4.619848   -0.521940   0.096636    0.109382    run
246889  8.140102    -1.199556   -4.622840   -0.526467   0.102761    0.108183    run
246890  8.142496    -1.199556   -4.648580   -0.530728   0.109818    0.108050    run
1469469 rows × 7 columns

I am using the 6 numerical columns (x, y, z from the accelerometer and the gyroscope) to predict activity (run, sit, walk). My code looks like this:

from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

mlp = MLPClassifier(hidden_layer_sizes=(10,), activation='relu', solver='adam',
                    learning_rate='adaptive', early_stopping=True, learning_rate_init=.001)

X = HAR.drop(columns='activity').to_numpy()
y = HAR['activity'].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.10)

mlp.fit(X_train, y_train)
predictions_train = mlp.predict(X_train)
predictions_test = mlp.predict(X_test)

print("Fitting of train data for size (10,): \n", classification_report(y_train, predictions_train))
print("Fitting of test data for size (10,): \n", classification_report(y_test, predictions_test))

Output is:

Fitting of train data for size (10,): 
               precision    recall  f1-score   support

         run       1.00      1.00      1.00     49265
         sit       1.00      1.00      1.00     49120
        walk       1.00      1.00      1.00     48561

    accuracy                           1.00    146946
   macro avg       1.00      1.00      1.00    146946
weighted avg       1.00      1.00      1.00    146946

Fitting of test data for size (10,): 
               precision    recall  f1-score   support

         run       1.00      1.00      1.00    441437
         sit       1.00      1.00      1.00    442540
        walk       1.00      1.00      1.00    438546

    accuracy                           1.00   1322523
   macro avg       1.00      1.00      1.00   1322523
weighted avg       1.00      1.00      1.00   1322523

I am relatively new to ML, but I think I understand the concept of overfitting, so I imagine that is what is happening here. What I don't understand is how the model can be overfit when it is only being trained on 10% of the dataset. Also, presumably the classification report should always be perfect for the X_train data, since that is what the model is trained on, correct?

No matter what I do, it always produces a perfect classification_report for the X_test data, no matter how little data I train it on (in this case 0.10, but I've also tried 0.25, 0.33, 0.5, etc.). I even removed the gyroscope data and trained it on the accelerometer data alone, and it still gave a perfect 1.00 for each precision, recall, and F1 score.

When I arbitrarily slice the original dataset in half and use the resulting arrays as train and test data, the predictions for X_test are not perfect, but every time I use sklearn.train_test_split it returns a perfect classification report. So I assume I am doing something wrong with how I am using train_test_split?
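To show what I mean, the two splits look roughly like this (a sketch of the idea, not my exact code; X and y are as defined above):

# Contiguous split: first half of the rows as train, second half as test.
half = len(X) // 2
X_train_slice, X_test_slice = X[:half], X[half:]
y_train_slice, y_test_slice = y[:half], y[half:]

# train_test_split shuffles the rows by default (shuffle=True) before splitting.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)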


Solution

  • It's quite hard to say without having access to the data to try things out.

    I wonder if, within the data itself, the class separation is so clear that the classifier has no trouble distinguishing the activities. (It seems so just from the values you printed: the distributions are very different and well separated if you plot them. To be fair, a NN is overkill if we can clearly distinguish the different activities even by visual plotting.)

    Have you tried smaller hidden layer sizes, say only 1 or 2 nodes, or some other, simpler classifier? E.g. a decision tree with max_depth set low, say to < 4, or just a logistic regression model (see the sketch below).

    Also, did you try stratifying the split so the class proportions are preserved in both train and test sets: train_test_split(X, y, train_size=0.10, stratify=y)

    My guess: it's just a very simple dataset, so the classifier is doing very well because the class separations are so clear. It's nothing to do with overfitting.
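    As a minimal sketch of the baseline experiment I'm suggesting (assuming your DataFrame is named HAR as in the question; the variable names here are just for illustration):

    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report

    X = HAR.drop(columns='activity').to_numpy()
    y = HAR['activity'].to_numpy()

    # Stratified split: keeps the run/sit/walk proportions the same
    # in the train and test sets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.10, stratify=y, random_state=0)

    # Deliberately weak baseline 1: a depth-3 decision tree.
    tree = DecisionTreeClassifier(max_depth=3)
    tree.fit(X_train, y_train)
    print("Decision tree (max_depth=3):\n",
          classification_report(y_test, tree.predict(X_test)))

    # Deliberately weak baseline 2: plain logistic regression.
    logreg = LogisticRegression(max_iter=1000)
    logreg.fit(X_train, y_train)
    print("Logistic regression:\n",
          classification_report(y_test, logreg.predict(X_test)))

    If even these near-trivial models score close to 1.00 on the test set, that's strong evidence the classes are simply well separated, not that the MLP is overfitting.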