I am using sklearn.model_selection.train_test_split and sklearn.neural_network.MLPClassifier for human activity recognition. Below is my dataset in a pandas DataFrame:
             a_x       a_y       a_z       g_x       g_y       g_z activity
0       3.058150  5.524902 -7.415221  0.001280 -0.022299 -0.009420      sit
1       3.065333  5.524902 -7.422403 -0.003514 -0.023764 -0.007289      sit
2       3.065333  5.524902 -7.422403 -0.003514 -0.023764 -0.007289      sit
3       3.064734  5.534479 -7.406840 -0.016830 -0.025628 -0.003294      sit
4       3.074910  5.548246 -7.408038 -0.023488 -0.025495 -0.001963      sit
...          ...       ...       ...       ...       ...       ...      ...
246886  8.102990 -1.226492 -4.559391 -0.511287  0.081455  0.109515      run
246887  8.120349 -1.218711 -4.595306 -0.516480  0.089179  0.110047      run
246888  8.126933 -1.209732 -4.619848 -0.521940  0.096636  0.109382      run
246889  8.140102 -1.199556 -4.622840 -0.526467  0.102761  0.108183      run
246890  8.142496 -1.199556 -4.648580 -0.530728  0.109818  0.108050      run

1469469 rows × 7 columns
I am using the 6 numerical columns (x, y, z from the accelerometer and gyroscope) to predict activity (run, sit, walk). My code looks like:
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report

mlp = MLPClassifier(hidden_layer_sizes=(10,), activation='relu', solver='adam',
                    learning_rate='adaptive', early_stopping=True, learning_rate_init=.001)

# Features: the six sensor columns; target: the activity labels
X = HAR.drop(columns='activity').to_numpy()
y = HAR['activity'].to_numpy()

# Train on 10% of the rows, test on the remaining 90%
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.10)
mlp.fit(X_train, y_train)

predictions_train = mlp.predict(X_train)
predictions_test = mlp.predict(X_test)
print("Fitting of train data for size (10,): \n", classification_report(y_train, predictions_train))
print("Fitting of test data for size (10,): \n", classification_report(y_test, predictions_test))
Output is:
Fitting of train data for size (10,):
              precision    recall  f1-score   support

         run       1.00      1.00      1.00     49265
         sit       1.00      1.00      1.00     49120
        walk       1.00      1.00      1.00     48561

    accuracy                           1.00    146946
   macro avg       1.00      1.00      1.00    146946
weighted avg       1.00      1.00      1.00    146946

Fitting of test data for size (10,):
              precision    recall  f1-score   support

         run       1.00      1.00      1.00    441437
         sit       1.00      1.00      1.00    442540
        walk       1.00      1.00      1.00    438546

    accuracy                           1.00   1322523
   macro avg       1.00      1.00      1.00   1322523
weighted avg       1.00      1.00      1.00   1322523
I am relatively new to ML, but I think I understand the concept of overfitting, so I imagine that is what is happening here. What I don't understand is how the model can be overfit when it is trained on only 10% of the dataset. Also, presumably the classification report should always be perfect for the X_train data, since that is what the model was trained on, correct?
No matter what I do, it always produces a perfect classification_report for the X_test data, however little data I train on (0.10 in this case, but I've also tried 0.25, 0.33, 0.5, etc.). I even removed the gyroscope data and trained only on the accelerometer data, and it still gave a perfect 1 for every precision, recall, and F1 score.
When I arbitrarily slice the original dataset in half and use the resulting arrays as train and test data, the predictions for X_test are not perfect, but every time I use sklearn's train_test_split it returns a perfect classification report. So I assume I am doing something wrong with how I am using train_test_split?
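For reference, the slicing I mean is roughly this (a sketch; the exact split point is arbitrary, and unlike train_test_split it does not shuffle the rows first):

half = len(X) // 2
# First half for training, second half for testing, in the original row order
X_train, X_test = X[:half], X[half:]
y_train, y_test = y[:half], y[half:]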
It's quite hard to say without having access to the data to try things out.
I wonder if the class separation within the data itself is so clear that the classifier has no trouble distinguishing the activities. (It seems so just from the values you printed: the distributions look very different and well separated if you plot them. To be fair, a NN is overkill if we can clearly distinguish the different activities even by visual plotting.)
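For example, a quick check along these lines (a sketch, assuming the DataFrame is named HAR as in your code) would show how separated the classes are:

import matplotlib.pyplot as plt

# Histogram of one accelerometer axis per activity; clearly separated
# distributions mean the classes are easy to tell apart.
fig, ax = plt.subplots()
for activity, group in HAR.groupby('activity'):
    ax.hist(group['a_x'], bins=100, alpha=0.5, density=True, label=activity)
ax.set_xlabel('a_x')
ax.set_ylabel('density')
ax.legend()
plt.show()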
Have you tried smaller hidden layer sizes, say only 1 or 2 nodes, or some other simpler classifier? E.g. a decision tree with max_depth set to, say, less than 4, or just a logistic regression model (see the sketch below).
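Something like this (a sketch, reusing your X_train/X_test split):

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# If even a depth-limited tree or a plain linear model scores ~1.0,
# the classes are simply very easy to separate.
for clf in (DecisionTreeClassifier(max_depth=3), LogisticRegression(max_iter=1000)):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, accuracy_score(y_test, clf.predict(X_test)))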
Also, did you try stratifying the split: train_test_split(X, y, train_size=0.10, stratify=y)?
My guess is that it's just a very simple dataset, so the classifier does very well because the class separations are so clear. It's nothing to do with overfitting.