python · machine-learning · model-comparison · hyperparameters

Hyperparameter tuning and classification algorithm comparison


I have a question about comparing classification algorithms.

I am doing a project on hyperparameter tuning and classification model comparison for a dataset. The goal is to find the best-fitting model with the best hyperparameters for my dataset.

For example: I have 2 classification models (SVM and Random Forest), and my dataset has 1000 rows and 10 columns (9 columns are features, and the last column is the label).

First of all, I split the dataset into 2 portions (80/20) for training (800 rows) and testing (200 rows), respectively. After that, I used Grid Search with CV = 10 to tune the hyperparameters of these 2 models (SVM and Random Forest) on the training set. Once the hyperparameters were identified for each model, I computed the Accuracy_score of each tuned model on the training and testing sets again in order to find out which model is the best one for my data (conditions: Accuracy_score on the training set < Accuracy_score on the testing set means no overfitting, and whichever model has the higher Accuracy_score on the testing set is the best model).
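
In code, my procedure looks roughly like this (the dummy dataset and the parameter grids are only placeholders, not the exact ones I used):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import SVC

    # stand-in for my dataset: 1000 rows, 9 feature columns, 1 label column
    X, y = make_classification(n_samples=1000, n_features=9, random_state=42)

    # 80/20 split: 800 rows for training, 200 for testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

    models = {
        "SVM": (SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]}),
        "Random Forest": (RandomForestClassifier(random_state=42),
                          {"n_estimators": [100, 300], "max_depth": [None, 5]}),
    }

    for name, (estimator, param_grid) in models.items():
        grid = GridSearchCV(estimator, param_grid, cv=10)  # 10-fold CV on the training set
        grid.fit(X_train, y_train)
        best = grid.best_estimator_
        print(name, "best params:", grid.best_params_)
        print("  train accuracy:", accuracy_score(y_train, best.predict(X_train)))
        print("  test accuracy: ", accuracy_score(y_test, best.predict(X_test)))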

However, SVM shows a training accuracy_score of 100% and a testing accuracy_score of 83.56%, which means SVM with tuned hyperparameters is overfitting. On the other hand, Random Forest shows a training accuracy_score of 72.36% and a testing accuracy_score of 81.23%. Clearly the testing accuracy_score of SVM is higher than that of Random Forest, but SVM is overfitting.

I have some questions:

_ Is my method correct when I compare accuracy_score on the training and testing sets as above, instead of using Cross-Validation? (If I should use Cross-Validation, how do I do it?)

_ It is clear that the SVM above is overfitting, but its accuracy_score on the testing set is higher than that of Random Forest. Could I conclude that SVM is the best model in this case?

Thank you!


Solution

  • It's good that you've done quite an analysis on your part to investigate the best model. However, I would suggest you elaborate on your investigation a bit. As you're searching for the best model for your data, "Accuracy" alone is not a good evaluation metric for your models. You should also evaluate your models on "Precision", "Recall", "ROC AUC", "Sensitivity", "Specificity", etc. Also find out whether your data is imbalanced (if it is, there are techniques to work around that). After evaluating all those metrics, you can come to a decision.
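
    For example, here is a minimal sketch of such an evaluation with sklearn.metrics (the SVC model and the imbalanced dummy dataset are placeholders, substitute your own):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.metrics import (accuracy_score, confusion_matrix, precision_score,
                                 recall_score, roc_auc_score)
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # placeholder dataset with some class imbalance
    X, y = make_classification(n_samples=1000, n_features=9, weights=[0.7, 0.3], random_state=0)
    print("class counts:", np.bincount(y))  # check for imbalance first

    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)
    model = SVC().fit(X_train, y_train)
    y_pred = model.predict(X_test)

    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    print("accuracy:            ", accuracy_score(y_test, y_pred))
    print("precision:           ", precision_score(y_test, y_pred))
    print("recall (sensitivity):", recall_score(y_test, y_pred))
    print("specificity:         ", tn / (tn + fp))
    print("ROC AUC:             ", roc_auc_score(y_test, model.decision_function(X_test)))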

    For the training/testing part, you're on the right track, with only one issue (which is quite severe): every time you test your model on the test set and adjust it afterwards, you're injecting a sort of bias. So I would say make 3 partitions of your data and use cross-validation on your "training set" (sklearn has what you need for this). After cross-validation, use the second partition, the "validation set", to test the generalization power of your model (its performance on unseen data); you may still change some parameters after that. Only after you've come to a conclusion and tuned everything you needed to should you use your "test set". No matter what the results on the test set are, don't change the model after that, as those scores represent the true capability of your model.

    You can create the 3 partitions of your data in the following way, for example:

    from sklearn.model_selection import train_test_split
    from sklearn.datasets import make_blobs

    # dummy dataset for example purposes
    X, y = make_blobs(n_samples=1000, centers=2, n_features=2, cluster_std=6.0)

    # first partition: "train set" (90%) and "test set" (10%)
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.9, random_state=123)

    # second partition: split the "train set" again into a smaller "train set" and a "validation set"
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, train_size=0.9, random_state=123)

    print(X_train.shape, X_test.shape, X_val.shape)  # output: (810, 2) (100, 2) (90, 2)
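
    Continuing from those partitions, one way the full workflow could look (the SVC and its parameter grid are just an example):

    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # cross-validated tuning on the training set only
    grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}, cv=10)
    grid.fit(X_train, y_train)
    print("best params:", grid.best_params_, "| mean CV accuracy:", grid.best_score_)

    # use the validation set to compare candidate models and revisit parameters
    print("validation accuracy:", accuracy_score(y_val, grid.predict(X_val)))

    # touch the test set exactly once, after every decision is final
    print("test accuracy:", accuracy_score(y_test, grid.predict(X_test)))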