I start with a basic Logistic Regression, using all default hyper-parameters, and get a score of 0.8855.
Next I run a RandomizedSearchCV to find the best hyper-parameters; according to the search, C=10 with max_iter=110 gives a score of 0.89.
I then fit the logistic regression with these hyper-parameters, but get a noticeably better accuracy: 0.91!
Why am I not getting exactly the same number?
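For context, here is a minimal sketch of the workflow described above. The dataset, parameter ranges, and variable names are illustrative assumptions, not the original setup:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the original data (assumption).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: all default hyper-parameters.
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("baseline accuracy:", baseline.score(X_test, y_test))

# Randomized search over C and max_iter (illustrative ranges).
param_dist = {"C": [0.01, 0.1, 1, 10, 100], "max_iter": [100, 110, 200]}
search = RandomizedSearchCV(LogisticRegression(), param_dist,
                            n_iter=10, cv=5, random_state=0)
random_result = search.fit(X_train, y_train)
print("best CV score:", random_result.best_score_)
print("best params:", random_result.best_params_)
```

The point of the question is that `random_result.best_score_` and the accuracy of a model refit with the best parameters are not the same number.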
You will generally not get exactly the same accuracy when you evaluate on your training set, because of how k-fold cross-validation scores a particular set of hyper-parameters: the data is divided into k sets, k-1 of them are used for training, and the model is validated on the one held-out set. This process is repeated k times, each time holding out a different set for validation. Finally, the average of the k validation scores is reported as the accuracy, and that average is what you got in random_result.best_score_. The figure below illustrates the process.
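The averaging described above can be sketched directly. The toy dataset and variable names below are assumptions for illustration; the manual loop reproduces what `cross_val_score` does internally:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
kf = KFold(n_splits=3, shuffle=True, random_state=0)

# Manual k-fold: train on k-1 folds, validate on the held-out fold.
fold_scores = []
for train_idx, val_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[val_idx], y[val_idx]))

# The reported accuracy is the mean of the k validation scores.
manual_mean = np.mean(fold_scores)

# cross_val_score runs the same loop internally.
cv_mean = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf).mean()
print(manual_mean, cv_mean)
```

With the same `KFold` splitter and a deterministic solver, the two means coincide.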
After finding the best set of hyper-parameters, the model is refit on the entire training data (set 1, set 2 and set 3 together). Some variation is therefore expected: the training data has changed, and you are now evaluating on the full training set, which the model has already seen, rather than on held-out folds. So what you observe is normal and the usual behavior.
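A short sketch of why the two numbers differ (illustrative dataset and parameter grid, assumed): `best_score_` is the mean of held-out-fold accuracies, while the refit estimator is trained on all of the data and then scored on that same data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=800, n_features=15, random_state=1)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.1, 1, 10]},   # toy parameter grid (assumption)
    n_iter=3, cv=5, random_state=1,
)
random_result = search.fit(X, y)

# Mean held-out-fold accuracy for the best hyper-parameters.
cv_score = random_result.best_score_

# The refit estimator, trained on ALL of X, scored on the same X:
# it is evaluated on data it has seen, so it is typically higher
# than the cross-validated mean.
train_score = random_result.best_estimator_.score(X, y)
print(cv_score, train_score)
```

The gap between `cv_score` and `train_score` is the same kind of gap the question observed between 0.89 and 0.91.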