Tags: python, python-3.x, machine-learning, scikit-learn, random-forest

Why is Random Search showing better results than Grid Search?


I'm playing with the RandomizedSearchCV function from scikit-learn. Some academic papers claim that randomized search can provide "good enough" results compared with a full grid search, while saving a lot of time.

Surprisingly, on one occasion RandomizedSearchCV gave me better results than GridSearchCV. I think GridSearchCV is supposed to be exhaustive, so its result should be at least as good as RandomizedSearchCV's, assuming they search over the same grid.
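For context, here is a rough count of how many candidate C values each search actually evaluates (using the same param list that my code below builds; n_iter is 100 in my randomized search):

# rough candidate count, using the param list built in the code below
param = [i/1000 for i in range(1, 1000)]     # 999 values: 0.001 ... 0.999
param.extend(i for i in range(1, 101))       # 100 values: 1 ... 100

print(len(param))  # 1099 candidates -> GridSearchCV fits every one of them;
                   # RandomizedSearchCV samples only n_iter=100 of them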

For the same dataset and mostly the same settings, GridSearchCV returned the following result:

Best cv accuracy: 0.7642857142857142  
Test set score:   0.725  
Best parameters:  'C': 0.02  

RandomizedSearchCV returned the following result:

Best cv accuracy: 0.7428571428571429  
Test set score:   0.7333333333333333  
Best parameters:  'C': 0.008  

To me, the test score of 0.733 is better than 0.725, and the gap between the cross-validation score and the test score is smaller for RandomizedSearchCV, which to my knowledge means less overfitting.
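Put as numbers (taken from the results above):

# gap between best CV accuracy and test-set score, from the results above
grid_gap = 0.7642857142857142 - 0.725               # ~0.039 for GridSearchCV
rand_gap = 0.7428571428571429 - 0.7333333333333333  # ~0.010 for RandomizedSearchCV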

So why did GridSearchCV return worse results?

GridSearchCV code:

from timeit import default_timer as timer   # assuming timer is timeit's default_timer

from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import LinearSVC

def linear_SVC(x, y, param, kfold):
    param_grid = {'C': param}
    k = KFold(n_splits=kfold, shuffle=True, random_state=0)
    grid = GridSearchCV(LinearSVC(), param_grid=param_grid, cv=k, n_jobs=4, verbose=1)

    return grid.fit(x, y)

# high C means more chance of overfitting
start = timer()
param = [i/1000 for i in range(1, 1000)]   # 0.001 ... 0.999
param1 = [i for i in range(1, 101)]        # 1 ... 100
param.extend(param1)

# x_train, y_train, x_test, y_test are assumed to be defined earlier
clf = linear_SVC(x=x_train, y=y_train, param=param, kfold=3)

print('LinearSVC:')
print('Best cv accuracy: {}'.format(clf.best_score_))
print('Test set score:   {}'.format(clf.score(x_test, y_test)))
print('Best parameters:  {}'.format(clf.best_params_))
print()

duration = timer() - start
print('time to run: {}'.format(duration))

RandomizedSearchCV code:

from timeit import default_timer as timer   # assuming timer is timeit's default_timer

from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.svm import LinearSVC

def Linear_SVC_Rand(x, y, param, kfold, n):
    param_grid = {'C': param}
    k = StratifiedKFold(n_splits=kfold, shuffle=True, random_state=0)
    randsearch = RandomizedSearchCV(LinearSVC(), param_distributions=param_grid, cv=k, n_jobs=4,
                                    verbose=1, n_iter=n)

    return randsearch.fit(x, y)

start = timer()
param = [i/1000 for i in range(1, 1000)]   # 0.001 ... 0.999
param1 = [i for i in range(1, 101)]        # 1 ... 100
param.extend(param1)

clf = Linear_SVC_Rand(x=x_train, y=y_train, param=param, kfold=3, n=100)

print('LinearSVC:')
print('Best cv accuracy: {}'.format(clf.best_score_))
print('Test set score:   {}'.format(clf.score(x_test, y_test)))
print('Best parameters:  {}'.format(clf.best_params_))
print()

duration = timer() - start
print('time to run: {}'.format(duration))

Solution

  • First, have a look at this: https://stats.stackexchange.com/questions/49540/understanding-stratified-cross-validation

    It explains why StratifiedKFold is generally preferable to plain KFold.

    Also note that your two searches don't use the same splitter: GridSearchCV gets KFold while RandomizedSearchCV gets StratifiedKFold, so they are scored on different folds. Use StratifiedKFold in both, leave shuffle=False (the default), and drop the random_state parameter. The data is then not shuffled, the folds are identical and reproducible across both searches, and you should get the result you expect.

    GridSearchCV code:

    def linear_SVC(x, y, param, kfold):
        param_grid = {'C':param}
        k = StratifiedKFold(n_splits=kfold)  # shuffle=False by default, no random_state
        grid = GridSearchCV(LinearSVC(), param_grid=param_grid, cv=k, n_jobs=4, verbose=1)
    
        return grid.fit(x, y)
    

    RandomizedSearchCV code:

    def Linear_SVC_Rand(x, y, param, kfold, n):
        param_grid = {'C':param}
        k = StratifiedKFold(n_splits=kfold)  # same deterministic folds as in GridSearchCV
        randsearch = RandomizedSearchCV(LinearSVC(), param_distributions=param_grid, cv=k, n_jobs=4,
                                        verbose=1, n_iter=n)
    
        return randsearch.fit(x, y)
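
    With the splitter now deterministic and identical in both functions, you can sanity-check the intuition from the question directly; a minimal sketch, assuming x_train, y_train and the param list from the question are already defined:

    grid_clf = linear_SVC(x=x_train, y=y_train, param=param, kfold=3)
    rand_clf = Linear_SVC_Rand(x=x_train, y=y_train, param=param, kfold=3, n=100)

    # the exhaustive search covers every candidate the randomized search can sample,
    # so on identical folds its best CV score cannot be lower
    print(grid_clf.best_score_ >= rand_clf.best_score_)   # expected: True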