python scikit-learn data-science

How to set a fixed random state in RandomizedSearchCV?


I'm using RandomizedSearchCV with RandomForestClassifier in scikit-learn. I want to make sure my results are reproducible across runs. Where should I set the random_state—in the classifier, in RandomizedSearchCV, or both?

Example code:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

clf = RandomForestClassifier()
search = RandomizedSearchCV(clf, param_distributions=params, n_iter=10)

What's the best practice to ensure consistent results?


Solution

  • You can run a simple test, using the example from the RandomizedSearchCV documentation as starter code. In that code, random_state is set in both the classifier and RandomizedSearchCV. Looping, say, 50 times and printing .best_params_ on each iteration gives the same best parameters every time.

    So if you need reproducibility, set random_state in both places, because the estimator and the search each use their own random number generator.

    It is also worth reading this post for more on which seed values to use, as well as the official glossary entry on random_state.

    The code:

    from scipy.stats import uniform
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import RandomizedSearchCV

    iris = load_iris()
    for i in range(50):
        # Seed the estimator: the saga solver shuffles the data internally.
        logistic = LogisticRegression(solver='saga', tol=1e-2, max_iter=200,
                                      random_state=0)

        # Search space: C is drawn from uniform(0, 4), penalty from a list.
        distributions = dict(C=uniform(loc=0, scale=4),
                             penalty=['l2', 'l1'])

        # Seed the search: this controls which candidates are sampled.
        clf = RandomizedSearchCV(logistic, distributions, random_state=0)

        search = clf.fit(iris.data, iris.target)
        print(search.best_params_)
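
The same idea applied to the setup in the question might look like the sketch below. It is only an illustration: the search space in params, the SEED value, the run_search helper, and the small n_iter and cv settings are all assumptions made for this example, not something taken from the question. It runs the search twice with random_state set in both RandomForestClassifier and RandomizedSearchCV and compares the outcomes.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

SEED = 0  # arbitrary fixed seed (assumption for this sketch)

X, y = load_iris(return_X_y=True)

# Hypothetical search space, chosen small so the example runs quickly.
params = {
    "n_estimators": randint(10, 50),
    "max_depth": randint(2, 6),
}


def run_search():
    """Run one randomized search with both random generators seeded."""
    # Seed 1: controls bootstrap samples and feature subsets in the forest.
    clf = RandomForestClassifier(random_state=SEED)
    # Seed 2: controls which parameter candidates are sampled.
    search = RandomizedSearchCV(
        clf,
        param_distributions=params,
        n_iter=5,
        cv=3,
        random_state=SEED,
    )
    return search.fit(X, y)


first = run_search()
second = run_search()

print(first.best_params_)
print("same candidates:", first.cv_results_["params"] == second.cv_results_["params"])
print("same best params:", first.best_params_ == second.best_params_)
```

If you drop random_state from either object, the two runs are no longer guaranteed to agree, which is an easy way to check which seed matters for your own estimator.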