I'm using RandomizedSearchCV with RandomForestClassifier in scikit-learn. I want to make sure my results are reproducible across runs. Where should I set the random_state—in the classifier, in RandomizedSearchCV, or both?
Example code:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
clf = RandomForestClassifier()
params = {"n_estimators": [50, 100, 200]}  # example search space
search = RandomizedSearchCV(clf, param_distributions=params, n_iter=10)
What's the best practice to ensure consistent results?
You can run a simple test using the starter code from the RandomizedSearchCV examples, in which a random_state is set both in the classifier and in RandomizedSearchCV. Writing a loop with, say, 50 iterations and printing the outcome (that is, .best_params_) shows the following:

- setting random_state both in RandomizedSearchCV and in the classifier/regressor always gives the same outcome
- setting random_state in just one of them produces different outcomes across iterations.

So the conclusion is: if you need reproducibility, you must set this parameter in both places, because each of them uses its own separate random number generator.
It is also worth checking the additional information on how these numbers are used in this post, as well as the official glossary entry for random_state.
The code:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

iris = load_iris()

for i in range(50):
    logistic = LogisticRegression(solver='saga', tol=1e-2, max_iter=200, random_state=0)
    distributions = dict(C=uniform(loc=0, scale=4),
                         penalty=['l2', 'l1'])
    clf = RandomizedSearchCV(logistic, distributions, random_state=0)
    search = clf.fit(iris.data, iris.target)
    print(search.best_params_)
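The same pattern applies to the RandomForestClassifier setup from the question: seed both the estimator and the search. A minimal sketch (the search space here is illustrative, not from the question):

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

iris = load_iris()
# Hypothetical distributions; any search space works the same way.
params = {"n_estimators": randint(10, 50), "max_depth": [None, 3, 5]}

results = []
for _ in range(3):
    # random_state in the classifier fixes the trees' bootstrap/feature sampling;
    # random_state in RandomizedSearchCV fixes which candidates are drawn.
    clf = RandomForestClassifier(random_state=0)
    search = RandomizedSearchCV(clf, param_distributions=params,
                                n_iter=5, random_state=0)
    search.fit(iris.data, iris.target)
    results.append(search.best_params_)

# All runs agree because both generators are seeded.
assert all(r == results[0] for r in results)
print(results[0])
```

Dropping either of the two random_state arguments makes the assertion fail across runs, which is exactly the behavior the loop above demonstrates.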