I am puzzled about the right way to use np.random.RandomState with sklearn.model_selection.RandomizedSearchCV when running on multiple cores. I use RandomState to generate pseudo-random numbers so that my results are reproducible. I give RandomizedSearchCV an instance of RandomState and set n_jobs=-1 so that it uses all six cores.
Running on multiple cores introduces an asynchronous element. I expect that this will cause requests for pseudo-random numbers from the various cores to be made in different orders in different runs. Therefore the different runs should give different results, rather than displaying reproducibility.
But in fact the results are reproducible. For a given value of n_iter (i.e., the number of draws from the parameter space), the best hyper-parameter values found are identical from one run to the next. I also get the same values if n_jobs is a positive number that is smaller than the number of cores.
To be specific, here is code:
import numpy as np
import scipy.stats as stats
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, train_test_split
# Use RandomState for reproducibility.
random_state = np.random.RandomState(42)
# Get data. Split it into training and test sets.
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=random_state, stratify=y)
# Prepare for hyper-parameter optimization.
n_iter = 1_000
base_clf = GradientBoostingClassifier(
    random_state=random_state, max_features='sqrt')
param_space = {'learning_rate': stats.uniform(0.05, 0.2),
               'n_estimators': [50, 100, 200],
               'subsample': stats.uniform(0.8, 0.2)}
# Generate data folds for cross validation.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)
# Create the search classifier.
search_clf = RandomizedSearchCV(
    base_clf, param_space, n_iter=n_iter, scoring='f1_weighted', n_jobs=-1,
    cv=skf, random_state=random_state, return_train_score=False)
# Optimize the hyper-parameters and print the best ones found.
search_clf.fit(X_train, y_train)
print('Best params={}'.format(search_clf.best_params_))
I have several questions.
Why do I get reproducible results despite the asynchronous aspect?
The documentation for RandomizedSearchCV says about the random_state parameter: "Pseudo random number generator state used for random uniform sampling from lists of possible values instead of scipy.stats distributions." Does this mean that it does not affect the distributions in the parameter space? Is the code above sufficient to ensure reproducibility, or do I need to set np.random.seed(), or perhaps write something like this:
distn_learning_rate = stats.uniform(0.05, 0.2)
distn_learning_rate.random_state = random_state
distn_subsample = stats.uniform(0.8, 0.2)
distn_subsample.random_state = random_state
param_space = {'learning_rate': distn_learning_rate,
               'n_estimators': [50, 100, 200],
               'subsample': distn_subsample}
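(As a quick side check, separate from the search code and assuming SciPy's frozen distributions expose a settable random_state attribute, as they do in SciPy 1.1: seeding a frozen distribution does make its own draws reproducible in isolation.)
import numpy as np
import scipy.stats as stats
# Hypothetical sanity check, not part of the search script above.
distn = stats.uniform(0.05, 0.2)
distn.random_state = np.random.RandomState(0)
first = distn.rvs(3)
distn.random_state = np.random.RandomState(0)  # re-seed identically
second = distn.rvs(3)
assert np.allclose(first, second)  # rvs() draws from the attached random_state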
Overall, is this the correct way to set up RandomizedSearchCV for reproducibility? Is using a single instance of RandomState OK, or should I use separate instances for train_test_split, GradientBoostingClassifier, StratifiedKFold, and RandomizedSearchCV? Also, the documentation of np.random.seed says that the seed is set when RandomState is initialized. How does this interact with RandomizedSearchCV setting the seed?
When n_jobs is set to use fewer than all the cores, I still see activity on all the cores, though the usage level per core increases and the elapsed time decreases as the number of cores increases. Is this just sklearn and/or macOS optimizing the machine usage?
I am using macOS 10.14.2, Python 3.6.7, NumPy 1.15.4, SciPy 1.1.0, and scikit-learn 0.20.1.
The parameter candidates are generated up front, before anything is passed to the multi-threaded machinery, using a ParameterSampler object. So a single random_state is enough for the reproducibility of RandomizedSearchCV.
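You can see this in isolation with ParameterSampler, which RandomizedSearchCV uses internally. Here is a minimal sketch of my own, reusing your parameter space; it assumes sklearn passes its random_state through to the distributions' rvs(), which it does for SciPy >= 0.16:
import numpy as np
import scipy.stats as stats
from sklearn.model_selection import ParameterSampler

param_space = {'learning_rate': stats.uniform(0.05, 0.2),
               'n_estimators': [50, 100, 200],
               'subsample': stats.uniform(0.8, 0.2)}
# Two identically-seeded samplers yield the exact same candidate list,
# and all candidates are drawn before any parallel work begins.
draws1 = list(ParameterSampler(param_space, n_iter=5,
                               random_state=np.random.RandomState(42)))
draws2 = list(ParameterSampler(param_space, n_iter=5,
                               random_state=np.random.RandomState(42)))
assert draws1 == draws2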
Note that I said "reproducibility of RandomizedSearchCV". For the estimators used inside it (base_clf here), each estimator should carry its own random_state, as you have done.
Now, talking about a single instance of RandomState: it is perfectly fine for code that is sequential. The only case to worry about is when the multi-processing kicks in. So let's analyze the steps that happen during your program's execution.
1. You initialize the RandomState object with a seed. It has a state now.
2. Inside train_test_split, a StratifiedShuffleSplit is used (because you used the stratify param), which will use the passed RandomState object to split the data and generate permutations of the train and test sets. So the internal state of RandomState is changed now, but this is sequential and nothing to worry about (see the sketch after this list).
3. Now you set this random_state object in skf. But no splitting happens until fit() in RandomizedSearchCV is called, so the state is unchanged.
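As an aside on step 2 (a small sketch of my own, not from your code): drawing from a RandomState advances its internal state, while a fresh instance with the same seed replays the stream from the beginning.
import numpy as np

rs = np.random.RandomState(42)
a = rs.randint(0, 100, size=3)   # consumes the stream, advancing rs's state
b = rs.randint(0, 100, size=3)   # continues from the advanced state
assert not np.array_equal(a, b)  # the state moved on between draws

rs_fresh = np.random.RandomState(42)  # same seed, fresh state
assert np.array_equal(a, rs_fresh.randint(0, 100, size=3))  # replays the stream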
After that, when search_clf.fit is called, the following happens:
1. _run_search() is executed, which will use the random_state to generate all the parameter combinations at once (according to the given n_iter). So no multi-threading is happening yet, and everything is good.
2. After that, evaluate_candidates() is called. The interesting part is this:
out = parallel(delayed(_fit_and_score)(clone(base_estimator),
                                       X, y,
                                       train=train, test=test,
                                       parameters=parameters,
                                       **fit_and_score_kwargs)
               for parameters, (train, test)
               in product(candidate_params,
                          cv.split(X, y, groups)))
3. The part after parallel(delayed(_fit_and_score) is still sequential and handled by the parent thread:
- cv.split() will use the random_state (changing its state) to generate the train/test splits.
- clone(estimator) will clone all the parameters of the estimator, the random_state included. So the changed state of RandomState coming out of cv.split becomes the base state in the estimator.
- These two steps are repeated in the parent thread (no asynchronicity), and each time the original RandomState is cloned to serve the estimator. So the results are reproducible.
- In other words, during the actual fitting in the worker threads the original RandomState is not used; each estimator (thread) has its own copy of RandomState.
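To see that cloning behaviour in isolation, here is a minimal sketch of my own (not from your code), relying on the fact that clone() deep-copies non-estimator parameters such as a RandomState instance:
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import GradientBoostingClassifier

rs = np.random.RandomState(42)
est = GradientBoostingClassifier(random_state=rs)

clone_a = clone(est)  # deep-copies rs at its current internal state
clone_b = clone(est)  # another independent copy of the same state

# The clones hold distinct RandomState objects that start from the same
# state, so they produce identical pseudo-random sequences.
assert clone_a.random_state is not clone_b.random_state
assert clone_a.random_state.randint(1000) == clone_b.random_state.randint(1000)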
Hope this makes sense and answers your question. Scikit-learn explicitly requests that the user set things up like this:
import numpy as np
np.random.seed(42)
to make the entire execution reproducible, but what you are doing will also work.
I am not entirely sure about your last question, as I am not able to reproduce that on my system. I have 4 cores, and when I set n_jobs=2 or 3 I see only that many cores at 100%, with the rest at around 20-30%. My system specs:
System:
python: 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 17:14:51) [GCC 7.2.0]
machine: Linux-4.15.0-20-generic-x86_64-with-debian-buster-sid
Python deps:
pip: 18.1
setuptools: 40.2.0
sklearn: 0.20.1
numpy: 1.15.4
scipy: 1.1.0
Cython: 0.29
pandas: 0.23.4