Tags: python, numpy, scikit-learn, scipy, numpy-random

How to use RandomState with Sklearn RandomizedSearchCV on multiple cores


I am puzzled about the right way to use np.random.RandomState with sklearn.model_selection.RandomizedSearchCV when running on multiple cores.

I use RandomState to generate pseudo-random numbers so that my results are reproducible. I give RandomizedSearchCV an instance of RandomState and set n_jobs=-1 so that it uses all six cores.

Running on multiple cores introduces an asynchronous element. I expect that this will cause the cores' requests for pseudo-random numbers to arrive in different orders in different runs, so different runs should give different results rather than being reproducible.

But in fact the results are reproducible. For a given value of n_iter (i.e., the number of draws from the parameter space), the best hyper-parameter values found are identical from one run to the next. I also get the same values if n_jobs is a positive number that is smaller than the number of cores.

To be specific, here is the code:

import numpy as np
import scipy.stats as stats
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, train_test_split

# Use RandomState for reproducibility.
random_state = np.random.RandomState(42)

# Get data. Split it into training and test sets.
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=random_state, stratify=y)

# Prepare for hyper-parameter optimization.
n_iter = 1_000

base_clf = GradientBoostingClassifier(
    random_state=random_state, max_features='sqrt')

param_space = {'learning_rate': stats.uniform(0.05, 0.2),
               'n_estimators': [50, 100, 200],
               'subsample': stats.uniform(0.8, 0.2)}

# Generate data folds for cross validation.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)

# Create the search classifier.
search_clf = RandomizedSearchCV(
    base_clf, param_space, n_iter=n_iter, scoring='f1_weighted', n_jobs=-1, 
    cv=skf, random_state=random_state, return_train_score=False)

# Optimize the hyper-parameters and print the best ones found.
search_clf.fit(X_train, y_train)
print('Best params={}'.format(search_clf.best_params_))

I have several questions.

  1. Why do I get reproducible results despite the asynchronous aspect?

  2. The documentation for RandomizedSearchCV says about the random_state parameter: "Pseudo random number generator state used for random uniform sampling from lists of possible values instead of scipy.stats distributions." Does this mean that it does not affect the distributions in the parameter space? Is the code above sufficient to ensure reproducibility, or do I need to set np.random.seed(), or perhaps write something like this:

    distn_learning_rate = stats.uniform(0.05, 0.2)
    distn_learning_rate.random_state = random_state
    distn_subsample = stats.uniform(0.8, 0.2)
    distn_subsample.random_state = random_state
    param_space = {'learning_rate': distn_learning_rate,
                   'n_estimators': [50, 100, 200],
                   'subsample': distn_subsample}
    
  3. Overall, is this the correct way to set up RandomizedSearchCV for reproducibility?

  4. Is using a single instance of RandomState OK, or should I use separate instances for train_test_split, GradientBoostingClassifier, StratifiedKFold, and RandomizedSearchCV? Also, the documentation for np.random.seed says that the seed is set when RandomState is initialized. How does this interact with RandomizedSearchCV setting the seed?

  5. When n_jobs is set to use fewer than all the cores, I still see activity on all the cores, though the usage level per core increases and the elapsed time decreases as the number of cores increases. Is this just sklearn and/or macOS optimizing the machine usage?

I am using macOS 10.14.2, Python 3.6.7, NumPy 1.15.4, SciPy 1.1.0, and scikit-learn 0.20.1.


Solution

  • The parameter candidates are generated before being passed to the multi-threaded code, using a ParameterSampler object. So a single random_state is enough for the reproducibility of RandomizedSearchCV.

    Note that I said "reproducibility of RandomizedSearchCV". For the estimators used inside it (base_clf here), each estimator should carry its own random_state, as you have done.
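
    As a minimal sketch of this point (my illustration; it uses the public ParameterSampler class that RandomizedSearchCV relies on internally), note that all candidates are drawn sequentially in the parent process, so the draw order cannot depend on n_jobs:

        import numpy as np
        import scipy.stats as stats
        from sklearn.model_selection import ParameterSampler

        param_space = {'learning_rate': stats.uniform(0.05, 0.2),
                       'n_estimators': [50, 100, 200]}

        # All n_iter candidates are materialized here, one after another,
        # before any parallel work begins, so re-running this always prints
        # the same candidates in the same order.
        candidates = list(ParameterSampler(param_space, n_iter=3,
                                           random_state=np.random.RandomState(42)))
        print(candidates)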

    Now, about using a single instance of RandomState: it is perfectly fine for code that runs sequentially. The only case to worry about is when multi-processing kicks in. So let's analyze the steps that happen during your program's execution.

    1. You set up a RandomState object with a seed. It now has an internal state.
    2. Inside train_test_split, a StratifiedShuffleSplit is used (because you passed the stratify param), which uses the RandomState object to split the data and generate permutations of the train and test sets. The internal state of the RandomState changes here, but this is sequential, so there is nothing to worry about.
    3. Now you set this random_state object in skf. But no splitting happens until fit() is called on RandomizedSearchCV, so the state is unchanged.
    4. After that, when search_clf.fit is called, the following happens:

      1. _run_search() is executed, which uses the random_state to generate all the parameter combinations at once (according to the given n_iter). So no multi-threading has happened yet, and everything is good.
      2. evaluate_candidates() is called. The interesting part is this:

        out = parallel(delayed(_fit_and_score)(clone(base_estimator),
                                               X, y,
                                               train=train, test=test,
                                               parameters=parameters,
                                               **fit_and_score_kwargs)
                       for parameters, (train, test)
                       in product(candidate_params,
                                  cv.split(X, y, groups)))
        
      3. The part after parallel(delayed(_fit_and_score)) is still sequential and is handled by the parent thread.

        • cv.split() will use the random_state (changing its internal state) to generate the train/test splits.
        • clone(estimator) will clone all the parameters of the estimator, including the random_state. So the state of the RandomState as changed by cv.split() becomes the base state in the estimator.
        • The above two steps happen multiple times (number of splits × number of parameter combinations) in the parent thread, without asynchronicity, and each time the original RandomState is cloned to serve the estimator. So the results are reproducible.
        • Hence, when the actual multi-threading part starts, the original RandomState is no longer used; each estimator (thread) works with its own copy of the RandomState, as the sketch below illustrates.
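
        A minimal sketch of this cloning behavior (my illustration; it relies only on sklearn.base.clone, the function called above): clone deep-copies non-estimator parameters, so each clone gets its own snapshot of the RandomState as it was at cloning time:

            import numpy as np
            from sklearn.base import clone
            from sklearn.ensemble import GradientBoostingClassifier

            rs = np.random.RandomState(42)
            clf = GradientBoostingClassifier(random_state=rs)

            clone_a = clone(clf)  # snapshot of rs in its current state
            rs.rand(10)           # advance rs, as cv.split() would
            clone_b = clone(clf)  # a later snapshot, from the advanced state

            # Each clone holds a deep copy, not the original object ...
            print(clone_a.get_params()['random_state'] is rs)   # False

            # ... and the copies reflect the state at cloning time, so the
            # two clones draw different numbers, yet every re-run of this
            # script reproduces exactly the same pair of values.
            print(clone_a.get_params()['random_state'].rand())
            print(clone_b.get_params()['random_state'].rand())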

    Hope this makes sense and answers your question. Scikit-learn explicitly asks the user to set things up like this:

    import numpy as np
    np.random.seed(42)
    

    to make the entire execution reproducible, but what you are doing will work as well.
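
    If you prefer not to thread one mutable RandomState instance through every component (your question 4), an alternative sketch (my suggestion, not from the original answer; it reuses the iris setup) is to pass plain integer seeds, so each component builds its own private RandomState and components cannot disturb each other through shared state:

        from sklearn.datasets import load_iris
        from sklearn.model_selection import StratifiedKFold, train_test_split

        X, y = load_iris(return_X_y=True)

        # Integer seeds: every component derives its own private RandomState,
        # so no component can perturb another through shared RNG state.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.4, random_state=42, stratify=y)
        skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)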

    I am not entirely sure about your last question, as I am not able to reproduce it on my system. I have 4 cores, and when I set n_jobs=2 or 3 I only see that many cores at 100%, with the rest at around 20-30%. My system specs:

    System:
        python: 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 17:14:51)  [GCC 7.2.0]
       machine: Linux-4.15.0-20-generic-x86_64-with-debian-buster-sid
    
    Python deps:
           pip: 18.1
    setuptools: 40.2.0
       sklearn: 0.20.1
         numpy: 1.15.4
         scipy: 1.1.0
        Cython: 0.29
        pandas: 0.23.4