Tags: scikit-learn, cross-validation, random-seed

Randomisation behaviour after cloning and fitting RandomizedSearchCV


I have a basic nested CV loop, in which an outer CV loop wraps an inner model-tuning step. My expectation is that each outer fold should draw a different random sample of hyperparameter values. However, in the example below, every fold ends up sampling the same values.

Imports and make dataset:

from sklearn.model_selection import RandomizedSearchCV, KFold, cross_validate
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.base import clone

from scipy.stats import uniform
import numpy as np

X, y = make_classification(n_features=10, random_state=np.random.RandomState(0))

Nested CV loop:

#Used for tuning the random forest:
rf_tuner = RandomizedSearchCV(
    RandomForestClassifier(random_state=np.random.RandomState(0)),
    param_distributions=dict(min_samples_split=uniform(0.1, 0.9)),
    n_iter=5,
    cv=KFold(n_splits=2, shuffle=False),
    random_state=np.random.RandomState(0),
    n_jobs=1,
)

#Nested CV
for fold, (trn_idx, tst_idx) in enumerate(KFold(3).split(X, y), start=1):
    #'cloned' will now share the same RNG as 'rf_tuner'
    cloned = clone(rf_tuner)

    #This should be consuming the RNG of 'rf_tuner'
    cloned.fit(X[trn_idx], y[trn_idx])

    #Report hyperparameter values sampled in this fold
    print(f"Fold {fold}/3:")
    print(cloned.cv_results_['params'])

    #<more code for nested CV, not shown>

Output:

Fold 1/3:
[{'min_samples_split': 0.593},
 {'min_samples_split': 0.743},
 {'min_samples_split': 0.642},
 {'min_samples_split': 0.590},
 {'min_samples_split': 0.481}]

Fold 2/3:
[{'min_samples_split': 0.593},
 {'min_samples_split': 0.743},
 {'min_samples_split': 0.642},
 {'min_samples_split': 0.590},
 {'min_samples_split': 0.481}]

Fold 3/3:
[{'min_samples_split': 0.593},
 {'min_samples_split': 0.743},
 {'min_samples_split': 0.642},
 {'min_samples_split': 0.590},
 {'min_samples_split': 0.481}]

I start by instantiating a RandomizedSearchCV over a RandomForestClassifier and set the search's random_state= to a RandomState instance, np.random.RandomState(0).

For each pass of the outer loop, I clone() and fit() the search object - cloned should thus be using the same RNG as the original, mutating it at each pass. Each loop ought to yield a different sampling of hyperparameter values. However, as shown above, the hyperparameters sampled at each pass are identical. This suggests that each loop is starting with the same unmodified RNG rather than a mutated one.

The docs say that clones of estimators share the same random state instance:

b = clone(a) [...] calling a.fit will consume b’s RNG, and calling b.fit will consume a’s RNG, since they are the same
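
To illustrate what I understand by "consuming" a shared RNG, here is a minimal sketch with a plain RandomState (not the search objects above):

import numpy as np

rng = np.random.RandomState(0)
shared = rng                   #the same object, not a copy

#Drawing through either name advances the single underlying state,
#so consecutive draws differ:
print(rng.randint(10**9))      #first draw
print(shared.randint(10**9))   #a different value - the shared RNG was consumed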

What explains the absence of randomisation between folds?


Update

The accepted answer clarifies that clone() deep-copies the RNG, so each clone starts from an identical state rather than sharing (and consuming) a single instance.

If I want randomness between folds, whilst keeping the script repeatable, one approach would be to remove random_state= from the search object and instead globally set the random seed before running the nested CV.
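
A minimal sketch of that approach, assuming the same search setup as above but with every random_state= argument dropped so that everything falls back on the global NumPy RNG:

np.random.seed(0)  #fix the global seed once, so the whole script is repeatable

rf_tuner = RandomizedSearchCV(
    RandomForestClassifier(),  #no random_state: falls back to the global RNG
    param_distributions=dict(min_samples_split=uniform(0.1, 0.9)),
    n_iter=5,
    cv=KFold(n_splits=2, shuffle=False),
    n_jobs=1,                  #no random_state on the search either
)

for trn_idx, tst_idx in KFold(3).split(X, y):
    cloned = clone(rf_tuner)
    cloned.fit(X[trn_idx], y[trn_idx])  #each fit draws fresh values from the global RNG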

I think a more canonical approach would be to instantiate an RNG with a fixed seed, rng = np.random.RandomState(0), and then in each fold set new random seeds drawn from rng:

cloned.set_params(
    estimator__random_state=rng.randint(10**9),
    random_state=rng.randint(10**9),
)
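
Put together in the outer loop, that would look roughly like this (a sketch; the bound 10**9 is arbitrary):

rng = np.random.RandomState(0)  #master RNG, seeded once up front

for trn_idx, tst_idx in KFold(3).split(X, y):
    cloned = clone(rf_tuner)
    #Fresh integer seeds per fold, reproducible because rng itself is seeded
    cloned.set_params(
        estimator__random_state=rng.randint(10**9),
        random_state=rng.randint(10**9),
    )
    cloned.fit(X[trn_idx], y[trn_idx])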

Solution

  • clone performs a deepcopy of each non-estimator parameter (source), so when a parameter is a RandomState instance each clone gets its own RandomState object, all starting from the same internal state (in the sense of get_state()). Your example is therefore the expected behaviour; see the check sketched below.

    I don't know offhand whether this used to behave differently, or whether the documentation has always been wrong on this point.
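
    A quick way to check this (a sketch, reusing the rf_tuner and imports from the question):

    a = rf_tuner
    b = clone(rf_tuner)

    #Distinct RandomState objects...
    print(a.random_state is b.random_state)   #False: clone deep-copied the instance

    #...but with identical internal state, so both will sample the same values
    sa, sb = a.random_state.get_state(), b.random_state.get_state()
    print(sa[0] == sb[0] and np.array_equal(sa[1], sb[1]) and sa[2:] == sb[2:])   #True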