When running hyperparameter tuning on a random forest, I sometimes want to specify a large integer range of values for integer parameters like min_samples_leaf
(e.g. ranging from the default value of 1 up to 100).
Whilst I could specify this range using scipy.stats.randint(1, 101) (randint's upper bound is exclusive), I'd prefer to use a log-uniform distribution, as my range covers two orders of magnitude. SciPy has stats.loguniform for continuous random variables, but doesn't seem to have a discretised equivalent.
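For reference, a minimal sketch of what a hand-rolled discrete analogue could look like: give each integer k in the range a weight proportional to 1/k (the discrete counterpart of loguniform's 1/x density). The helper name and the 1/k weighting below are my own construction, not SciPy API:

```python
import numpy as np

# Sketch of a hand-rolled discrete log-uniform: each integer k in
# low..high gets probability proportional to 1/k, mirroring the 1/x
# density of the continuous loguniform distribution.
def discrete_loguniform_rvs(low, high, size, rng):
    values = np.arange(low, high + 1)
    probs = 1.0 / values
    probs /= probs.sum()
    return rng.choice(values, size=size, p=probs)

rng = np.random.default_rng(0)
samples = discrete_loguniform_rvs(1, 100, size=10_000, rng=rng)
```

This draws integers directly, but it isn't a scipy.stats distribution object, so it can't be dropped into param_distributions as-is.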
A quick solution for approximating the discretised space is to just sample lots of continuous values and then convert the samples to integers:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from scipy.stats import loguniform
import numpy as np
# Draw lots of samples and discretise them, in order to approximate
# a discretised loguniform sample space
def discretised_loguniform_samples(low, high, seed=None, sample_size=100_000):
    float_rvs = loguniform(low, high).rvs(size=sample_size, random_state=seed)
    return float_rvs.round().astype(int)
Usage:
rf_param_distributions = {
    'min_samples_leaf': discretised_loguniform_samples(low=1, high=100, seed=0),
    ...
}
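For intuition on why rounding continuous draws gives the distribution I'm after: a loguniform(1, 100) sample puts roughly half its mass in each decade, so small leaf sizes (1-10) are drawn about as often as the whole range 10-100. A quick check (exact proportions vary slightly by seed):

```python
from scipy.stats import loguniform

# Rounding continuous loguniform(1, 100) draws yields integers in 1..100,
# with roughly half the mass in 1..10 (log10(10)/log10(100) = 0.5).
samples = loguniform(1, 100).rvs(size=100_000, random_state=0).round().astype(int)
low_decade_share = (samples <= 10).mean()
```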
# This will draw n_iter=10 samples from the fixed list of integers created above.
# The list from which the samples are drawn is fixed in advance and therefore
# can't exploit the randomness imparted by the consumable random_state argument.
RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=np.random.RandomState(0)),
    param_distributions=rf_param_distributions,
    n_iter=10,
    random_state=np.random.RandomState(0),
    ...
)
The downside is that defining the list of integers in advance gives me only a single static space (however large) that is fixed throughout the tuning process. I want to exploit the randomness imparted by the consumable random_state= argument of RandomizedSearchCV*, rather than being limited to a pre-defined list.
How can I modify loguniform in such a way that I get a discretised version of its samples for each call to rvs()?
*RandomizedSearchCV passes down its random_state= parameter to the distribution's rvs() method. The docs seem ambiguous on this point, stating that random_state= is "used for sampling from lists of possible values instead of scipy.stats distributions".
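One way to check this empirically is to hand ParameterSampler (which RandomizedSearchCV uses internally to draw candidates) a dummy distribution that records what its rvs() receives. The spy class below is my own construction:

```python
from sklearn.model_selection import ParameterSampler

# Dummy "distribution" that records the random_state its rvs() receives,
# to confirm the sampler passes random_state down to rvs().
class SpyDist:
    def __init__(self):
        self.seen_random_state = None

    def rvs(self, *args, random_state=None, **kwargs):
        self.seen_random_state = random_state
        return 1

spy = SpyDist()
params = list(ParameterSampler({'x': spy}, n_iter=1, random_state=0))
```

After the call, spy.seen_random_state holds the RNG the sampler passed in, confirming that a distribution object (unlike a fixed list) does consume random_state.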
The approach below simply decorates/wraps loguniform.rvs() with a float-to-int conversion:
def float_to_int(rvs):
    def rvs_wrapper(*args, **kwargs):
        return rvs(*args, **kwargs).round().astype(int)
    return rvs_wrapper

def int_loguniform(low, high):
    # Create a frozen loguniform object
    lu = loguniform(low, high)
    # Wrap its rvs() with the float-to-int decorator
    lu.rvs = float_to_int(lu.rvs)
    # Return the modified loguniform object
    return lu
Usage:
rf_param_distributions = {
    'min_samples_leaf': int_loguniform(low=1, high=100),
    ...
}

# Each iteration will consume the supplied random_state;
# we are no longer limited to drawing samples from a fixed list.
RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=np.random.RandomState(0)),
    param_distributions=rf_param_distributions,
    n_iter=10,
    random_state=np.random.RandomState(0),
    ...
)
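A quick sanity check of the wrapper (repeating the definitions so the snippet is self-contained): each rvs() call returns integers in range, and the same random_state reproduces the same draws:

```python
from scipy.stats import loguniform

def float_to_int(rvs):
    def rvs_wrapper(*args, **kwargs):
        return rvs(*args, **kwargs).round().astype(int)
    return rvs_wrapper

def int_loguniform(low, high):
    lu = loguniform(low, high)
    lu.rvs = float_to_int(lu.rvs)
    return lu

dist = int_loguniform(1, 100)
a = dist.rvs(size=5, random_state=0)
b = dist.rvs(size=5, random_state=0)  # same seed -> same integers
```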