I am trying to use PyGAD to optimize hyperparameters in ML models. According to the documentation:
The gene_space parameter customizes the space of values of each gene ... list, tuple, numpy.ndarray, or any range like range, numpy.arange(), or numpy.linspace: It holds the space for each individual gene. But this space is usually discrete. That is there is a set of finite values to select from.
As you can see, the first element of gene_space, which corresponds to solution[0] in the fitness function below, is an array of integers. According to the documentation, this should be a discrete space, which it is. However, even though this array of integers comes from np.linspace (which the docs say is fine to use), the value handed to the Random Forest Classifier is interpreted as <class 'numpy.float64'> (see the error in the third code block).
I don't understand where this change of data type occurs. Is this a PyGAD problem, and if so, how can I fix it? Or is it a numpy -> sklearn problem?
gene_space = [
    # n_estimators
    np.linspace(50, 200, 25, dtype='int'),
    # min_samples_split
    np.linspace(2, 10, 5, dtype='int'),
    # min_samples_leaf
    np.linspace(1, 10, 5, dtype='int'),
    # min_impurity_decrease
    np.linspace(0, 1, 10, dtype='float')
]
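For what it's worth, the arrays themselves are integer-typed before they ever reach PyGAD; a quick check of the dtypes (using the gene_space above):

for space in gene_space:
    print(space.dtype)

# prints an integer dtype (e.g. int64) for the first three and float64 for the last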
The definition of the Genetic Algorithm's fitness function:
def fitness_function_factory(data=data, y_name='y', sample_size=100):

    def fitness_function(solution, solution_idx):
        model = RandomForestClassifier(
            n_estimators=solution[0],
            min_samples_split=solution[1],
            min_samples_leaf=solution[2],
            min_impurity_decrease=solution[3]
        )

        X = data.drop(columns=[y_name])
        y = data[y_name]
        X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                            test_size=0.5)

        train_idx = sample_without_replacement(n_population=len(X_train),
                                               n_samples=sample_size)
        test_idx = sample_without_replacement(n_population=len(X_test),
                                              n_samples=sample_size)

        model.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
        fitness = model.score(X_test.iloc[test_idx], y_test.iloc[test_idx])
        return fitness

    return fitness_function
And the instantiation of the Genetic Algorithm:
cross_validate = pygad.GA(gene_space=gene_space,
                          fitness_func=fitness_function_factory(),
                          num_generations=100,
                          num_parents_mating=2,
                          sol_per_pop=8,
                          num_genes=len(gene_space),
                          parent_selection_type='sss',
                          keep_parents=2,
                          crossover_type="single_point",
                          mutation_type="random",
                          mutation_percent_genes=25)

cross_validate.best_solution()
>>>
ValueError: n_estimators must be an integer, got <class 'numpy.float64'>.
Any recommendations on resolving this error?
EDIT: I've tried the following, which works:
model = RandomForestClassifier(n_estimators=gene_space[0][0])
model.fit(X, y)
So the issue does not lie with numpy->sklearn but with PyGAD.
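A quick way to confirm where the cast happens (a diagnostic sketch, reusing the fitness function above) is to print the types PyGAD passes to the fitness function:

def fitness_function(solution, solution_idx):
    # Diagnostic only: with the original setup, solution[0] arrives as
    # <class 'numpy.float64'>, matching the ValueError above.
    print([type(gene) for gene in solution])
    return 0.0  # rest of the original body omitted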
There are two issues here:

First, pygad.GA does not derive the numerical type from the gene values in gene_space; it simply converts all numerical values to float. To fix this, the gene_type parameter must be used to specify the respective type of each gene:
https://pygad.readthedocs.io/en/latest/README_pygad_ReadTheDocs.html#more-about-the-gene-type-parameter
Second, numpy.linspace() does not work as documented for customizing the space of values of each gene: in my tests it produced zeros for all genes when the population was created. So it's better to use either the dict notation {"low": 50, "high": 200, "step": 25} or to convert the numpy.ndarray to a list, e.g. numpy.linspace().tolist().
gene_space
gene_space = [
    # n_estimators
    {"low": 50, "high": 200, "step": 25},
    # min_samples_split
    {"low": 2, "high": 10, "step": 5},
    # min_samples_leaf
    {"low": 1, "high": 10, "step": 5},
    # min_impurity_decrease
    np.linspace(0, 1, 10).tolist()
]
gene_type
cross_validate = pygad.GA(
    gene_space=gene_space,
    fitness_func=fitness_function_factory(),
    num_generations=100,
    num_parents_mating=2,
    sol_per_pop=8,
    num_genes=len(gene_space),
    parent_selection_type='sss',
    keep_parents=2,
    crossover_type="single_point",
    mutation_type="random",
    mutation_percent_genes=25,
    gene_type=[int, int, int, float]
)
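Note: passing a list to gene_type (one type per gene) requires a reasonably recent PyGAD release; as far as I recall, per-gene types were added in 2.14.0. Also, from PyGAD 2.20.0 onward the fitness function takes ga_instance as an additional first argument, so the two-argument signature used here applies to older releases.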
I tested it this way:
import numpy as np
import pandas as pd
import pygad
from numpy.random import default_rng
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.utils.random import sample_without_replacement
gene_space = [
    # n_estimators
    {"low": 50, "high": 200, "step": 25},
    # min_samples_split
    {"low": 2, "high": 10, "step": 5},
    # min_samples_leaf
    {"low": 1, "high": 10, "step": 5},
    # min_impurity_decrease
    np.linspace(0, 1, 10).tolist()
]
rng = default_rng()
n = 1000
data = pd.DataFrame({"x_1": rng.standard_normal(n), "x_2": rng.standard_normal(n), "y": rng.integers(0, 2, n)})
def fitness_function_factory(data=data, y_name='y', sample_size=100):

    def fitness_function(solution, solution_idx):
        model = RandomForestClassifier(
            n_estimators=solution[0],
            min_samples_split=solution[1],
            min_samples_leaf=solution[2],
            min_impurity_decrease=solution[3]
        )

        X = data.drop(columns=[y_name])
        y = data[y_name]
        X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                            test_size=0.5)

        train_idx = sample_without_replacement(n_population=len(X_train),
                                               n_samples=sample_size)
        test_idx = sample_without_replacement(n_population=len(X_test),
                                              n_samples=sample_size)

        model.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
        fitness = model.score(X_test.iloc[test_idx], y_test.iloc[test_idx])
        return fitness

    return fitness_function
cross_validate = pygad.GA(
    gene_space=gene_space,
    fitness_func=fitness_function_factory(),
    num_generations=100,
    num_parents_mating=2,
    sol_per_pop=8,
    num_genes=len(gene_space),
    parent_selection_type='sss',
    keep_parents=2,
    crossover_type="single_point",
    mutation_type="random",
    mutation_percent_genes=25,
    gene_type=[int, int, int, float]
)
print(cross_validate.best_solution())
(array([75, 2, 1, 0.5555555555555556], dtype=object), 0.5, 3)
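Here best_solution() returns a tuple of (best solution, its fitness value, its index within the population): the GA settled on n_estimators=75, min_samples_split=2, min_samples_leaf=1 and min_impurity_decrease of about 0.556, with a fitness (accuracy) of 0.5, which is expected given that the labels in this synthetic dataset are random.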