I'm working on a classification problem using scikit-learn's `RandomForestClassifier`. I tried `RandomizedSearchCV` for hyperparameter tuning, but the results were worse than when I set the parameters manually based on intuition and trial and error.
Here's a simplified version of my code:
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

clf = RandomForestClassifier(random_state=42)
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=10, cv=5, scoring='accuracy')
random_search.fit(X_train, y_train)
```
In multiple runs, this approach yields models with lower accuracy on my test set than my manually-tuned model.
What are common pitfalls when using `RandomizedSearchCV`?
How can I ensure reproducibility and robustness of the tuning process?
`RandomizedSearchCV` can give worse results than manual tuning for a few common reasons:

- **Too few iterations** – `n_iter=10` may not explore enough parameter combinations.
- **Poor parameter grid** – Your grid might miss optimal values or be too coarse; sampling from distributions helps (see the second sketch below).
- **Inconsistent random seeds** – Different runs can yield different results if `random_state` isn't set on the search itself (see the first sketch below).
- **Improper CV splits** – Use `StratifiedKFold` so every fold keeps the class proportions balanced.
- **Wrong scoring metric** – Make sure `scoring` aligns with your real objective (e.g., `accuracy`, `f1`).
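Here's a minimal sketch that folds these fixes into your snippet: a fixed seed on the search, a seeded stratified splitter, more iterations, and an explicit metric. It assumes `X_train` and `y_train` from your code, and `f1_macro` is only a placeholder for whatever metric actually matches your objective:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

param_dist = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Explicit, seeded CV splitter: stratification keeps class proportions
# consistent across folds; shuffle + random_state make the folds repeatable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=50,             # explore more of the 108-combination grid than 10 draws
    cv=cv,
    scoring="f1_macro",    # placeholder: use the metric you actually care about
    random_state=42,       # makes the sampled parameter combinations reproducible
    n_jobs=-1,
)
random_search.fit(X_train, y_train)
print(random_search.best_params_)
print(random_search.best_score_)
```

Setting `random_state` on both the estimator and the search (and on the splitter) is what makes repeated runs comparable, which covers the reproducibility part of your question.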
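To make the grid less coarse, `RandomizedSearchCV` also accepts scipy distributions, so the search can sample anywhere in a range rather than from a handful of hand-picked values. The ranges below are illustrative, not tuned for your data:

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Integer distributions instead of short fixed lists (ranges are illustrative).
param_dist = {
    "n_estimators": randint(100, 500),
    "max_depth": [None, 10, 20, 30, 50],
    "min_samples_split": randint(2, 20),
    "min_samples_leaf": randint(1, 10),
    "max_features": ["sqrt", "log2", None],
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=100,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="f1_macro",
    random_state=42,
    n_jobs=-1,
)
random_search.fit(X_train, y_train)
```

Finally, evaluate `random_search.best_estimator_` on your held-out test set only once, after tuning, so the comparison with your manually-tuned model stays fair.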