scikit-learn, dataset, data-science, random-forest

How to optimise hyperparameters for RandomForestClassifier in Python for large datasets?


I'm working on a problem where I thought RandomForestClassifier from scikit-learn would be a good fit for a large dataset, but after experimenting with it I'm not getting accurate results. The model either overfits or underperforms, and the training time sometimes runs on for hours.

The dataset has 500,000 samples and 50 features, and my goal is to classify the data into 3 categories.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load dataset
X = ...  # Features
y = ...  # Labels

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the classifier
rf = RandomForestClassifier(random_state=42)

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

# Grid search
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)

# Best parameters and model
best_params = grid_search.best_params_
best_rf = grid_search.best_estimator_

# Predictions and accuracy
y_pred = best_rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Best Parameters: {best_params}")
print(f"Accuracy: {accuracy}")

I have tried manual hyperparameter tuning, grid search for a more systematic approach, and randomised search, but the results have been inconsistent.

I'd appreciate help improving accuracy, training time, and the tuning strategy. Thanks in advance.


Solution

  • To optimise hyperparameters for large datasets, use techniques like Bayesian optimisation or evolutionary algorithms.

    These explore the parameter space with a fixed evaluation budget instead of an exhaustive grid search, which is what keeps tuning tractable at 500,000 samples.

    You can also use frameworks like Dask or Spark to distribute the work across cores or machines and cut training time. Minimal sketches of both ideas follow below.
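    As an illustration only, here is a minimal Bayesian-optimisation sketch using the third-party scikit-optimize package (BayesSearchCV). The package itself, the search-space bounds, and the n_iter budget are assumptions for illustration, not values from your setup:

    from skopt import BayesSearchCV          # third-party: scikit-optimize (assumed installed)
    from skopt.space import Integer, Categorical
    from sklearn.ensemble import RandomForestClassifier

    # Search space: bounds are illustrative assumptions, not tuned values
    search_spaces = {
        'n_estimators': Integer(100, 500),
        'max_depth': Integer(10, 40),
        'min_samples_split': Integer(2, 10),
        'min_samples_leaf': Integer(1, 4),
        'bootstrap': Categorical([True, False]),
    }

    bayes_search = BayesSearchCV(
        estimator=RandomForestClassifier(random_state=42),
        search_spaces=search_spaces,
        n_iter=30,            # fixed budget of parameter settings to evaluate
        cv=3,
        scoring='accuracy',
        n_jobs=-1,
        random_state=42,
        verbose=2,
    )
    bayes_search.fit(X_train, y_train)

    print(bayes_search.best_params_)
    print(bayes_search.best_score_)

    Unlike GridSearchCV, which fits every one of the 162 combinations in your grid, this evaluates only n_iter candidates and uses the results so far to pick the next ones.

    If you go the Dask route, one common pattern (assuming dask.distributed is installed and a cluster or local client is running) is to dispatch the individual cross-validation fits to Dask workers through joblib's backend:

    from dask.distributed import Client
    import joblib

    client = Client()  # starts a local Dask cluster here; point it at a remote scheduler for real clusters

    # Run the CV fits on Dask workers instead of local processes
    with joblib.parallel_backend('dask'):
        bayes_search.fit(X_train, y_train)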