Is there a better built-in way to run a grid search and test multiple models in a single pipeline? The models take different parameters, which made it complicated for me to figure this out. Here is what I did:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV
def grid_search():
    pipeline1 = Pipeline([
        ('vec2', TfidfTransformer()),  # transformers must come before the final estimator
        ('clf', RandomForestClassifier()),
    ])
    pipeline2 = Pipeline([
        ('clf', KNeighborsClassifier()),
    ])
    pipeline3 = Pipeline([
        ('clf', SVC()),
    ])
    pipeline4 = Pipeline([
        ('clf', MultinomialNB()),
    ])
    parameters1 = {
        'clf__n_estimators': [10, 20, 30],
        'clf__criterion': ['gini', 'entropy'],
        # 'log2', 'sqrt' and None are valid for max_features, not max_depth
        'clf__max_features': ['sqrt', 'log2', None],
        'clf__max_depth': [5, 10, 15],
    }
    parameters2 = {
        'clf__n_neighbors': [3, 7, 10],
        'clf__weights': ['uniform', 'distance'],
    }
    parameters3 = {
        'clf__C': [0.01, 0.1, 1.0],
        'clf__kernel': ['rbf', 'poly'],
        'clf__gamma': [0.01, 0.1, 1.0],
    }
    parameters4 = {
        'clf__alpha': [0.01, 0.1, 1.0],
    }
    pars = [parameters1, parameters2, parameters3, parameters4]
    pips = [pipeline1, pipeline2, pipeline3, pipeline4]
    print("starting Gridsearch")
    for pip, par in zip(pips, pars):
        gs = GridSearchCV(pip, par, verbose=2, refit=False, n_jobs=-1)
        gs.fit(X_train, y_train)
        print("finished Gridsearch")
        print(gs.best_score_)
However, this approach still reports the best model within each classifier separately; it does not compare across classifiers.
Instead of using grid search for hyperparameter selection, you can use the hyperopt library.
Please have a look at section 2.2 of this page. In the above case, you can use an hp.choice expression to select among the various pipelines, and then define the parameter expressions for each one separately.
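For illustration, here is a minimal sketch of such a search space. The 'type' tags and the hp labels ('rf_...', 'knn_...', etc.) are names I made up; the value lists simply mirror the grids from your question:

    from hyperopt import hp

    # The outer hp.choice picks one classifier configuration per trial;
    # each branch carries its own hyperparameter expressions.
    space = hp.choice('classifier', [
        {
            'type': 'rf',
            'n_estimators': hp.choice('rf_n_estimators', [10, 20, 30]),
            'criterion': hp.choice('rf_criterion', ['gini', 'entropy']),
        },
        {
            'type': 'knn',
            'n_neighbors': hp.choice('knn_n_neighbors', [3, 7, 10]),
            'weights': hp.choice('knn_weights', ['uniform', 'distance']),
        },
        {
            'type': 'svc',
            'C': hp.choice('svc_C', [0.01, 0.1, 1.0]),
            'kernel': hp.choice('svc_kernel', ['rbf', 'poly']),
            'gamma': hp.choice('svc_gamma', [0.01, 0.1, 1.0]),
        },
        {
            'type': 'nb',
            'alpha': hp.choice('nb_alpha', [0.01, 0.1, 1.0]),
        },
    ])

Discrete hp.choice lists keep this equivalent to your original grids; you could instead use continuous expressions such as hp.uniform or hp.loguniform for C and gamma, which is where hyperopt's TPE search pays off over an exhaustive grid.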
In your objective function, you need a check on which pipeline was chosen, and you then return the CV score for the selected pipeline and parameters (possibly via cross_val_score), as sketched below.
At the end of the run, the trials object will indicate the best pipeline and parameters overall.
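Continuing the sketch above (and assuming the space defined there plus the X_train/y_train from your question), the objective and the optimization loop might look like this; cv=5 and max_evals=50 are illustrative choices, not hyperopt requirements:

    from hyperopt import fmin, tpe, Trials, space_eval, STATUS_OK
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.model_selection import cross_val_score

    def objective(args):
        # Branch on the classifier picked by hp.choice and build it
        # with the sampled hyperparameters.
        params = {k: v for k, v in args.items() if k != 'type'}
        if args['type'] == 'rf':
            clf = RandomForestClassifier(**params)
        elif args['type'] == 'knn':
            clf = KNeighborsClassifier(**params)
        elif args['type'] == 'svc':
            clf = SVC(**params)
        else:
            clf = MultinomialNB(**params)
        # hyperopt minimizes the loss, so negate the mean CV accuracy.
        score = cross_val_score(clf, X_train, y_train, cv=5, n_jobs=-1).mean()
        return {'loss': -score, 'status': STATUS_OK}

    trials = Trials()
    best = fmin(objective, space, algo=tpe.suggest, max_evals=50, trials=trials)
    # fmin returns index-encoded choices; space_eval resolves them back
    # into the winning classifier type and parameter values.
    print(space_eval(space, best))

Every trial's loss and sampled configuration are recorded in trials, so you can also inspect trials.trials or trials.best_trial directly to compare how the different classifiers fared against each other.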