I'm new to machine learning and I'm trying to predict the topic of an article from a labeled dataset in which each sample contains all the words of one article. There are 11 different topics in total, and each article has exactly one topic. I have built a processing pipeline:
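from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(XGBClassifier(objective="multi:softmax", num_class=11), n_jobs=-1)),
])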
I'm trying to use GridSearchCV to find the best hyperparameters:
from sklearn.model_selection import GridSearchCV

parameters = {'vectorizer__ngram_range': [(1, 1), (1, 2), (2, 2)],
              'tfidf__use_idf': (True, False)}
gs_clf_svm = GridSearchCV(classifier, parameters, n_jobs=-1, cv=10, scoring='f1_micro')
gs_clf_svm = gs_clf_svm.fit(X, Y)
This works fine. However, how do I tune the hyperparameters of XGBClassifier? I have tried the following notation:
parameters = {'clf__learning_rate': [0.1, 0.01, 0.001]}
It doesn't work because GridSearchCV looks for learning_rate among the hyperparameters of OneVsRestClassifier. How do I actually tune the hyperparameters of XGBClassifier? Also, which hyperparameters would you suggest tuning for my problem?
As is, the pipeline looks for a parameter learning_rate in OneVsRestClassifier, can't find one (unsurprisingly, since that class has no such parameter), and raises an error. Since you actually want the learning_rate parameter of XGBClassifier, which OneVsRestClassifier holds as its estimator parameter, you should go one level deeper, i.e.:
parameters = {'clf__estimator__learning_rate': [0.1, 0.01, 0.001]}
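If you are ever unsure how deep a parameter is nested, you can list every name the pipeline accepts with get_params(). Here is a minimal sketch of that check. To keep it runnable with scikit-learn alone, it swaps in GradientBoostingClassifier as a stand-in for XGBClassifier (my assumption, not part of your setup); both expose a learning_rate parameter, and the step__estimator__parameter naming comes from the pipeline and OneVsRestClassifier wrappers, so it should carry over unchanged.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

# Same pipeline shape as in the question; GradientBoostingClassifier is
# only a stand-in for XGBClassifier (an assumption made for this sketch).
classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(GradientBoostingClassifier())),
])

# Every key of get_params() is a name that GridSearchCV will accept.
valid_names = sorted(classifier.get_params().keys())

# Keep only the names that end in 'learning_rate' to see the full path.
learning_rate_names = [name for name in valid_names if name.endswith('learning_rate')]
print(learning_rate_names)

# The nested name can also be set directly, which is what GridSearchCV
# does internally for each candidate value.
classifier.set_params(clf__estimator__learning_rate=0.01)
```

The same lookup works for any other hyperparameter of the wrapped estimator: find its full name in valid_names and use that as the key in the parameters dictionary.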