I am trying to make XGBoost work with the hierarchical classifier package available here (repo archived). I can confirm the package works fine with sklearn's random forest classifier (and the other sklearn estimators I checked), but I cannot get it to work with XGBoost. I understand some modification is needed for the hierarchical classifier to work, but I cannot figure out what it is. Below is an MWE to reproduce the issue (assuming the library is installed via pip install sklearn-hierarchical-classification):
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn_hierarchical_classification.classifier import HierarchicalClassifier
from sklearn_hierarchical_classification.constants import ROOT
from sklearn_hierarchical_classification.metrics import h_fbeta_score, multi_labeled
#from sklearn_hierarchical_classification.tests.fixtures import make_digits_dataset
We want to build the following class hierarchy over the handwritten digits dataset:

        <ROOT>
       /      \
      A        B
     / \      / \
    1   7    C   9
            / \
           3   8

First, a helper to load the relevant digits:
def make_digits_dataset(targets=None, as_str=True):
    """Helper function: from sklearn_hierarchical_classification.tests.fixtures module"""
    X, y = load_digits(return_X_y=True)
    if targets:
        ix = np.isin(y, targets)
        X, y = X[np.where(ix)], y[np.where(ix)]
    if as_str:
        # Convert targets (classes) to strings
        y = y.astype(str)
    return X, y
class_hierarchy = {
    ROOT: ["A", "B"],
    "A": ["1", "7"],
    "B": ["C", "9"],
    "C": ["3", "8"],
}
Training and evaluating with the random forest base estimator (base1):

RANDOM_STATE = 42  # any fixed seed

base1 = RandomForestClassifier()
base2 = XGBClassifier()

clf = HierarchicalClassifier(
    base_estimator=base1,
    class_hierarchy=class_hierarchy,
)

X, y = make_digits_dataset(targets=[1, 7, 3, 8, 9], as_str=False)
y = y.astype(str)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE,
)

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

with multi_labeled(y_test, y_pred, clf.graph_) as (y_test_, y_pred_, graph_):
    h_fbeta = h_fbeta_score(y_test_, y_pred_, graph_)
    print("h_fbeta_score: ", h_fbeta)
h_fbeta_score: 0.9690011481056257
This works fine. But with XGBClassifier (base2) as the base estimator, it raises the following error:
Traceback (most recent call last):
File "~/hierarchical-classification.py", line 62, in <module>
clf.fit(X_train, y_train)
File "~/venv/lib/python3.10/site-packages/sklearn_hierarchical_classification/classifier.py", line 206, in fit
self._recursive_train_local_classifiers(X, y, node_id=self.root, progress=progress)
File "~/venv/lib/python3.10/site-packages/sklearn_hierarchical_classification/classifier.py", line 384, in _recursive_train_local_classifiers
self._train_local_classifier(X, y, node_id)
File "~/venv/lib/python3.10/site-packages/sklearn_hierarchical_classification/classifier.py", line 453, in _train_local_classifier
clf.fit(X=X_, y=y_)
File "~/venv/lib/python3.10/site-packages/xgboost/core.py", line 620, in inner_f
return func(**kwargs)
File "~/venv/lib/python3.10/site-packages/xgboost/sklearn.py", line 1438, in fit
or not (self.classes_ == expected_classes).all()
AttributeError: 'bool' object has no attribute 'all'
I understand this error has to do with this section of the fit() method in xgboost/sklearn.py:
1436         if (
1437             self.classes_.shape != expected_classes.shape
1438             or not (self.classes_ == expected_classes).all()
1439         ):
1440             raise ValueError(
1441                 f"Invalid classes inferred from unique values of `y`. "
1442                 f"Expected: {expected_classes}, got {self.classes_}"
1443             )
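The AttributeError itself comes from the dtype mismatch: with the numpy version used here, comparing an integer array against a string array collapses to the scalar False (with a FutureWarning) rather than an elementwise boolean array, so .all() is called on a plain bool. A quick check:

import numpy as np

expected = np.array([0, 1])    # what XGBoost derives from the number of classes
got = np.array(["A", "B"])     # what the hierarchical classifier passes in
result = expected == got       # scalar False here, not a boolean array
# result.all()                 # -> AttributeError: 'bool' object has no attribute 'all'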
XGBoost expected y to contain [0 1], but got ['A' 'B'] — the internal-node labels that the hierarchical classifier uses as targets when training the local classifier at the root. Since xgboost dropped its internal label encoder (around version 1.6), XGBClassifier requires y to already be encoded as integers 0..n_classes-1.
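The mismatch is easy to reproduce with XGBClassifier alone, no hierarchy involved. A minimal sketch (depending on the xgboost version, this raises either the AttributeError above or the ValueError from the snippet):

import numpy as np
from xgboost import XGBClassifier

X = np.random.rand(20, 4)
y = np.array(["A", "B"] * 10)  # string labels, like the hierarchy's internal nodes

XGBClassifier().fit(X, y)      # fails: classes inferred from y are not [0 1]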
There must be a way to modify sklearn_hierarchical_classification.classifier.HierarchicalClassifier so that it works with xgboost. What's the fix?
Found a workaround for this: create a wrapper around XGBClassifier that maps the string labels to integers and back, like so:
from sklearn.base import BaseEstimator

class XGBHierarchicalClassifier(BaseEstimator):
    def __init__(self, **kwargs):
        self.clf = XGBClassifier(**kwargs)
        self.label_map = {}
        self.inverse_label_map = {}
        self.label_counter = 0

    def fit(self, X, y):
        # Map string labels to numeric values
        unique_labels = np.unique(y)
        for label in unique_labels:
            if label not in self.label_map:
                self.label_map[label] = self.label_counter
                self.inverse_label_map[self.label_counter] = label
                self.label_counter += 1
        y_numeric = np.array([self.label_map[label] for label in y])
        self.clf.fit(X, y_numeric)
        self.classes_ = np.unique(y)
        return self

    def predict(self, X):
        # Predict numeric labels, then map them back to the original strings
        numeric_pred = np.argmax(self.predict_proba(X), axis=1)
        return np.array([self.inverse_label_map[label] for label in numeric_pred])

    def predict_proba(self, X):
        return self.clf.predict_proba(X)
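For reference, the same bookkeeping can be delegated to sklearn's LabelEncoder. A minimal sketch of an alternative wrapper (the class name XGBLabelEncodedClassifier is mine, not part of either package):

from sklearn.base import BaseEstimator
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

class XGBLabelEncodedClassifier(BaseEstimator):
    def fit(self, X, y):
        self.le_ = LabelEncoder()
        self.clf_ = XGBClassifier()
        self.clf_.fit(X, self.le_.fit_transform(y))  # y becomes 0..n_classes-1
        self.classes_ = self.le_.classes_            # original string labels, sorted
        return self

    def predict(self, X):
        return self.le_.inverse_transform(self.clf_.predict(X))

    def predict_proba(self, X):
        return self.clf_.predict_proba(X)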
Usage:

base3 = XGBHierarchicalClassifier()
clf = HierarchicalClassifier(
    base_estimator=base3,
    class_hierarchy=class_hierarchy,
)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
with multi_labeled(y_test, y_pred, clf.graph_) as (y_test_, y_pred_, graph_):
    h_fbeta = h_fbeta_score(y_test_, y_pred_, graph_)
    print("h_fbeta_score: ", h_fbeta)
h_fbeta_score: 0.8501118568232662