python, machine-learning, scikit-learn, xgboost, hierarchical

How to make XGBoost work in a hierarchical classifier


I am trying to make XGBoost work with the hierarchical classifier package available here (repo archived).

I can confirm the module works fine with sklearn's random forest classifier (and the other sklearn estimators I checked). But I cannot get it to work with XGBoost. I understand some modification is needed for the hierarchical classifier to work with it, but I cannot figure out what that modification is.

Below is an MWE to reproduce the issue (assuming the library is installed via pip install sklearn-hierarchical-classification):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn_hierarchical_classification.classifier import HierarchicalClassifier
from sklearn_hierarchical_classification.constants import ROOT
from sklearn_hierarchical_classification.metrics import h_fbeta_score, multi_labeled
#from sklearn_hierarchical_classification.tests.fixtures import make_digits_dataset

We want to build the following class hierarchy along with data from the handwritten digits dataset:

         <ROOT>
          /   \
         A     B
       /  \   /  \
      1   7  C    9
            / \
           3   8

Like so:

def make_digits_dataset(targets=None, as_str=True):
    """Helper function: from sklearn_hierarchical_classification.tests.fixtures module """
    X, y = load_digits(return_X_y=True)
    if targets:
        ix = np.isin(y, targets)
        X, y = X[np.where(ix)], y[np.where(ix)]

    if as_str:
        # Convert targets (classes) to strings
        y = y.astype(str)

    return X, y

class_hierarchy = {
    ROOT: ["A", "B"],
    "A": ["1", "7"],
    "B": ["C", "9"],
    "C": ["3", "8"],
    }

Then set up the classifier:

base1 = RandomForestClassifier()
base2 = XGBClassifier()

clf = HierarchicalClassifier(
    base_estimator=base1,
    class_hierarchy=class_hierarchy,
    )

X, y = make_digits_dataset(targets=[1, 7, 3, 8, 9],
                            as_str=False, )
y = y.astype(str)

RANDOM_STATE = 42  # any fixed seed

X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=0.2, random_state=RANDOM_STATE, )

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

with multi_labeled(y_test, y_pred, clf.graph_) as (y_test_, y_pred_, graph_):
  h_fbeta = h_fbeta_score(
      y_test_, y_pred_, graph_, )

print("h_fbeta_score: ", h_fbeta)
h_fbeta_score:  0.9690011481056257

This works fine. But with XGBClassifier (base2) as the base estimator, fit() raises the following error:

Traceback (most recent call last):
  File "~/hierarchical-classification.py", line 62, in <module>
    clf.fit(X_train, y_train)
  File "~/venv/lib/python3.10/site-packages/sklearn_hierarchical_classification/classifier.py", line 206, in fit
    self._recursive_train_local_classifiers(X, y, node_id=self.root, progress=progress)
  File "~/venv/lib/python3.10/site-packages/sklearn_hierarchical_classification/classifier.py", line 384, in _recursive_train_local_classifiers
    self._train_local_classifier(X, y, node_id)
  File "~/venv/lib/python3.10/site-packages/sklearn_hierarchical_classification/classifier.py", line 453, in _train_local_classifier
    clf.fit(X=X_, y=y_)
  File "~/venv/lib/python3.10/site-packages/xgboost/core.py", line 620, in inner_f
    return func(**kwargs)
  File "~/venv/lib/python3.10/site-packages/xgboost/sklearn.py", line 1438, in fit
    or not (self.classes_ == expected_classes).all()
AttributeError: 'bool' object has no attribute 'all'

I understand this error has to do with this section of the fit() method in xgboost/sklearn.py:

1436            if (
1437                self.classes_.shape != expected_classes.shape
1438                or not (self.classes_ == expected_classes).all()
1439            ):
1440                raise ValueError(
1441                    f"Invalid classes inferred from unique values of `y`.  "
1442                    f"Expected: {expected_classes}, got {self.classes_}"
1443                )

The expected values of y are [0 1], but XGBoost got ['A' 'B'] (the internal nodes of the hierarchy). There must be a way to modify the HierarchicalClassifier class in sklearn_hierarchical_classification/classifier.py so that it works with xgboost.
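As for why this surfaces as an AttributeError rather than xgboost's intended ValueError: with the NumPy version in play here, comparing a string array against an integer array falls back to a plain scalar False (with a FutureWarning), and a bare bool has no .all() method. A minimal sketch of what happens inside that check:

import numpy as np

classes_ = np.array(['A', 'B'])   # what XGBClassifier infers from y at the root node
expected_classes = np.arange(2)   # what XGBoost requires: [0 1]

# The shapes match, so the second condition is evaluated; the cross-dtype
# comparison collapses to a scalar bool instead of a boolean array:
cmp = classes_ == expected_classes
print(cmp)     # False (a plain bool, not an ndarray)
# cmp.all()    # AttributeError: 'bool' object has no attribute 'all'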

What's the fix for this?


Solution

  • Found a workaround for this. You have to create a wrapper around XGBClassifier that maps the string labels to integers, like so:

    import numpy as np

    from sklearn.base import BaseEstimator
    from xgboost import XGBClassifier
    
    class XGBHierarchicalClassifier(BaseEstimator):
        def __init__(self, **kwargs):
            self.clf = XGBClassifier(**kwargs)
            # Mappings between the original string labels and the
            # consecutive integers 0..n-1 that XGBoost expects
            self.label_map = {}
            self.inverse_label_map = {}
            self.label_counter = 0
    
        def fit(self, X, y):
            # Map string labels to numeric values
            unique_labels = np.unique(y)
            for label in unique_labels:
                if label not in self.label_map:
                    self.label_map[label] = self.label_counter
                    self.inverse_label_map[self.label_counter] = label
                    self.label_counter += 1
            
            y_numeric = np.array([self.label_map[label] for label in y])
    
            self.clf.fit(X, y_numeric)
            self.classes_ = np.unique(y)
            return self
    
        def predict(self, X):
            # argmax over the class probabilities gives the integer label,
            # which is then mapped back to the original string label
            return np.array([self.inverse_label_map[label]
                             for label in np.argmax(self.predict_proba(X), axis=1)])

        def predict_proba(self, X):
            # Delegate probability estimates to the wrapped XGBClassifier
            return self.clf.predict_proba(X)
    

    Usage:

    base3 = XGBHierarchicalClassifier()
    
    clf = HierarchicalClassifier(
        base_estimator=base3,
        class_hierarchy=class_hierarchy,
        )

    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    
    with multi_labeled(y_test, y_pred, clf.graph_) as (y_test_, y_pred_, graph_):
      h_fbeta = h_fbeta_score(
          y_test_, y_pred_, graph_, )
    
    print("h_fbeta_score: ", h_fbeta)
    h_fbeta_score:  0.8501118568232662
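    For what it's worth, the same label mapping can be delegated to sklearn's LabelEncoder, which shortens the wrapper. A minimal sketch along the same lines (the class name is my own invention):

    from sklearn.base import BaseEstimator, ClassifierMixin
    from sklearn.preprocessing import LabelEncoder
    from xgboost import XGBClassifier

    class XGBLabelEncodedClassifier(BaseEstimator, ClassifierMixin):
        def fit(self, X, y):
            # Encode string labels as the consecutive integers XGBoost expects
            self.le_ = LabelEncoder()
            self.clf_ = XGBClassifier()
            self.clf_.fit(X, self.le_.fit_transform(y))
            self.classes_ = self.le_.classes_
            return self

        def predict(self, X):
            # Map integer predictions back to the original string labels
            return self.le_.inverse_transform(self.clf_.predict(X))

        def predict_proba(self, X):
            return self.clf_.predict_proba(X)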