Tags: xgboost, xgbclassifier, xgbregressor

How to select a subset of trees from a pretrained XGBoost model?


Given an XGBoost model with T trees, I'm exploring the performance implications of using only the first k trees. For this example, take T = 500 and k = 100. I realize these values are overkill for the Iris dataset; they're just for illustration.

I'm aware that one approach is to retrain with early stopping (or simply with n_estimators=k). However, rather than retraining the model, I'm looking for a way to select a predetermined number of trees from the already-trained model and use them as a regular XGBoost model.
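
For reference, the retraining route I'd like to avoid looks roughly like this (a sketch assuming a recent xgboost version, where early_stopping_rounds is a constructor argument and needs a held-out eval_set):

import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Early stopping decides the final number of rounds during (re)training,
# rather than letting me pick k freely after the fact.
model = xgb.XGBClassifier(
    objective="multi:softmax",
    num_class=3,
    max_depth=3,
    n_estimators=500,
    early_stopping_rounds=10,
)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)])
print(model.best_iteration)  # the boosting round early stopping settled on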

In the following example, I want to accomplish the step marked <ADD tree to first_100_booster>.

import numpy as np
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define XGBoost parameters
params = {
    'objective': 'multi:softmax',  # multiclass classification
    'num_class': 3,  # number of classes in the dataset
    'max_depth': 3,  # maximum depth of each tree
    'n_estimators': 500  # number of boosting rounds (each round grows num_class trees)
}

# Train the XGBoost model
model = xgb.XGBClassifier(**params)
model.fit(X_train, y_train)

# Extract text dumps of the first 100 trees (strings, not a usable model)
first_100_trees = model.get_booster().get_dump()[:100]

# Create a new Booster object with the first 100 trees
first_100_booster = xgb.Booster(model.get_xgb_params())
for tree in first_100_trees:
    """
    <ADD tree to first_100_booster>
    """

# Test using the first 100 trees
dtest = xgb.DMatrix(X_test)
y_pred_first_100 = first_100_booster.predict(dtest)
accuracy_first_100 = accuracy_score(y_test, y_pred_first_100)
print("Accuracy using the first 100 trees:", accuracy_first_100)


Solution

  • Since v1.4, the .predict() method of xgboost's scikit-learn estimators has supported an iteration_range argument. It takes a tuple giving a half-open range (start, end) of boosting rounds, so you can use it to achieve the behavior "generate predictions from a subset of trees" without constructing a new model.

    Consider this example with Python 3.11, xgboost==2.0.3, and scikit-learn==1.4.1.

    import xgboost as xgb
    from sklearn.datasets import load_iris
    from sklearn.metrics import accuracy_score
    
    # Load dataset
    X, y = load_iris(return_X_y=True)
    
    # train multiclass classifier
    model = xgb.XGBClassifier(
        objective="multi:softmax",
        num_class=3,
        max_depth=3,
        n_estimators=15
    )
    model.fit(X, y)
    
    # get predictions using only the first 5 boosting rounds
    # (with multi:softmax, each round adds num_class trees)
    preds_first5_trees = model.predict(X, iteration_range=(0, 5))
    accuracy_score(y, preds_first5_trees)
    # 0.973
    
    # get predictions based on all trees in the model
    full_preds = model.predict(X)
    accuracy_score(y, full_preds)
    # 0.993
    

    With this approach, it isn't necessary to create a new Booster object to evaluate the performance implications of using different numbers of trees.
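
  • If you really do need a standalone Booster holding only the first k rounds (like first_100_booster in the question), note that Booster objects in recent xgboost versions support slicing (the model-slicing feature), which avoids reassembling trees from dumps by hand. A sketch under the same setup as above; keep in mind that slicing selects whole boosting rounds, and with multi:softmax each round contains num_class trees:

    import xgboost as xgb
    from sklearn.datasets import load_iris
    from sklearn.metrics import accuracy_score

    # Load dataset and train a small multiclass model
    X, y = load_iris(return_X_y=True)
    model = xgb.XGBClassifier(
        objective="multi:softmax",
        num_class=3,
        max_depth=3,
        n_estimators=15
    )
    model.fit(X, y)

    # Slice the underlying Booster: keeps rounds [0, 5), i.e. 5 * 3 = 15 trees.
    # The result is a regular Booster that retains the original parameters.
    first_5_rounds = model.get_booster()[:5]

    # multi:softmax makes predict() return class indices directly
    preds = first_5_rounds.predict(xgb.DMatrix(X))
    print(accuracy_score(y, preds))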