I have an XGBoost model with T trees, and I want to measure how it performs when only the first k trees are used. In this example, T is 500 and k is 100. I realize these values are excessive for the Iris dataset, but they serve for illustration.
I'm aware that one approach is to use early_stopping=k during training. However, rather than retraining the model, I'm looking for a way to select a predetermined number of trees from the fitted model and use them as an xgb model.
In the following example, I want to replace the placeholder <ADD tree to first_100_booster> with code that accomplishes this.
import numpy as np
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define XGBoost parameters
params = {
    'objective': 'multi:softmax',  # multiclass classification
    'num_class': 3,                # number of classes in the dataset
    'max_depth': 3,                # maximum depth of each tree
    'n_estimators': 500            # maximum number of trees to grow
}
# Train the XGBoost model
model = xgb.XGBClassifier(**params)
model.fit(X_train, y_train)
# Extract the first 100 trees
first_100_trees = model.get_booster().get_dump()[:100]
# Create a new Booster object with the first 100 trees
first_100_booster = xgb.Booster(model.get_xgb_params())
for tree in first_100_trees:
    """
    <ADD tree to first_100_booster>
    """
# Test using the first 100 trees
dtest = xgb.DMatrix(X_test)
y_pred_first_100 = first_100_booster.predict(dtest)
accuracy_first_100 = accuracy_score(y_test, y_pred_first_100)
print("Accuracy using the first 100 trees:", accuracy_first_100)
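Since v1.4, the .predict() method for xgboost scikit-learn estimators supports an argument iteration_range. This takes a tuple (begin, end) describing a contiguous, half-open range of boosting rounds, so you can use it to achieve the behavior "generate predictions from a subset of trees". Note that for a multiclass model each boosting round adds one tree per class by default, so a range of 5 rounds on a 3-class problem covers 15 individual trees.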
Consider this example with Python 3.11, xgboost==2.0.3, and scikit-learn==1.4.1.
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
# Load dataset
X, y = load_iris(return_X_y=True)
# train multiclass classifier
model = xgb.XGBClassifier(
    objective="multi:softmax",
    num_class=3,
    max_depth=3,
    n_estimators=15
)
model.fit(X, y)
# get predictions based just on the first 5 boosting rounds (iterations 0 through 4)
preds_first5_trees = model.predict(X, iteration_range=(0, 5))
accuracy_score(preds_first5_trees, y)
# 0.973
# get predictions based on all trees in model
full_preds = model.predict(X)
accuracy_score(full_preds, y)
# 0.993
With this approach, it isn't necessary to create a new Booster object to evaluate the performance implications of using different numbers of trees.