I can't interpret the values in the leaves of a CatBoostRegressor tree. The fitted model correctly captures the logic of the dataset, but the scale of the values when I graph a tree doesn't match the scale of the actual dataset.

In this example, we predict size, which has a value around 15-30 depending on the color and age of the observation.
import random
import pandas as pd
import numpy as np
from catboost import Pool, CatBoostRegressor

# Create a fake dataset.
n = 1000
random.seed(1)
df = pd.DataFrame([[random.choice(['red', 'blue', 'green', 'yellow']),
                    random.random() * 100]
                   for i in range(n)],
                  columns=['color', 'age'])
df['size'] = np.select(
    [np.logical_and(np.logical_or(df.color == 'red', df.color == 'blue'),
                    df.age < 50),
     np.logical_or(df.color == 'red', df.color == 'blue'),
     df.age < 50,
     True],
    [np.random.normal(loc=15, size=n),
     np.random.normal(loc=20, size=n),
     np.random.normal(loc=25, size=n),
     np.random.normal(loc=30, size=n)])
# Fit a CatBoost regressor to the dataset.
pool = Pool(df[['color', 'age']], df['size'],
            feature_names=['color', 'age'], cat_features=[0])
m = CatBoostRegressor(n_estimators=10, max_depth=3, one_hot_max_size=4,
                      random_seed=1)
m.fit(pool)

# Visualize the first regression tree (saves to a pdf). Values in leaf nodes
# are not on the scale of the original dataset.
m.plot_tree(tree_idx=0, pool=pool).render('regression_tree')
The model splits on age at the right value (about 50), and it correctly learns that red and blue observations are different from green and yellow ones. The values in the leaves are ordered correctly (e.g., red/blue observations under 50 are the smallest), but the scale is completely different.

The predict() function returns values on the scale of the original dataset.
>>> df['predicted'] = m.predict(df)
>>> df.sample(n=10)
      color        age       size  predicted
676  yellow  66.305095  30.113389  30.065519
918  yellow  55.209821  29.944622  29.464825
705  yellow   1.742565  24.209283  24.913988
268    blue  76.749979  20.513211  20.019020
416    blue  59.807800  18.807197  19.949336
326     red   4.621795  14.748898  14.937314
609  yellow  99.165027  28.942243  29.823422
421   green  40.731038  26.078450  24.846742
363  yellow   2.461971  25.506517  24.913988
664     red   5.206448  16.579706  14.937314
I wondered whether there was some kind of simple normalization going on, but that's clearly not the case. For example, a red observation with age < 50 is assigned a value of -3.418 in the tree, which is nowhere near the z-score of the true value (about 15).
>>> (15 - np.mean(df['size'])) / np.std(df['size'])
-1.3476124913754326
This post asks a similar question about XGBoost. The accepted answer explains that the values should all be added to the base_score parameter; however, if there's an analogous parameter in CatBoost, I can't find it. (If the parameter goes by a different name in CatBoost, I don't know what it's called.) Moreover, the values in the CatBoost tree don't just differ from the original dataset by some constant; the difference between the largest and smallest leaf nodes is about 7, whereas the difference between the largest and smallest values of size in the original dataset is about 15.

I've looked through the CatBoost documentation without success. The "Model values" section says that the values for a regression are "A number resulting from applying the model," which suggests to me that they should be on the scale of the original dataset. (This is true of the output of predict(), so it's not clear to me whether this section applies to the plotted decision trees anyway.)
The explanation is in the CatBoost documentation; search for get_scale_and_bias:

"Return the scale and bias of the model. These values affect the results of applying the model, since the model prediction results are calculated as follows: ∑ leaf_values · scale + bias"
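In other words, the values shown in the leaves are on the model's internal raw-score scale, and predict() applies the scale and bias for you. As a rough sanity check of the formula (a minimal sketch, assuming your CatBoost version provides get_leaf_values(), get_tree_leaf_counts(), and calc_leaf_indexes()), you can sum each observation's leaf values across all trees and apply the scale and bias yourself:

# Sketch: reconstruct predict() from raw leaf values via the documented formula.
# Assumes get_leaf_values(), get_tree_leaf_counts(), and calc_leaf_indexes()
# are available in the installed CatBoost version.
scale, bias = m.get_scale_and_bias()
leaf_values = m.get_leaf_values()       # all leaf values, concatenated tree by tree
leaf_counts = m.get_tree_leaf_counts()  # number of leaves in each tree
offsets = np.concatenate([[0], np.cumsum(leaf_counts)[:-1]])
leaf_idx = m.calc_leaf_indexes(pool)    # leaf hit by each observation in each tree
manual = leaf_values[leaf_idx + offsets].sum(axis=1) * scale + bias
print(np.allclose(manual, m.predict(pool)))  # should print True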
Application to the example in the question
Here's a slightly different model fit to the same dataset (using the same code as above).
To translate the leaf values to the original data scale, use the scale and bias returned by get_scale_and_bias(). I extracted the leaves using _get_tree_leaf_values(); this function returns string representations of the leaves, so we have to do some regex parsing to get the actual values. I also hand-coded the expected value for each leaf, based on the data-generating process above.
import re

# Get the scale and bias from the model.
sb = m.get_scale_and_bias()

# Apply the scale and bias to the leaves of the tree; compare to expected
# values for each leaf.
[{'expected': [15, 25, 25, None, 20, 30, 30, None][i],
  'actual': (float(re.sub(r'^val = (-?[0-9]+([.][0-9]+)?).*$', '\\1', leaf))
             * sb[0]) + sb[1]}
 for i, leaf in enumerate(m._get_tree_leaf_values(0))]
And we see that the predicted values are not perfect, but are at least in the right ballpark.
[{'expected': 15, 'actual': 19.210155044555663},
{'expected': 25, 'actual': 24.067155044555665},
{'expected': 25, 'actual': 24.096155044555665},
{'expected': None, 'actual': 22.624155044555664},
{'expected': 20, 'actual': 21.309155044555663},
{'expected': 30, 'actual': 26.244155044555665},
{'expected': 30, 'actual': 26.249155044555664},
{'expected': None, 'actual': 22.624155044555664}]
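The remaining differences are expected: with 10 trees and a learning rate well below 1, the first tree only takes a partial step toward the target, and the later trees add corrections on top of it. One way to see this (a sketch, assuming predict() in your CatBoost version accepts the ntree_end argument for truncating the ensemble) is to compare predictions that use only the first tree against the full model:

# Sketch: predictions restricted to the first tree should match the scaled
# leaf values above, because the scale and bias are applied no matter how many
# trees are used. Assumes predict() accepts the ntree_end argument.
check = df[['color', 'age']].copy()
check['first_tree'] = m.predict(pool, ntree_end=1)
check['all_trees'] = m.predict(pool)
print(check.sample(n=5, random_state=1))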