Tags: python, decision-tree, catboost, catboostregressor

What is the scale of the leaf values in a CatBoostRegressor tree?


The puzzle

I can't interpret the values in the leaves of a CatBoostRegressor tree. The fitted model correctly captures the logic of the dataset, but the scale of the values when I graph a tree doesn't match the scale of the actual dataset.

In this example, we predict size, which has a value around 15-30 depending on the color and age of the observation.

import random
import pandas as pd
import numpy as np
from catboost import Pool, CatBoostRegressor

# Create a fake dataset.
n = 1000
random.seed(1)
df = pd.DataFrame([[random.choice(['red', 'blue', 'green', 'yellow']),
                    random.random() * 100]
                   for i in range(n)],
                  columns=['color', 'age'])
df['size'] = np.select([np.logical_and(np.logical_or(df.color == 'red',
                                                     df.color == 'blue'),
                                       df.age < 50),
                        np.logical_or(df.color == 'red',
                                      df.color == 'blue'),
                        df.age < 50,
                        True],
                       [np.random.normal(loc=15, size=n),
                        np.random.normal(loc=20, size=n),
                        np.random.normal(loc=25, size=n),
                        np.random.normal(loc=30, size=n)])

# Fit a CatBoost regressor to the dataset.
pool = Pool(df[['color', 'age']], df['size'],
            feature_names=['color', 'age'], cat_features=[0])
m = CatBoostRegressor(n_estimators=10, max_depth=3, one_hot_max_size=4,
                      random_seed=1)
m.fit(pool)

# Visualize the first regression tree (saves to a pdf).  Values in leaf nodes
# are not on the scale of the original dataset.
m.plot_tree(tree_idx=0, pool=pool).render('regression_tree')

[Plot of the first regression tree rendered by plot_tree()]

The model splits on age at the right value (about 50), and it correctly learns that red and blue observations are different from green and yellow ones. The values in the leaves are ordered correctly (e.g., red/blue observations under 50 are the smallest), but the scale is completely different.

The predict() function returns values on the scale of the original dataset.

>>> df['predicted'] = m.predict(df)
>>> df.sample(n=10)
      color        age       size  predicted
676  yellow  66.305095  30.113389  30.065519
918  yellow  55.209821  29.944622  29.464825
705  yellow   1.742565  24.209283  24.913988
268    blue  76.749979  20.513211  20.019020
416    blue  59.807800  18.807197  19.949336
326     red   4.621795  14.748898  14.937314
609  yellow  99.165027  28.942243  29.823422
421   green  40.731038  26.078450  24.846742
363  yellow   2.461971  25.506517  24.913988
664     red   5.206448  16.579706  14.937314

What I've tried

I wondered whether there was some kind of simple normalization going on, but that's clearly not the case. For example, a red observation with age < 50 is assigned a value of -3.418 in the tree, which is nowhere near the z-score of the true value (about 15).

>>> (15 - np.mean(df['size'])) / np.std(df['size'])
-1.3476124913754326

This post asks a similar question about XGBoost. The accepted answer explains that the leaf values should all be added to the base_score parameter; however, I can't find an analogous parameter in CatBoost (if one exists under a different name, I don't know what it's called). Moreover, the values in the CatBoost tree don't just differ from the original dataset by some constant: the difference between the largest and smallest leaf values is about 7, whereas the difference between the largest and smallest values of size in the original dataset is about 15.

I've looked through the CatBoost documentation without success. The "Model values" section says that the values for a regression are "A number resulting from applying the model," which suggests to me that they should be on the scale of the original dataset. (This is true of the output of predict(), so it's not clear to me whether this section applies to the plotted decision trees anyway.)


Solution

  • CatBoost provides the function get_scale_and_bias(), which returns the scale and bias of the model.

    These values affect the results of applying the model, since the model prediction is calculated as: prediction = (∑ leaf_values) · scale + bias

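    For example, get_scale_and_bias() returns a (scale, bias) tuple. With a plain RMSE regressor like this one the scale should be 1.0 and the bias should be close to the mean of the target (roughly 22.5 for this dataset), but it's worth printing them rather than assuming:

    # Scale and bias used to map raw leaf sums to the target scale.
    scale, bias = m.get_scale_and_bias()
    print(scale, bias)        # expect roughly (1.0, mean of df['size'])
    print(df['size'].mean())  # compare with the bias
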
    Application to the example in the question

    Here's a slightly different model fit to the same dataset (using the same code as above).

    [Plot of the first tree of the refit model rendered by plot_tree()]

    To translate the leaf values to the original data scale, use the scale and bias returned by get_scale_and_bias(). I extracted the leaves with the private _get_tree_leaf_values() helper; it returns string representations of the leaves, so some regex parsing is needed to recover the numeric values. I also hand-coded the expected value for each leaf, based on the data-generating process above.

    # Get the scale and bias from the model.
    sb = m.get_scale_and_bias()
    
    # Apply the scale and bias to the leaves of the tree; compare to expected
    # values for each leaf.
    import re
    [{'expected': [15, 25, 25, None, 20, 30, 30, None][i],
      'actual': (float(re.sub(r'^val = (-?[0-9]+([.][0-9]+)?).*$', '\\1', leaf))
                 * sb[0]) + sb[1]}
     for i, leaf in enumerate(m._get_tree_leaf_values(0))]
    

    And we see that the translated leaf values are not perfect, but they are at least in the right ballpark.

    [{'expected': 15, 'actual': 19.210155044555663},
     {'expected': 25, 'actual': 24.067155044555665},
     {'expected': 25, 'actual': 24.096155044555665},
     {'expected': None, 'actual': 22.624155044555664},
     {'expected': 20, 'actual': 21.309155044555663},
     {'expected': 30, 'actual': 26.244155044555665},
     {'expected': 30, 'actual': 26.249155044555664},
     {'expected': None, 'actual': 22.624155044555664}]
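
    (The two leaves with no expected value both translate to the same number, presumably because no training observations reach them and their raw leaf value is 0, leaving just the bias.)

    As a final check, the documentation's formula can be compared against predict() directly: sum the leaf value visited in each tree, then apply the scale and bias. The sketch below reuses sb, re, and the private _get_tree_leaf_values() helper from the snippet above, together with the documented calc_leaf_indexes(); it assumes the leaf ordering of the two functions matches, and it is only as precise as the leaf-value strings CatBoost prints.

    # Reconstruct the prediction for the first row of the pool.
    leaf_idx = m.calc_leaf_indexes(pool)[0]   # leaf index visited in each tree for row 0
    raw = sum(float(re.sub(r'^val = (-?[0-9]+([.][0-9]+)?).*$', '\\1',
                           m._get_tree_leaf_values(tree)[leaf]))
              for tree, leaf in enumerate(leaf_idx))
    print(raw * sb[0] + sb[1])   # approximately equal to...
    print(m.predict(pool)[0])    # ...the model's own prediction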