I am trying to replicate a specific decision tree trained inside a RandomForestRegressor, using DecisionTreeRegressor. However, I cannot get exactly the same results, even when using the exact same bootstrapped samples and the same hyperparameters. Has anyone tried this before and succeeded?
See my code below:
# Import packages
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
# Parameters
n_samples = 160
n_features = 20
# Create training data
Xs, y = make_regression(n_samples = n_samples, n_features = n_features, random_state = 0, shuffle = False)
# Create testing Xs
np.random.seed(0)
Xs_test = np.random.standard_normal((40, n_features))
#%%
# Random forest
n_estimators = 10
max_depth = 4
min_samples_leaf = 17
min_samples_split = 3
rf = RandomForestRegressor(random_state = 0, oob_score = False, max_features = None,
n_estimators = n_estimators,
max_depth = max_depth,
min_samples_leaf = min_samples_leaf,
min_samples_split = min_samples_split)
rf.fit(Xs, y)
y_pred = rf.predict(Xs_test)
# Plotting the 1st tree structure
tree = rf.estimators_[0]
plt.figure(figsize=(20, 10))
plot_tree(tree, filled = True, rounded = True, node_ids = True, fontsize = 16)
plt.show()
#%%
# Replicate the 1st tree
# Get the bootstrapped samples
sample_indices = rf.estimators_samples_[0]
bootstrap_Xs = Xs[sample_indices, :]
bootstrap_y = y[sample_indices]
decision_tree = DecisionTreeRegressor(random_state = tree.random_state, max_features = None,
max_depth = max_depth,
min_samples_leaf = min_samples_leaf,
min_samples_split = min_samples_split)
decision_tree.fit(bootstrap_Xs, bootstrap_y)
y_pred_check = decision_tree.predict(Xs_test)
# Plot the replicated decision tree
plt.figure(figsize=(20, 10))
plot_tree(decision_tree, filled = True, rounded = True, node_ids = True, fontsize = 16)
plt.show()
With the above, I get different predicted ys and different tree structures. Can anyone shed some light? I am curious what additional settings RandomForestRegressor applies on top of DecisionTreeRegressor.
Turns out (it was probably obvious, but I had never gone very deep into the internal workings of forests) that each sample is used only once for training (for the impurity computation, i.e. the splitting criterion, but not for values and scores), even if it appears several times in the subsample selection.
So, if, before training your tree, you replace
sample_indices = rf.estimators_samples_[0]
by
sample_indices = np.unique(rf.estimators_samples_[0])
you get something closer to what you expected.
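Concretely, that is the only change to the replication block from the question (a sketch reusing the variables defined there; bootstrap_Xs and bootstrap_y are rebuilt from the deduplicated indices before refitting):
sample_indices = np.unique(rf.estimators_samples_[0])  # unique bootstrap indices only
bootstrap_Xs = Xs[sample_indices, :]
bootstrap_y = y[sample_indices]
decision_tree = DecisionTreeRegressor(random_state = tree.random_state, max_features = None,
                                      max_depth = max_depth,
                                      min_samples_leaf = min_samples_leaf,
                                      min_samples_split = min_samples_split)
decision_tree.fit(bootstrap_Xs, bootstrap_y)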
But value estimation in the random forest is still done with the actual counts (samples that appear twice in the selection count twice). So, to find the same value (and errors, etc.) as in your tree, you have to average y over your initial bootstrap_y and bootstrap_Xs (the ones with duplicates), or, equivalently, weight each unique sample by its number of occurrences.
Here is how I compute value to take the weights into account (with bootstrap_Xs and bootstrap_y rebuilt from the unique indices, as above):
sample_indices, counts = np.unique(rf.estimators_samples_[0], return_counts=True) # Replacement for my previous replacement, to have weights
Then, for the left node (a leaf) of the root, for example, I can compute:
leftIdx = bootstrap_Xs[:, decision_tree.tree_.feature[0]] <= decision_tree.tree_.threshold[0]
weightedMean = (bootstrap_y[leftIdx]*counts[leftIdx]).sum() / counts[leftIdx].sum()
There you find the -194.902 of the random forest version.
While the -166.925 is simply bootstrap_y[leftIdx].mean() (so, no weights).
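As a quick cross-check (a sketch; for a regressor, tree_.value[node, 0, 0] holds the node's predicted mean, and node 1 is the root's left child in this plot):
print(weightedMean, tree.tree_.value[1, 0, 0])                           # weighted mean, as in the forest's tree (-194.902)
print(bootstrap_y[leftIdx].mean(), decision_tree.tree_.value[1, 0, 0])   # plain mean, as in decision_tree (-166.925)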
Note that it doesn't matter for this computation whether you use tree or decision_tree, since they are only used to find the features and thresholds, which are the same.
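A quick way to verify that claim (a sketch; it just compares the split arrays of the two fitted trees and should print True twice if they really coincide):
print(np.array_equal(tree.tree_.feature, decision_tree.tree_.feature))
print(np.array_equal(tree.tree_.threshold, decision_tree.tree_.threshold))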
Note also that, in the random forest tree, leftIdx.sum() (the number of samples that match x[7] ≤ -0.604) is tree.tree_.n_node_samples[1] (1 being the index of the left leaf of the root node here), while counts[leftIdx].sum() (the sum of their weights, that is, counting k times those that appear k times) is tree.tree_.weighted_n_node_samples[1].
But from decision_tree I can't get that, since I have removed all weights in order to reproduce the features and thresholds.
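To make those two bookkeeping quantities concrete (a sketch using the counts computed above; with the numbers quoted in this answer it should print 22 twice and 34 twice):
print(leftIdx.sum(), tree.tree_.n_node_samples[1])                    # unique samples reaching the leaf
print(counts[leftIdx].sum(), tree.tree_.weighted_n_node_samples[1])   # counting each duplicate occurrence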
Well, I did it for only one leaf. But I take it that you are not really trying to rebuild the entire tree, just to prove that you could.
Likewise, you can try to find the same error for the decision tree (bottom in the image) as the one in the RF tree (top).
For the decision tree, it is computed without weights. So the error for the same leaf (18882.793) is the mean of the squared differences between -166.925 and the 22 y values:
((bootstrap_y[leftIdx] - bootstrap_y[leftIdx].mean())**2).mean()
To get the error used by the same tree in the random forest (15882.24), you also have to weight each sample by its number of occurrences (so counts[leftIdx]):
n=counts[leftIdx].sum() # 34
weighted_mean = (bootstrap_y[leftIdx]*counts[leftIdx]).sum() / n # value. Same as before
weighted_mse = ((bootstrap_y[leftIdx] - weighted_mean)**2 * counts[leftIdx]).sum() / n
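Both errors can also be read back from the fitted trees (a sketch; for the default squared-error criterion, tree_.impurity[node] is the node's MSE, weighted by whatever sample weights the tree was trained with):
print(weighted_mse, tree.tree_.impurity[1])            # forest version, around 15882.24
print(((bootstrap_y[leftIdx] - bootstrap_y[leftIdx].mean())**2).mean(),
      decision_tree.tree_.impurity[1])                 # unweighted version, around 18882.793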
Bottom line is, in a random forest, samples may appear more than once in the subsamples, and the way to reproduce that with DecisionTreeRegressor is not straightforward.
What you did is just duplicate the duplicated entries (you have 160 samples, some redundant, in your bootstrap_Xs and bootstrap_y). That works to compute value or mse (applying a weight of 3 in a weighted mean, or having 3 copies of the same sample on which you compute a simple mean, is the same), but not to compute the features and thresholds, which are determined from the unique samples.
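If you want the DecisionTreeRegressor itself to get closer to the forest's tree, one experiment worth trying (an untested sketch, not something verified in this answer) is to keep the duplicates as weights rather than as repeated rows, by passing the counts to fit via sample_weight; whether it matches exactly still depends on details such as how min_samples_leaf counts samples versus weights:
sample_indices, counts = np.unique(rf.estimators_samples_[0], return_counts=True)
dt_weighted = DecisionTreeRegressor(random_state = tree.random_state, max_features = None,
                                    max_depth = max_depth,
                                    min_samples_leaf = min_samples_leaf,
                                    min_samples_split = min_samples_split)
dt_weighted.fit(Xs[sample_indices, :], y[sample_indices], sample_weight = counts)  # duplicates encoded as weights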
My answer doesn't produce an identical tree, but at least it shows that it is feasible to understand what the decision trees in a forest are, which I presume was your goal.
What I said in a comment wasn't, after all, the reason for the difference. But the comment is still valid: there are still some arbitrary decisions that can be made, even once you have frozen what the training samples of a decision tree are. It is only "almost" deterministic, because of ties in impurity.
Or even just because of different possible implementations of the threshold. It is a bit like computing a median: most people believe it is deterministic, and that the median of [1, 2, 3, 4] is 2.5, no doubt. But strictly speaking, that is only one possible median; 2.2 or 2.8 would qualify as well. So you could end up with a different threshold leading to the exact same predictions, yet a different tree. For the root node here, the threshold -0.604 is one possibility, but any threshold between -0.670 and -0.538 would have been equivalent. That is a minor consideration, but it is another example of something that can still be arbitrary in a decision tree and vary from one implementation to another, even once all the hyperparameters and training data are chosen.
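A tiny illustration of that point (a sketch with made-up feature values surrounding the -0.670 / -0.538 pair quoted above):
feature_col = np.array([-1.2, -0.670, -0.538, 0.3])   # illustrative values only
for t in (-0.65, -0.604, -0.55):
    print(t, feature_col <= t)   # same partition of the samples for every threshold in (-0.670, -0.538)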
And those are just the two examples that came to my mind; there are certainly many others.