Here's my basic code for two-feature classification of the well-known Iris dataset:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from graphviz import Source
iris = load_iris()
iris_limited = iris.data[:, [2, 3]] # This gets only petal length & width.
# I'm using the max depth as a way to avoid overfitting
# and simplify the tree since I'm using it for educational purposes
clf = DecisionTreeClassifier(criterion="gini",
                             max_depth=3,
                             random_state=42)
clf.fit(iris_limited, iris.target)

visualization_raw = export_graphviz(clf,
                                    out_file=None,
                                    special_characters=True,
                                    feature_names=["length", "width"],
                                    class_names=iris.target_names,
                                    node_ids=True)

visualization_source = Source(visualization_raw)
visualization_png_bytes = visualization_source.pipe(format='png')
with open('my_file.png', 'wb') as f:
    f.write(visualization_png_bytes)
When I checked the visualization of my tree, I found this:
This is a fairly normal tree at first glance, but I noticed something odd about it. Node #6 has 46 samples in total, only one of which is versicolor, so the node is marked as virginica. That seems like a fairly reasonable place to stop. However, for some reason I can't understand, the algorithm decides to split it further into nodes #7 and #8. The odd thing is that the single versicolor sample still gets misclassified, since both child nodes end up with the class virginica anyway. Why is it doing this? Does it blindly look only at the Gini decrease, without checking whether the split makes any difference to the predictions at all? That seems like odd behaviour to me, and I can't find it documented anywhere.
Is it possible to disable this behaviour, or is the split in fact correct?
From the Decision Trees documentation:
scikit-learn uses an optimized version of the CART algorithm
The CART algorithm is described by Breiman et al. in the book of the same name, Classification and Regression Trees. In CART, the authors describe how they experimented with numerous criteria for stopping early while growing the decision tree. Eventually they discovered what is now known as Minimal Cost-Complexity Pruning (MCCP): rather than using ad-hoc stopping criteria, grow a large tree and then prune it back, using cross-validation to compare the various subtrees considered by the algorithm. MCCP is described in detail in chapter 3 of CART, and a brief summary is available in the scikit-learn Decision Trees documentation.
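If you want to see the subtree sequence this pruning procedure works with, scikit-learn exposes it through DecisionTreeClassifier.cost_complexity_pruning_path. Below is a minimal sketch (the variable names are my own, and I reuse the two-feature data from your question):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:, [2, 3]], iris.target

# Ask for the sequence of effective alphas at which MCCP prunes away
# successive subtrees, along with the total leaf impurity of each subtree.
tree = DecisionTreeClassifier(criterion="gini", random_state=42)
path = tree.cost_complexity_pruning_path(X, y)
for alpha, impurity in zip(path.ccp_alphas, path.impurities):
    print(f"alpha={alpha:.5f}  total leaf impurity={impurity:.5f}")

Roughly speaking, each value in path.ccp_alphas is the effective alpha at which the next, smaller subtree becomes the better trade-off between total leaf impurity and tree size.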
Now that we know that sklearn.tree.DecisionTreeClassifier implements the CART algorithm, we know many things. For one, with the default settings, unless the max_depth parameter is provided, the tree is expanded completely. More precisely, as mentioned in the documentation for DecisionTreeClassifier.max_depth:
max_depth: int, default=None
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
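As a quick check of this default (a sketch reusing your two-feature data; output omitted), you can grow an unconstrained tree and inspect its depth and leaves:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
iris_limited = iris.data[:, [2, 3]]

# No max_depth: the tree keeps splitting until its leaves are pure
# or contain fewer than min_samples_split samples.
unconstrained = DecisionTreeClassifier(criterion="gini", random_state=42)
unconstrained.fit(iris_limited, iris.target)
print("depth:", unconstrained.get_depth())
print("leaves:", unconstrained.get_n_leaves())

# Count the leaves that are still impure; these should be the leaves whose
# samples have identical features but different labels, so no split can
# separate them.
t = unconstrained.tree_
is_leaf = t.children_left == -1
print("impure leaves:", int((t.impurity[is_leaf] > 0).sum()))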
This default behavior answers your first question of why node #6 was split: node #6 was not pure according to the Gini criterion, and therefore it was split into node #8, a pure node, and node #7, which is still impure. Following this logic, your next question seems quite reasonable:
The odd thing is that the single versicolor sample still gets misclassified, since both child nodes end up with the class virginica anyway. Why is it doing this?
That is, why was node #7 not expanded? A key detail is that your tree has depth 3 only because you passed max_depth=3 to the constructor. In other words, you explicitly prevented the DecisionTreeClassifier from expanding node #7. If you remove the constraint, you get the following tree from your code:
At first glance, I thought that this would solve the problem. Indeed, almost every node in this new tree (call it T') is pure, except for node #7 (now labelled node #13)! Recently, I had to implement decision trees and random forests from scratch, so I had an intuition about what was going on. My hypothesis was that, although the samples in node #13 have different labels, they have identical features. This turned out to be correct:
>>> X = iris_limited
>>> node13 = X[(X[:, 0] > 2.45) & (X[:, 1] > 1.75) & (X[:, 0] <= 4.85)]
>>> node13
array([[4.8, 1.8],
       [4.8, 1.8],
       [4.8, 1.8]])
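As a quick follow-up (a sketch; output omitted), you can pull the same three rows out of the full four-feature matrix and check whether the sepal columns that iris_limited dropped would have told these flowers apart:

>>> mask = (iris.data[:, 2] == 4.8) & (iris.data[:, 3] == 1.8)
>>> iris.data[mask]       # the same three flowers, sepal columns included
>>> iris.target[mask]     # their labels (1 = versicolor, 2 = virginica)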
This kind of behavior is not unusual when you perform variable selection on your data: in this case, you defined iris_limited to be the subset of the iris data that excludes the sepal length and sepal width variables. A client of the DecisionTreeClassifier (such as a RandomForestClassifier) may also request variable selection explicitly by specifying the max_features parameter: the tree will then consider a random subset of the variables at every split, which in effect decorrelates the multiple trees T0, T1, ..., Tk that are built from the same dataset.
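To make that concrete, here is a minimal sketch of such a client (RandomForestClassifier, max_features, and n_estimators are standard scikit-learn parameters; the variable names are mine):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# A single tree restricted to one randomly chosen candidate feature per split.
single_tree = DecisionTreeClassifier(max_features=1, random_state=0)
single_tree.fit(iris.data, iris.target)

# A forest passes max_features down to every tree it builds, so the trees
# T0, T1, ..., Tk end up decorrelated even though they are grown from
# (bootstrap samples of) the same dataset.
forest = RandomForestClassifier(n_estimators=100, max_features=1, random_state=0)
forest.fit(iris.data, iris.target)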
It is important to note that, by default, sklearn.tree.DecisionTreeClassifier does not perform MCC pruning. To enable it (which you definitely should if you are growing a single tree rather than a forest or another ensemble), set the hyperparameter ccp_alpha to some positive real number, which you can choose using cross-validation. In their (critical) response to the article Tree-Structured Classification Via Generalized Discriminant Analysis by Loh and Vanichsetakul, Breiman and Friedman make the following statement about the efficacy of MCC pruning:
The two ways that the procedure presented in this article (FACT) gains speed over CART is by using an ad hoc top-down stopping rule, and by not implementing surrogate splits. Top-down stopping rules were used in all of the early predecessors of CART (AID, THAID, etc.). They were highly criticized (with good reason) on this point, and this was the principal reason for their lack of acceptance in the statistical community. The optimal complexity tree-pruning algorithm (based on cross-validatory choice) implemented in CART is probably the most important contribution of Breiman et al. (1984) to the evolution of tree-structured methodology. It tends to produce right-sized trees reliably.
In the course of the research that led to CART, almost two years were spent experimenting with different stopping rules. Each stopping rule was tested on hundreds of simulated data sets with different structures. Each new stopping rule failed on some data set. It was not until a very large tree was built and then pruned, using cross-validation to govern the degree of pruning, that we obtained something that worked consistently.
Let us apply MCCP to your tree. sklearn.model_selection provides extensive facilities for cross-validation. We will use some basic tools to make a reasonable selection for alpha:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import GridSearchCV
# from sklearn.model_selection import train_test_split
from graphviz import Source
iris = load_iris()
iris_limited = iris.data[:, [2, 3]]
# X_train, X_test, y_train, y_test = train_test_split(iris_limited, iris.target, test_size=0.3, random_state=42)
parameters = [
    {'ccp_alpha': [1/(2**k) for k in range(5, 23)]}
]

search = GridSearchCV(
    DecisionTreeClassifier(criterion="gini", random_state=42),
    parameters, scoring='accuracy', cv=10
)
# search.fit(X_train, y_train)
search.fit(iris_limited, iris.target)
>>> search.best_params_
{'ccp_alpha': 0.015625}
Now let us construct the same visualization of the tree as we did before, but now we will use GridSearchCV.best_estimator_, which is the DecisionTreeClassifier that was chosen by the search over the parameter space for alpha that we specified.
gviz_raw = export_graphviz(
    search.best_estimator_,
    out_file=None,
    special_characters=True,
    feature_names=["length", "width"],
    class_names=iris.target_names,
    node_ids=True
)
with open('iris-ccp.png', 'wb') as f:
    f.write(Source(gviz_raw).pipe(format='png'))
We get the following tree:
Notice that we produced a tree T* that is a subtree of both of the preceding trees. For example, the branch down to node #6 is identical to your initial example. For the purpose of presentation I used the same (complete) Iris dataset as you did (normally, we would split the data into a training set and a test set, and then perform cross-validation only on the training samples). The pruned tree scores slightly worse on the full Iris data:
>>> clf.score(iris_limited, iris.target)
0.9933333333333333
>>> clf_star = search.best_estimator_  # the pruned tree T*
>>> clf_star.score(iris_limited, iris.target)
0.9733333333333334
Cross-validation is an effective methodology, however, and our tree should generalize better. Furthermore, as per your request, the pruned tree did not expand node #6. You shouldn't just take my word for it, though. Train a DecisionTreeClassifier without pruning, as you did before, and another tree T* using cross-validation, as we just described. Only this time, train both trees on half of the data:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target, test_size=0.5, random_state=42)
clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, Y_train)
search = GridSearchCV(
    DecisionTreeClassifier(criterion="gini", random_state=42),
    parameters, scoring='accuracy', cv=10
)
search.fit(X_train, Y_train)
>>> clf.score(X_test, Y_test)
0.9066666666666666
>>> search.best_estimator_.score(X_test, Y_test)
0.9333333333333333