pythonpython-3.xmachine-learningscikit-learndecision-tree

Why is my decision tree creating a split that doesn't actually divide the samples?


Here's my basic code for two-feature classification of the well-known Iris dataset:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from graphviz import Source

iris = load_iris()
iris_limited = iris.data[:, [2, 3]] # This gets only petal length & width.

# I'm using the max depth as a way to avoid overfitting
# and simplify the tree since I'm using it for educational purposes
clf = DecisionTreeClassifier(criterion="gini",
                             max_depth=3,
                             random_state=42)

clf.fit(iris_limited, iris.target)

visualization_raw = export_graphviz(clf, 
                                    out_file=None,
                                    special_characters=True,
                                    feature_names=["length", "width"],
                                    class_names=iris.target_names,
                                    node_ids=True)

visualization_source = Source(visualization_raw)
visualization_png_bytes = visualization_source.pipe(format='png')
with open('my_file.png', 'wb') as f:
    f.write(visualization_png_bytes)

When I checked the visualization of my tree, I found this:

Tree visualization result

This is a fairly normal tree at first glance, but I noticed something odd about it. Node #6 has 46 samples total, only one of which is in versicolor, so the node is marked as virginica. This seems like a fairly reasonable place to stop. However, for some reason I can't understand, the algorithm decides to split further into nodes #7 and #8. But the odd thing is, the 1 versicolor still in there still gets misclassified, since both the nodes end up having the class of virginica anyway. Why is it doing this? Does it blindly look at only the Gini decrease without looking at whether it makes a difference at all - that seems like odd behaviour to me, and I can't find it documented anywhere.

Is it possible to disable, or is this in fact correct?


Solution

  • From the Decision Trees documentation:

    scikit-learn uses an optimized version of the CART algorithm

    The CART algorithm is described by Breiman et al. in the book with the same title, that is Classification and Regression Trees. In CART, the authors describe how they experimented with numerous criteria for stopping early while growing the decision tree. Eventually they discovered what is now known as Minimal-Cost-Complexity Pruning (MCCP): rather than using ad-hoc stopping criteria, grow a large tree and then prune it back, using cross-validation to compare various subtrees considered by the algorithm. MCCP is described in detail in chapter 3 of CART, and a brief summary is available in the Decision Tree documentation.

    Now that we know that the sklearn.DecisionTreeClassifier implements the CART algorithm, we know many things. For one, with the default settings, unless the max_depth parameter is provided, the tree is expanded completely. More precisely, as mentioned in the documentation for DecisionTreeClassifier.max_depth:

    max_depth: int, default=None

    The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

    This answers your first question of why node #6 was split: node #6 was not pure, as defined by the Gini Index function, and therefore it was split into node #8, a pure node, and node #7. Following this logic, your next question seems quite reasonable:

    But the odd thing is, the 1 versicolor still in there still gets misclassified, since both the nodes end up having the class of virginica anyway. Why is it doing this?

    That is, why was Node #7 not expanded? A key detail is that your tree has depth 3 only because you passed max_depth=3 to the constructor. In other words, you explicitly constrained DecisionTreeClassifier from expanding node #7. If you remove the constraint, you get the following tree from your code:

    enter image description here

    At first glance, I thought that this would solve the problem. Indeed, almost every node is pure in T', that is except for node #7 (now labeled node #13)! Recently, I had to implement Decision Trees and Random Forests from scratch, and thus I had an intuition about what was going on. My hypothesis was that, although the samples in node #13 have different labels, they actually have identical features. This turned out to be correct:

    
    >>> X = iris_limited 
    >>> node13 = X[(X[:, 0] > 2.45) & (X[:, 1] > 1.75) & (X[:, 0] <= 4.85)]
    >>> node13
    array([[4.8, 1.8],
           [4.8, 1.8],
           [4.8, 1.8]])
    

    This kind of behavior is not exceptional in the case when you perform variable selection on your data: in this instance, you defined iris_limited to be the subset of the irisdata that excludes the sepal length and sepal width variables. A client of DecisionTree (such as a RandomForest object) may explicitly request variable selection by specifying the max_features parameter: in this case the DecisionTree will choose random subsets of the variables at every split, which will in effect decorrelate multiple trees T0, T1, ..., Tk that are built from the same dataset.

    It is important to note that by default, the sklearn.DecisionTreeClassifier does not perform MCC pruning. To enable this feature (which you definitely should if you are growing a single tree, rather than a forest or an ensemble) set the hyperparameter ccp_alpha to some real number, which you can choose using cross-validation. In their (critical) response to the article Tree-Structured Classification Via Generalized Discriminant Analysis by Loh and Vanichsetakul, Breiman and Friedman make the following statement about the efficacy of MCC pruning:

    The two ways that the procedure presented in this article (FACT) gains speed over CART is by using an ad hoc top-down stopping rule, and by not implementing surrogate splits. Topdown stopping rules were used in all of the early predecessors of CART (AID, THAID, etc.). They were highly criticized (with good reason) on this point, and this was the principal reason for their lack of acceptance in the statistical community. The optimal complexity tree-pruning algorithm (based on cross-validatory choice) implemented in CART is probably the most important contribution of Breiman et al. (1984) to the evolution of tree-structured methodology. It tends to produce right-sized trees reliably.

    In the course of the research that led to CART, almost two years were spent experimenting with different stopping rules. Each stopping rule was tested on hundreds of simulated data sets with different structures. Each new stopping rule failed on some data set. It was not until a very large tree was built and then pruned, using cross-validation to govern the degree of pruning, that we obtained something that worked consistently.

    Let us apply MCCP to your tree. sklearn.model_selection provides extensive facilities for cross-validation. We will use some basic tools to make a reasonable selection for alpha:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_graphviz
    from sklearn.model_selection import GridSearchCV
    # from sklearn.model_selection import train_test_split
    from graphviz import Source
    
    iris = load_iris()
    iris_limited = iris.data[:, [2, 3]]
    # X_train, X_test, y_train, y_test = train_test_split(iris_limited, iris.target, test_size=0.3, random_state=42)
    
    parameters = [
        {'ccp_alpha': [1/(2**k) for k in range(5, 23)]}
    ]
    search = GridSearchCV(
        DecisionTreeClassifier(criterion="gini", random_state=42),
        parameters, scoring='accuracy', cv=10
    )
    # search.fit(X_train, y_train)
    search.fit(iris_limited, iris.target)
    
    >>> search.best_params_
    
    {'ccp_alpha': 0.015625}
    

    Now let us construct the same visualization of the tree as we did before, but now we will use the GridSearchCV.best_estimator_, which is the DecisionTree that was chosen by the search over the parameter space for alpha that we specified.

    gviz_raw = export_graphviz(
        search.best_estimator_,
        out_file=None,
        special_characters=True,
        feature_names=["length", "width"],
        class_names=iris.target_names,
        node_ids=True
    )
    with open('iris-ccp.png', 'wb') as f:
        f.write(Source(gviz_raw).pipe(format='png'))
    

    We get the following tree:

    DecisionTree for the Iris dataset constructed with Minimum-Cost-Complexity Pruning

    Notice that we produced a tree T* that is a subtree of both of the preceding trees. For example, the branch down to node #6 is identical to your initial example. I used the same (complete) Iris dataset as you did for the purpose of presentation (normally, we would split the data into a training and test set, and then perform cross validation only on the training samples). This tree performs slightly worse on the Iris data:

    >>> clf.score(iris_limited, iris.target)
    0.9933333333333333
    
    >>> clf_star.score(iris_limited, iris.target)
    0.9733333333333334
    

    Cross-validation is an effective methodology, however, and our tree should generalize better. Furthermore, as per your request, the pruned tree did not expand node #6. You shouldn't believe me, though. Just take a look. Train a DecisionTree without pruning, as you did before, and another tree T*, created using cross-validation, as we just described. Only this time, train these trees on half of the data:

    from sklearn.model_selection import train_test_split
    X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target, test_size=0.5, random_state=42)
    
    clf = DecisionTreeClassifier(criterion="gini", random_state=42)
    clf.fit(X_train, Y_train)
    search = GridSearchCV(
        DecisionTreeClassifier(criterion="gini", random_state=42),
        parameters, scoring='accuracy', cv=10)
    search.fit(X_train, Y_train)
    
    >>> clf.score(X_test, Y_test)
    0.9066666666666666
    
    >>> search.best_estimator_.score(X_test, Y_test)
    0.9333333333333333