Tags: machine-learning, scikit-learn, decision-tree

How do I find which attributes my tree splits on, when using scikit-learn?


I have been exploring scikit-learn, making decision trees with both entropy and gini splitting criteria, and exploring the differences.

My question is: how can I "open the hood" and find out exactly which attributes the trees split on at each level, along with their associated information values, so I can see where the two criteria make different choices?

So far, I have explored the 9 methods outlined in the documentation. They don't appear to allow access to this information. But surely this information is accessible? I'm envisioning a list or dict that has entries for node and gain.


Solution

  • Directly from the documentation ( http://scikit-learn.org/0.12/modules/tree.html ):

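    from io import StringIO
    from sklearn import tree

    out = StringIO()
    tree.export_graphviz(clf, out_file=out)  # writes the DOT source into `out`
    dot_source = out.getvalue()              # the tree as a Graphviz DOT string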
    

    The standalone StringIO module no longer exists in Python 3; import StringIO from the io module instead, as above.
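
    As a minimal end-to-end sketch of the export approach: the snippet below fits a small tree and gets the DOT source back as a string. The iris dataset, the max_depth setting and the variable names are illustrative assumptions, not part of the original answer, and it relies on a scikit-learn version where out_file=None returns the DOT text (much newer than the 0.12 docs linked above).

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Illustrative data and settings (assumptions, not from the question)
iris = load_iris()
clf = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# With out_file=None, export_graphviz returns the DOT source as a string.
# Each node label contains the split feature, threshold and impurity.
dot_source = export_graphviz(clf, out_file=None, feature_names=iris.feature_names)
print(dot_source)
```

    The printed text can be rendered with the Graphviz dot tool, or simply read as-is to see the split at each node.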

    There is also the tree_ attribute of your fitted decision tree object, which gives direct access to the whole structure.

    You can simply read its arrays, each indexed by node id (node 0 is the root):

    clf.tree_.children_left   # id of each node's left child (-1 for leaves)
    clf.tree_.children_right  # id of each node's right child (-1 for leaves)
    clf.tree_.feature         # index of the feature each node splits on (-2 for leaves)
    clf.tree_.threshold       # split threshold at each node (-2 for leaves)
    clf.tree_.value           # class distribution (or predicted value) at each node
    

    For more details, look at the source code of the export_graphviz function.
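
    To get the "list or dict with entries for node and gain" the question asks for, one option is to walk these arrays yourself. The sketch below is an assumption-laden illustration: describe_splits is a hypothetical helper (not a scikit-learn function), the gain is computed here as the parent impurity minus the sample-weighted impurity of the two children, and iris is used only as example data. It also reads tree_.impurity and tree_.weighted_n_node_samples, two further arrays on the same object.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier


def describe_splits(clf, feature_names):
    """Hypothetical helper: one dict per internal node of a fitted tree."""
    t = clf.tree_
    rows = []
    stack = [(0, 0)]  # (node id, depth), starting at the root
    while stack:
        node, depth = stack.pop()
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaves have no children, so there is no split
            continue
        n = t.weighted_n_node_samples[node]
        n_left = t.weighted_n_node_samples[left]
        n_right = t.weighted_n_node_samples[right]
        # Impurity decrease: parent impurity minus weighted child impurity
        gain = t.impurity[node] - (
            n_left * t.impurity[left] + n_right * t.impurity[right]
        ) / n
        rows.append({
            "node": int(node),
            "depth": depth,
            "feature": feature_names[t.feature[node]],
            "threshold": float(t.threshold[node]),
            "impurity": float(t.impurity[node]),
            "gain": float(gain),
        })
        # Push right first so the left subtree is visited first
        stack.append((right, depth + 1))
        stack.append((left, depth + 1))
    return rows


iris = load_iris()
for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(criterion=criterion, max_depth=2, random_state=0)
    clf.fit(iris.data, iris.target)
    print(criterion)
    for row in describe_splits(clf, iris.feature_names):
        print("  ", row)
```

    Running it once per criterion, as in the loop, lets you compare the two lists side by side and spot the nodes where gini and entropy pick different features or thresholds.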

    More generally, you can use the inspect module

    from inspect import getmembers
    print( getmembers( clf.tree_ ) )
    

    to list all of the object's members.
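
    The raw getmembers output is long, so it can help to filter it. A small sketch, assuming you only care about public, non-callable members (the data arrays and counters); the filtering rule and the iris example data are illustrative choices, not from the original answer:

```python
from inspect import getmembers

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Illustrative fitted tree (assumption: any fitted tree works the same way)
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Keep only public members that are data rather than methods
data_members = [
    name
    for name, value in getmembers(clf.tree_)
    if not name.startswith("_") and not callable(value)
]
print(data_members)
```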

    Decision tree visualization from sklearn docs