python-2.7, scikit-learn, vectorization, decision-tree, binary-decision-diagram

Get back the original feature names after vectorization for a decision tree graph in Python scikit-learn?


I am using name features, among others, to predict y (binary classes). The name features are substrings of the name. I am using scikit-learn in Python.

Here is a small portion of X:

[{'substring=ry': True, 'substring=lo': True, 'substring=oui': True, 'firstLetter-firstName': u'm', 'substring=mar': True, 'avg-length': 5.0, 'substring=bl': True, 'lastLetter-lastName': u'n', 'substring=mary': True, 'substring=lou': True, 'metaphone=MR': True, 'location': u'richmond & wolfe, quebec, canada', 'substring=ma': True, 'substring=ui': True, 'substring=in': True, 'substring=ary': True, 'substring=loui': True, 'firstLetter-lastName': u'b', 'lastLetter-firstName': u'y', 'first-name': u'mary', 'substring=ou': True, 'last-name': u'blouin', 'substring=blo': True, 'substring=uin': True, 'metaphone=PLN': True, 'substring=ar': True, 'name-entity': 2, 'substring=blou': True, 'substring=ouin': True}]

Then I used DictVectorizer to vectorize X, which turned it into this...

  (0, 0)    5.0
  (0, 6798) 1.0
  (0, 9944) 1.0
  (0, 9961) 1.0
  (0, 11454)    1.0
  (0, 28287)    1.0
  (0, 28307)    1.0
  (0, 28483)    1.0
  (0, 33376)    1.0
  (0, 34053)    1.0
  (0, 36901)    2.0
  (0, 39167)    1.0
  (0, 39452)    1.0
  (0, 40797)    1.0
  (0, 40843)    1.0
  (0, 40853)    1.0
  (0, 51489)    1.0
  (0, 54903)    1.0
  (0, 55050)    1.0
  (0, 55058)    1.0
  (0, 55680)    1.0
  (0, 55835)    1.0
  (0, 55856)    1.0
  (0, 60698)    1.0
  (0, 60752)    1.0
  (0, 60759)    1.0
  (0, 64391)    1.0
  (0, 68278)    1.0
  (0, 68318)    1.0

The problem is that I completely lose track of what the new X represents. And since I need to produce a decision tree graph, I only get results like these, which I can't interpret:

digraph Tree {
0 [label="X[0] <= 4.5000\ngini = 0.5\nsamples = 25000", shape="box"] ;
1 [label="X[39167] <= 0.5000\ngini = 0.0734231704267\nsamples = 891", shape="box"] ;
0 -> 1 ;
2 [label="X[36901] <= 2.5000\ngini = 0.0575468244736\nsamples = 702", shape="box"] ;
1 -> 2 ;
3 [label="X[58147] <= 0.5000\ngini = 0.0359355212331\nsamples = 442", shape="box"] ;
2 -> 3 ;
4 [label="X[9977] <= 0.5000\ngini = 0.0316694756485\nsamples = 396", shape="box"] ;
3 -> 4 ;
5 [label="X[29713] <= 0.5000\ngini = 0.0275525222406\nsamples = 352", shape="box"] ;
4 -> 5 ;
6 [label="X[9788] <= 0.5000\ngini = 0.0244412457957\nsamples = 319", shape="box"] ;
5 -> 6 ;
7 [label="X[46929] <= 0.5000\ngini = 0.0226406785428\nsamples = 300", shape="box"] ;
6 -> 7 ;
8 [label="X[45465] <= 0.5000\ngini = 0.0209286264458\nsamples = 282", shape="box"] ;
7 -> 8 ;
9 [label="X[45718] <= 0.5000\ngini = 0.0194016759597\nsamples = 266", shape="box"] ;
8 -> 9 ;
10 [label="X[28311] <= 0.5000\ngini = 0.0178698827564\nsamples = 250", shape="box"] ;
9 -> 10 ;...

My Python code:

from sklearn.feature_extraction import DictVectorizer
from sklearn import tree
classifierUsed2 = tree.DecisionTreeClassifier(class_weight="auto")
dv = DictVectorizer()

newX = dv.fit_transform(all_dict)
X_train, X_test, y_train, y_test = cross_validation.train_test_split(newX, y, test_size=testTrainSplit)
classifierUsed2.fit(X_train, y_train)
y_train_predictions = classifierUsed2.predict(X_train)
y_test_predictions = classifierUsed2.predict(X_test)
tree.export_graphviz(classifierUsed2, out_file='graph.dot')     

Solution

  • So to incorporate the additional information of the feature names, we first need to take a look at what newX is. According to the docs (and the print output you show in your example), newX is a sparse matrix with n rows and d columns, where n is the number of samples and d is the number of unique features. Each column is identified by an integer index that maps back to a feature name found in the original data. So we want to recover that mapping from column indices to feature names, and we can use get_feature_names() to do it.
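
    For instance, here is a minimal sketch of that lookup, assuming the dv and newX from your code (the exact column numbers depend on your full dataset):

    feature_names = dv.get_feature_names()  # list of names, indexed by column number
    # Look up the name behind any column index that appears in the tree; for
    # your sample, column 0 is 'avg-length' (the 5.0 entry in your printout):
    print feature_names[0]
    # Or walk every non-zero entry of the first row of the sparse matrix:
    row = newX[0]
    for idx, value in zip(row.indices, row.data):
        print feature_names[idx], value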

    After taking a look at the documentation for export_graphviz, I found that it has a parameter for setting the feature names, simply called feature_names. So all we need to do is pass that parameter when writing the graph to file. I'll spell this out with an example:

    from sklearn.feature_extraction import DictVectorizer
    from sklearn import tree
    from sklearn import cross_validation
    classifierUsed2 = tree.DecisionTreeClassifier(class_weight="auto")
    dv = DictVectorizer()
    

    Here we just define everything as in your example (I had to add the cross_validation import).

    all_dict = [ {'dog':1, 'cat':1, 'mouse':0, 'elephant':1, 'tiger':1},
             {'dog':0, 'cat':1, 'mouse':0, 'elephant':0, 'tiger':0},
             {'dog':0, 'cat':1, 'mouse':1, 'elephant':1, 'tiger':0},
             {'dog':0, 'cat':1, 'mouse':1, 'elephant':1, 'tiger':1},
             {'dog':0, 'cat':0, 'mouse':1, 'elephant':1, 'tiger':0},
             {'dog':1, 'cat':1, 'mouse':0, 'elephant':0, 'tiger':1}]
    y = [1,0,0,0,1,1]
    testTrainSplit=2
    

    Here's some sample data and variable initializations

    newX = dv.fit_transform(all_dict)
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(newX, y, test_size=testTrainSplit)
    classifierUsed2.fit(X_train, y_train)
    y_train_predictions = classifierUsed2.predict(X_train)
    y_test_predictions = classifierUsed2.predict(X_test)
    tree.export_graphviz(classifierUsed2, feature_names=dv.get_feature_names(), out_file='graph.dot')
    

    And the critical step is the new parameter feature_names=dv.get_feature_names(). Don't be afraid to look around the documentation for these libraries and for any functions you want to call within them; they can be very valuable resources!
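
    As a quick sanity check (a sketch, using the toy data above): DictVectorizer stores its feature names sorted alphabetically, so you can print them to see exactly which name each column index stands for, and the node labels in graph.dot will now show these names instead of bare indices:

    print dv.get_feature_names()
    # ['cat', 'dog', 'elephant', 'mouse', 'tiger']

    You can then render the labeled graph with Graphviz, for example with dot -Tpng graph.dot -o graph.png.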