I am using name-based and other features to predict y (binary classes). The name features are substrings of the name. I am using Python scikit-learn.
Here is a small portion of X:
[{'substring=ry': True, 'substring=lo': True, 'substring=oui': True, 'firstLetter-firstName': u'm', 'substring=mar': True, 'avg-length': 5.0, 'substring=bl': True, 'lastLetter-lastName': u'n', 'substring=mary': True, 'substring=lou': True, 'metaphone=MR': True, 'location': u'richmond & wolfe, quebec, canada', 'substring=ma': True, 'substring=ui': True, 'substring=in': True, 'substring=ary': True, 'substring=loui': True, 'firstLetter-lastName': u'b', 'lastLetter-firstName': u'y', 'first-name': u'mary', 'substring=ou': True, 'last-name': u'blouin', 'substring=blo': True, 'substring=uin': True, 'metaphone=PLN': True, 'substring=ar': True, 'name-entity': 2, 'substring=blou': True, 'substring=ouin': True}]
Then I used a DictVectorizer to vectorize X into this:
(0, 0) 5.0
(0, 6798) 1.0
(0, 9944) 1.0
(0, 9961) 1.0
(0, 11454) 1.0
(0, 28287) 1.0
(0, 28307) 1.0
(0, 28483) 1.0
(0, 33376) 1.0
(0, 34053) 1.0
(0, 36901) 2.0
(0, 39167) 1.0
(0, 39452) 1.0
(0, 40797) 1.0
(0, 40843) 1.0
(0, 40853) 1.0
(0, 51489) 1.0
(0, 54903) 1.0
(0, 55050) 1.0
(0, 55058) 1.0
(0, 55680) 1.0
(0, 55835) 1.0
(0, 55856) 1.0
(0, 60698) 1.0
(0, 60752) 1.0
(0, 60759) 1.0
(0, 64391) 1.0
(0, 68278) 1.0
(0, 68318) 1.0
The problem is that I completely lose track of what the new X represents. And since I need to produce a decision tree graph, I only get results like these, which I can't interpret:
digraph Tree {
0 [label="X[0] <= 4.5000\ngini = 0.5\nsamples = 25000", shape="box"] ;
1 [label="X[39167] <= 0.5000\ngini = 0.0734231704267\nsamples = 891", shape="box"] ;
0 -> 1 ;
2 [label="X[36901] <= 2.5000\ngini = 0.0575468244736\nsamples = 702", shape="box"] ;
1 -> 2 ;
3 [label="X[58147] <= 0.5000\ngini = 0.0359355212331\nsamples = 442", shape="box"] ;
2 -> 3 ;
4 [label="X[9977] <= 0.5000\ngini = 0.0316694756485\nsamples = 396", shape="box"] ;
3 -> 4 ;
5 [label="X[29713] <= 0.5000\ngini = 0.0275525222406\nsamples = 352", shape="box"] ;
4 -> 5 ;
6 [label="X[9788] <= 0.5000\ngini = 0.0244412457957\nsamples = 319", shape="box"] ;
5 -> 6 ;
7 [label="X[46929] <= 0.5000\ngini = 0.0226406785428\nsamples = 300", shape="box"] ;
6 -> 7 ;
8 [label="X[45465] <= 0.5000\ngini = 0.0209286264458\nsamples = 282", shape="box"] ;
7 -> 8 ;
9 [label="X[45718] <= 0.5000\ngini = 0.0194016759597\nsamples = 266", shape="box"] ;
8 -> 9 ;
10 [label="X[28311] <= 0.5000\ngini = 0.0178698827564\nsamples = 250", shape="box"] ;
9 -> 10 ;...
My Python code:
from sklearn.feature_extraction import DictVectorizer
from sklearn import tree
classifierUsed2 = tree.DecisionTreeClassifier(class_weight="auto")
dv = DictVectorizer()
newX = dv.fit_transform(all_dict)
X_train, X_test, y_train, y_test = cross_validation.train_test_split(newX, y, test_size=testTrainSplit)
classifierUsed2.fit(X_train, y_train)
y_train_predictions = classifierUsed2.predict(X_train)
y_test_predictions = classifierUsed2.predict(X_test)
tree.export_graphviz(classifierUsed2, out_file='graph.dot')
So, to incorporate the additional information of the feature names, we first need to take a look at what newX is. According to the docs (and the print output you show in your example), newX is a sparse matrix with n rows and d columns, where n is the number of samples and d is the number of unique features. Each column is identified by an integer index that maps back to a feature name found in the original data. So we want to recover that mapping from column indices to feature names, and we can use get_feature_names() to help.
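To make this concrete, here's a minimal sketch (with made-up features) of recovering that mapping from a fitted DictVectorizer: get_feature_names() returns a list of names indexed by column number, and the vocabulary_ attribute holds the reverse mapping from name to column index.
from sklearn.feature_extraction import DictVectorizer

# Illustrative data only; these feature names are made up.
dv = DictVectorizer()
X = dv.fit_transform([{'substring=ma': True, 'avg-length': 5.0}])

names = dv.get_feature_names()  # feature names, indexed by column number
print(names)           # ['avg-length', 'substring=ma']
print(names[0])        # the feature behind column 0 of X
print(dv.vocabulary_)  # reverse mapping: {'avg-length': 0, 'substring=ma': 1}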
After taking a look at the documentation for export_graphviz, I found that there's a parameter for setting the feature names, simply called feature_names. So all we need to do is pass that parameter when writing the graph to file. I'll spell this out with an example:
from sklearn.feature_extraction import DictVectorizer
from sklearn import tree
from sklearn import cross_validation
classifierUsed2 = tree.DecisionTreeClassifier(class_weight="auto")
dv = DictVectorizer()
Here we just define everything as in your example (I had to add an import statement for cross_validation).
all_dict = [ {'dog':1, 'cat':1, 'mouse':0, 'elephant':1, 'tiger':1},
{'dog':0, 'cat':1, 'mouse':0, 'elephant':0, 'tiger':0},
{'dog':0, 'cat':1, 'mouse':1, 'elephant':1, 'tiger':0},
{'dog':0, 'cat':1, 'mouse':1, 'elephant':1, 'tiger':1},
{'dog':0, 'cat':0, 'mouse':1, 'elephant':1, 'tiger':0},
{'dog':1, 'cat':1, 'mouse':0, 'elephant':0, 'tiger':1}]
y = [1,0,0,0,1,1]
testTrainSplit=2
Here's some sample data and variable initializations
newX = dv.fit_transform(all_dict)
X_train, X_test, y_train, y_test = cross_validation.train_test_split(newX, y, test_size=testTrainSplit)
classifierUsed2.fit(X_train, y_train)
y_train_predictions = classifierUsed2.predict(X_train)
y_test_predictions = classifierUsed2.predict(X_test)
tree.export_graphviz(classifierUsed2, feature_names=dv.get_feature_names(), out_file='graph.dot')
And the critical step is the new parameter feature_names=dv.get_feature_names(). Don't be afraid to look around the documentation for these libraries and for any functions you want to call within them; they can be very valuable resources!
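As a follow-up: once graph.dot has been written this way, you can render it to an image by running dot -Tpng graph.dot -o graph.png from the command line (assuming Graphviz is installed). The node labels will then show the actual feature names instead of opaque indices like X[39167].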