python scikit-learn nlp decision-tree countvectorizer

How to identify feature names from indices in a decision tree using scikit-learn’s CountVectorizer?


I have the following data for training a model to detect whether a sentence is about:

screenshot of data consisting of a text column and label column

I ran the following code to train a DecisionTreeClassifier() model and then view the tree visualisation:

import numpy as np
from numpy.random import seed
import random as rn
import os
import pandas as pd
seed_num = 1
os.environ['PYTHONHASHSEED'] = '0'
np.random.seed(seed_num)
rn.seed(seed_num)

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

dummy_train = pd.read_csv('dummy_train.csv')

tree_clf = tree.DecisionTreeClassifier()

X_train = dummy_train["text"]
y_train = dummy_train["label"]

dt_tree_pipe = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1, 1), binary=True)),
    ('tfidf', TfidfTransformer(use_idf=False)),
    ('clf', DecisionTreeClassifier(random_state=seed_num,
                                   class_weight={0: 1, 1: 1})),
])

tree_model_fold_1 = dt_tree_pipe.fit(X_train, y_train)

tree.plot_tree(dt_tree_pipe["clf"])

...resulting in the following tree:

screenshot of decision tree visualisation

The first node checks if x[7] is less than or equal to 0.177. How do I find out which word x[7] represents?

I tried the following code but the words returned in the output ("describing" and "the") don't look correct. I would have thought 'cat' and 'dog' would be the two words used to split the data into the positive and negative class.

vect_from_pipe = dt_tree_pipe["vect"]
words = vect_from_pipe.vocabulary_.keys()
print(list(words)[7])
print(list(words)[5])

screenshot of the words 'describing' and 'the'


Solution

  • In scikit-learn, the term you’re looking for is feature names: the names of the columns in the matrix that CountVectorizer produces, one per vocabulary word. The x[7] in the tree refers to column 7 of that matrix.

    In your code, you’re accessing the vocabulary_ attribute of CountVectorizer, which is a dictionary mapping each word (the key) to its column index (the value). When you convert the keys to a list and take element 7 or 5, you get words by their position in the dictionary, which does not necessarily match their column index in the feature matrix.

    To get the feature name (word) corresponding to a particular index, you should use the get_feature_names_out() method of CountVectorizer (available from scikit-learn 1.0; earlier releases use get_feature_names()). This method returns an array of feature names ordered by their corresponding indices in the feature matrix.

    Use this code instead:

    vect_from_pipe = dt_tree_pipe["vect"]
    feature_names = vect_from_pipe.get_feature_names_out()
    print(feature_names[7])
    print(feature_names[5])
    

    This prints the words at column indices 7 and 5 of your feature matrix. Since the root node of your decision tree tests x[7], feature_names[7] is the word used for the first split.