numpy · scikit-learn · tf-idf

Scikit-Learn's feature_names_in_ Attribute


A number of scikit-learn's classes have a feature_names_in_ attribute, which would be a real time saver if I could understand it better. Specifically, assume your X is a nested list of strings, [['A', 'B', 'C'], ['A', 'B', 'D']], and your y is a list of labels, ['Blue', 'Green']. Now assume you are doing feature selection with, for example, scikit-learn's SelectKBest class, using the chi2 univariate approach and asking for the top 2 features (i.e., k=2), which gives you your k_best_object.

That k_best_object has an attribute called feature_names_in_, which would be really helpful if it returned the "names" of the top 2 features. The problem is that the documentation says this attribute is only defined when the features are all strings. That would be fine, except that I haven't been able to get SelectKBest (or other scikit-learn classes) to work on strings. Instead, I have only been able to get them to work by converting the X values into a numeric matrix with CountVectorizer or TfidfVectorizer.

So my question is: how would this attribute ever be used? If it is only defined when all X input values are strings, but the only X the class will accept is numeric, how does this attribute ever apply?


To illustrate with code, if you try this:

from sklearn.feature_selection import SelectKBest, chi2

X_t = [['Land', 'Building', 'Cat'], ['Land', 'Building', 'Dog']]
y_t = ['Blue', 'Green']
chi_select_object_test = SelectKBest(chi2, k=100)
chi_select_object_test.fit(X_t, y_t)

It won't work, because the data consists of strings rather than numbers. You get this error: ValueError: dtype='numeric' is not compatible with arrays of bytes/strings. Convert your data to numeric values explicitly instead.

But if you convert X_t to numbers using, for example, TfidfVectorizer(), the class will work:


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

X_t = ['Land Building Camp', 'Land Building Dog']
tfidfvectorizer_t = TfidfVectorizer(analyzer='word', stop_words='english')
X_t = tfidfvectorizer_t.fit_transform(X_t)
y_t = ['Blue', 'Green']
chi_select_object_test = SelectKBest(chi2, k=1)
chi_select_object_test.fit(X_t, y_t)

But then, when you try to access the feature names attribute:

chi_select_object_test.feature_names_in_

You receive an AttributeError: 'SelectKBest' object has no attribute 'feature_names_in_'
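
For what it's worth, I can recover the names of the selected columns by combining the vectorizer with the selector's support mask (assuming scikit-learn >= 1.0, where get_feature_names_out exists), so the information I want is available; it just doesn't come from feature_names_in_:

# Continuing from the fitted tfidfvectorizer_t and chi_select_object_test above
all_names = tfidfvectorizer_t.get_feature_names_out()  # e.g. ['building', 'camp', 'dog', 'land']
mask = chi_select_object_test.get_support()            # boolean mask marking the k best columns
print(all_names[mask])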


Solution

  • I believe what you want to do is pass a pandas.DataFrame to SelectKBest. The DataFrame's column names then become the feature names. In the end, you can get the best features according to the metric you passed by using get_feature_names_out.

    In my silly example I generate a DataFrame with 3 random columns and tell SelectKBest that I want to predict whether the 3rd column is bigger than 0.5. Obviously we then expect it to give us the 3rd column.

    import pandas as pd
    import numpy as np
    from sklearn.feature_selection import SelectKBest, chi2
    
    # 100 rows of uniform random values in three named columns
    df = pd.DataFrame(np.random.random((100, 3)), columns=['a', 'b', 'c'])
    
    # The target is simply whether column 'c' exceeds 0.5
    selector = SelectKBest(chi2, k=1)
    selector.fit(df, df['c'] > 0.5)
    
    selector.get_feature_names_out()
    

    and indeed it returns

    array(['c'], dtype=object)
    

    Finally, feature_names_in_ is now set to array(['a', 'b', 'c'], dtype=object), since those are the names of the features we put into the feature selector.
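
    To tie this back to the TF-IDF workflow in the question: if you wrap the vectorizer's output in a DataFrame whose columns are the vectorizer's feature names, the selector picks those names up. A minimal sketch, assuming scikit-learn >= 1.0 (the variable names are mine, and .toarray() densifies the matrix, so this is only sensible for small corpora):

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    
    X_t = ['Land Building Camp', 'Land Building Dog']
    y_t = ['Blue', 'Green']
    
    vectorizer = TfidfVectorizer(analyzer='word', stop_words='english')
    X_vec = vectorizer.fit_transform(X_t)
    
    # Give the numeric matrix string column names taken from the vectorizer
    X_df = pd.DataFrame(X_vec.toarray(),
                        columns=vectorizer.get_feature_names_out())
    
    selector = SelectKBest(chi2, k=1)
    selector.fit(X_df, y_t)
    
    selector.feature_names_in_        # all input names, e.g. ['building', 'camp', 'dog', 'land']
    selector.get_feature_names_out()  # the k best of those names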