
How does SelectFromModel() work from from_model.py?


fsel = ske.ExtraTreesClassifier().fit(X, y)

model = SelectFromModel(fsel, prefit=True)

I am trying to train an ExtraTreesClassifier on a data set. How does SelectFromModel() decide the importance threshold, and what does it return?


Solution

  • As noted in the documentation for SelectFromModel:

    threshold : string, float, optional default None

    The threshold value to use for feature selection. Features whose importance is greater or equal are kept while the others are discarded. If “median” (resp. “mean”), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., “1.25*mean”) may also be used. If None and if the estimator has a parameter penalty set to l1, either explicitly or implicitly (e.g, Lasso), the threshold used is 1e-5. Otherwise, “mean” is used by default.

    In your case threshold is the default value, None, and the mean of the feature_importances_ in your ExtraTreesClassifier will be used as the threshold.
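
    To make the threshold rules quoted above concrete, here is a minimal sketch that fits the same kind of selector with several threshold settings and records the resulting numeric threshold and feature mask. The iris data, n_estimators=50, and random_state=0 are illustrative assumptions, not part of the question.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)

results = {}
for threshold in (None, "mean", "median", "1.25*mean", 0.1):
    # Same random_state every time, so the importances are identical
    # across iterations and only the threshold rule changes.
    clf = ExtraTreesClassifier(n_estimators=50, random_state=0)
    selector = SelectFromModel(clf, threshold=threshold).fit(X, y)

    importances = selector.estimator_.feature_importances_
    results[threshold] = (selector.threshold_, selector.get_support())
    print(threshold, selector.threshold_, selector.get_support())
```

    For a tree ensemble, None and "mean" should give the same threshold, and a feature is kept whenever its importance is greater than or equal to that threshold.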

    Example

    from sklearn.datasets import load_iris
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.feature_selection import SelectFromModel
    
    iris = load_iris()
    X, y  = iris.data, iris.target
    clf = ExtraTreesClassifier()
    model = SelectFromModel(clf)
    model.fit(X, y)
    # fit() returns the fitted selector; its repr, as printed by the
    # scikit-learn version used here, was:
    # SelectFromModel(estimator=ExtraTreesClassifier(bootstrap=False,
    #             class_weight=None, criterion='gini',
    #             max_depth=None, max_features='auto', max_leaf_nodes=None,
    #             min_impurity_decrease=0.0, min_impurity_split=None,
    #             min_samples_leaf=1, min_samples_split=2,
    #             min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
    #             oob_score=False, random_state=None, verbose=0, warm_start=False),
    #         norm_order=1, prefit=False, threshold=None)
    print(model.threshold_)
    # 0.25
    print(model.estimator_.feature_importances_)
    # [0.09790258 0.02597852 0.35586554 0.52025336]  (exact values vary between runs, since random_state is not set)
    print(model.estimator_.feature_importances_.mean())
    # 0.25
    

    As you can see, the fitted model is an instance of SelectFromModel with ExtraTreesClassifier() as the estimator. The threshold is 0.25, which is also the mean of the feature importances of the fitted estimator (the importances sum to 1, so with four features the mean is 1/4). Based on these feature importances and this threshold, the model keeps only the 3rd and 4th features of the input data, i.e. those whose importance is greater than or equal to the threshold. SelectFromModel() itself returns a transformer rather than data: call the transform method of the fitted SelectFromModel to get the input data reduced to the selected features.
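
    As a sketch of that last step, using the prefit=True pattern from the question: the snippet below fits the classifier first, wraps it in SelectFromModel, and checks that transform returns exactly the columns whose importance reaches the mean. The iris data, n_estimators=50, and random_state=0 are assumptions made for illustration, and ske in the question is assumed to be an alias for sklearn.ensemble.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)

# Fit the estimator up front, as in the question.
fsel = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X, y)

# prefit=True tells SelectFromModel to use fsel as-is instead of refitting it.
model = SelectFromModel(fsel, prefit=True)

# transform() returns X restricted to the selected columns.
X_new = model.transform(X)

# Reproduce the default rule by hand: keep importance >= mean importance.
importances = fsel.feature_importances_
mask = importances >= importances.mean()

print(X.shape, "->", X_new.shape)
print("kept feature indices:", np.flatnonzero(mask))
```

    The result is a NumPy array with the same number of rows as X and one column per kept feature.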