fsel = ske.ExtraTreesClassifier().fit(X, y)
model = SelectFromModel(fsel, prefit=True)
I am trying to train an ExtraTreesClassifier on a data set. How does SelectFromModel() decide the importance value, and what does it return?
As noted in the documentation for SelectFromModel:
threshold : string, float, optional default None
The threshold value to use for feature selection. Features whose importance is greater or equal are kept while the others are discarded. If “median” (resp. “mean”), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., “1.25*mean”) may also be used. If None and if the estimator has a parameter penalty set to l1, either explicitly or implicitly (e.g, Lasso), the threshold used is 1e-5. Otherwise, “mean” is used by default.
In your case threshold is the default value, None, and the mean of the feature_importances_ in your ExtraTreesClassifier will be used as the threshold.
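To make the threshold options from the quoted documentation concrete, here is a minimal sketch that fits the same selector with several threshold settings and reads back the resulting threshold_. The helper name fitted_threshold, the iris data, n_estimators=50 and random_state=0 are assumptions chosen for illustration (the fixed seed only keeps the importances identical across fits); they are not part of the original question.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)


def fitted_threshold(threshold):
    """Hypothetical helper: fit a selector and return (threshold_, importances)."""
    # A fixed random_state makes every fit produce the same importances.
    est = ExtraTreesClassifier(n_estimators=50, random_state=0)
    selector = SelectFromModel(est, threshold=threshold).fit(X, y)
    return selector.threshold_, selector.estimator_.feature_importances_


thr_default, importances = fitted_threshold(None)   # None falls back to "mean" here
thr_median, _ = fitted_threshold("median")          # median of the importances
thr_scaled, _ = fitted_threshold("1.25*mean")       # scaled mean
thr_float, _ = fitted_threshold(0.1)                # explicit numeric cutoff

print(thr_default, importances.mean())
print(thr_median, np.median(importances))
print(thr_scaled, 1.25 * importances.mean())
print(thr_float)
```

Because ExtraTreesClassifier has no penalty parameter, leaving threshold as None should give the same value as passing "mean" explicitly.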
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
iris = load_iris()
X, y = iris.data, iris.target
clf = ExtraTreesClassifier()
model = SelectFromModel(clf)
# repr of model:
# SelectFromModel(estimator=ExtraTreesClassifier(bootstrap=False,
#     class_weight=None, criterion='gini',
#     max_depth=None, max_features='auto', max_leaf_nodes=None,
#     min_impurity_decrease=0.0, min_impurity_split=None,
#     min_samples_leaf=1, min_samples_split=2,
#     min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
#     oob_score=False, random_state=None, verbose=0, warm_start=False),
#     norm_order=1, prefit=False, threshold=None)
model.fit(X, y)
print(model.threshold_)
#0.25
print(model.estimator_.feature_importances_)
#array([0.09790258, 0.02597852, 0.35586554, 0.52025336])
print(model.estimator_.feature_importances_.mean())
#0.25
As you can see, the fitted model is an instance of SelectFromModel with ExtraTreesClassifier() as the estimator. The threshold is 0.25, which is also the mean of the feature importances of the fitted estimator. Based on the feature importances and the threshold, the model keeps only the 3rd and 4th features of the input data (those with an importance greater than or equal to the threshold). You can use the transform method of the fitted SelectFromModel instance to select these features from the input data.
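As a rough sketch of that last step, the snippet below calls transform on a fitted selector and uses get_support() to show which columns were kept. The n_estimators=100 and random_state=0 settings are assumptions added for reproducibility and are not in the original example; the exact importances will differ from the numbers printed above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)

# Fit the selector; the wrapped estimator is cloned and fitted internally.
clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
model = SelectFromModel(clf).fit(X, y)

importances = model.estimator_.feature_importances_
mask = model.get_support()      # boolean mask, True for every kept feature
X_new = model.transform(X)      # array containing only the kept columns

print(importances)
print(model.threshold_)
print(mask)
print(X.shape, "->", X_new.shape)
```

So what SelectFromModel gives back from transform is the input array reduced to the columns whose importance reaches the threshold; get_support() tells you which columns those are. Note that the importances of a tree ensemble are normalized to sum to 1, so with four features the mean threshold is always 0.25.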