
How to get features selected using Boruta in a Pandas Dataframe with headers


I have the following Boruta code, and I want to get the selected features as a pandas DataFrame that keeps the column names.

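import numpy as np
import pandas as pd
from boruta import BorutaPy
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=42)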

# let's initialize Boruta
feat_selector = BorutaPy(
    verbose=2,
    estimator=model,
    n_estimators='auto',
    max_iter=10,  # number of iterations to perform
    random_state=42,
)

# train Boruta
# N.B.: X and y must be numpy arrays
feat_selector.fit(np.array(X), np.array(y))

# print support and ranking for each feature
print("\n------Support and Ranking for each feature------\n")
for i in range(len(feat_selector.support_)):
    if feat_selector.support_[i]:
        print("Passes the test: ", X.columns[i],
              " - Ranking: ", feat_selector.ranking_[i], "✔️")
    else:
        print("Doesn't pass the test: ",
              X.columns[i], " - Ranking: ", feat_selector.ranking_[i], "❌")

# features selected by Boruta
X_filtered = feat_selector.transform(np.array(X))

The selected columns and the filtered array look like this:

X.columns[feat_selector.support_]
Index(['J80', 'J100', 'J160', 'J200', 'J250'], dtype='object')

X_filtered
array([[12.73363   ,  8.518314  ,  5.2625847 , ...,  0.06733382]])

How do I get the result as a pandas DataFrame with the column headers? Right now I have up to 25 headers.


Solution

  • Since support_ is a boolean mask over the input columns, you can use it to index X.columns and build a new DataFrame from the transformed array.

    X_filtered = pd.DataFrame(
        feat_selector.transform(X.values), 
        columns=X.columns[feat_selector.support_]
    )
    

    Alternatively, with the latest master version of BorutaPy, you can pass a DataFrame to transform() and set return_df=True to get a DataFrame back. That would look like:

    X_filtered = feat_selector.transform(X, return_df=True)
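
If you want to try the mask-based approach without fitting Boruta, here is a minimal sketch. The toy frame and the `support` array are hypothetical stand-ins for your `X` and for `feat_selector.support_`, and slicing `X.values` with the mask is assumed to give the same columns that `transform()` keeps. It also passes `index=X.index`, which is an addition to the snippet above so that the original row labels are preserved.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the asker's X (column names are illustrative only).
X = pd.DataFrame(
    {"J80": [1.0, 2.0, 3.0], "J90": [4.0, 5.0, 6.0], "J100": [7.0, 8.0, 9.0]},
    index=[10, 11, 12],
)

# Hypothetical stand-in for feat_selector.support_:
# a boolean array with one entry per column of X.
support = np.array([True, False, True])

# Stand-in for feat_selector.transform(X.values): keep only the supported columns.
filtered_values = X.values[:, support]

# Rebuild a DataFrame, reusing the mask for the headers and keeping the row index.
X_filtered = pd.DataFrame(
    filtered_values,
    columns=X.columns[support],
    index=X.index,
)

# The same selection done directly in pandas, without going through NumPy.
X_direct = X.loc[:, support]

print(X_filtered)
```

If the mask really is all you need, `X.loc[:, feat_selector.support_]` is the shortest route, since it never leaves pandas and so keeps both the headers and the index.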