Tags: python, machine-learning, scikit-learn, feature-extraction, scikit-learn-pipeline

Get feature names from scikit-learn pipelines


I am working on an ML regression problem and have defined the pipelines below, based on an online tutorial:

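from sklearn import linear_model
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Each pipeline expands the inputs into polynomial features, then fits a linear model.
pipe1 = Pipeline([('poly', PolynomialFeatures()),
                  ('fit', linear_model.LinearRegression())])
pipe2 = Pipeline([('poly', PolynomialFeatures()),
                  ('fit', linear_model.Lasso())])
pipe3 = Pipeline([('poly', PolynomialFeatures()),
                  ('fit', linear_model.Ridge())])
pipe4 = Pipeline([('poly', PolynomialFeatures()),
                  ('fit', linear_model.TweedieRegressor())])

# lasso_params, ridge_params and tweedie_params are parameter grids defined elsewhere.
models3 = {'OLS': pipe1,
           'Lasso': GridSearchCV(pipe2,
                                 param_grid=lasso_params).fit(X_train, y_train).best_estimator_,
           'Ridge': GridSearchCV(pipe3,
                                 param_grid=ridge_params).fit(X_train, y_train).best_estimator_,
           'Tweedie': GridSearchCV(pipe4,
                                   param_grid=tweedie_params).fit(X_train, y_train).best_estimator_}
test(models3, df)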

While the above code worked fine and gave me the results, how can I get the list of polynomial features that were created?

Or how can I view them in a DataFrame?


Solution

  • You can use the transform method of the PolynomialFeatures step to generate the polynomial feature matrix. To do so, first access that step in the fitted pipeline; here it is the first step, at index 0 (models3['Lasso'].named_steps['poly'] is equivalent). This is how to get the polynomial feature array for the Lasso pipeline (the best estimator found by the grid search over pipe2):

    feature_matrix = models3['Lasso'][0].transform(X_train)
    

    Furthermore, if you want a DataFrame labelled with the feature names, use the step's get_feature_names_out method (available in scikit-learn 1.0 and later; older versions used get_feature_names):

    feature_names = models3['Lasso'][0].get_feature_names_out()
    feature_df = pd.DataFrame(feature_matrix, columns=feature_names)