machine-learning, scikit-learn, gridsearchcv, lasso-regression, scikit-learn-pipeline

LassoCV: getting "axis -1 is out of bounds for array of dimension 0", and other questions


Good evening to all,

I am trying to use LassoCV with sklearn for the first time.

My code is as follows:

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LassoCV
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

numeric_features = ['AGE_2019', 'Inhabitants']
categorical_features = ['familty_type', 'studying', 'Job_42', 'sex', 'DEGREE', 'Activity_type', 'Nom de la commune', 'city_type', 'DEP', 'INSEE', 'Nom du département', 'reg', 'Nom de la région']

numeric_transformer = Pipeline(steps=[
       ('imputer', SimpleImputer(strategy='median'))
      ,('scaler', MinMaxScaler())  # Scale the numeric features
])

categorical_transformer = Pipeline(steps=[
       ('imputer', SimpleImputer(strategy='constant', fill_value='missing'))
      ,('encoder', OneHotEncoder(handle_unknown='ignore'))  # Create binary variables for the categorical features
])

preprocessor = ColumnTransformer(transformers=[
     ('numeric', numeric_transformer, numeric_features)
    ,('categorical', categorical_transformer, categorical_features)
])

# Creation of the pipeline 

lassocv_piped = Pipeline([
    ('preprocessor', preprocessor),
    ('model', LassoCV())
    ])

# Creation of the grid of parameters

dt_params = {'model__alphas': np.array([0.5])}

cv_folds = KFold(n_splits=5, shuffle=True, random_state=0)

lassocv_grid_piped = GridSearchCV(lassocv_piped, dt_params, cv=cv_folds, n_jobs=-1, scoring=['neg_mean_squared_error', 'r2'], refit='r2')  
# Fitting our model

lassocv_grid_piped.fit(df_X_train,df_Y_train.values.ravel())

# Getting our metrics and predictions

Y_pred_lassocv = lassocv_grid_piped.predict(df_X_test)

metrics_lassocv = lassocv_grid_piped.cv_results_
best_lassocv_parameters = lassocv_grid_piped.best_params_


print('Best test negative MSE of the base model: ', max(metrics_lassocv['mean_test_neg_mean_squared_error']))
print('Best test R^2 of the base model: ', max(metrics_lassocv['mean_test_r2']))
print('Best parameters of the base model: ', best_lassocv_parameters)

# Graphical representation

results = pd.DataFrame(dt_params)
for k in range(5):
    results = pd.concat([results,
                         pd.DataFrame(lassocv_grid_piped.cv_results_['split'+str(k)+'_test_neg_mean_squared_error'])], axis=1)

sns.relplot(data=results.melt('model__alphas', value_name='neg_mean_squared_error'),
            x='model__alphas', y='neg_mean_squared_error', kind='line')

I am still a novice when it comes to using this estimator, so I have some questions about its use:

  • Is it useful to use a cv_folds object outside the estimator, as I do?
  • Is it useful to set up a GridSearchCV to test the different alpha values?
  • How is it possible to extract the R^2 from our model?

Also, I encounter this error:

AxisError: axis -1 is out of bounds for array of dimension 0

Would you have any idea how to solve it?

I wish you a good evening!


Solution

  • After a good night's sleep, I was able to overcome some of my problems.

    Is it useful to use a cv_folds object outside the estimator, as I do?

    After studying the documentation of LassoCV a bit, it seems not, so I could remove cv_folds from my code. Instead, I could use the cv argument of LassoCV, as in the sketch below.
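
    A minimal sketch of that idea, reusing the preprocessor and training data from above (I use cv=5 here; the cv argument also accepts a KFold splitter directly):

    lassocv_piped = Pipeline([
        ('preprocessor', preprocessor),
        ('model', LassoCV(cv=5))  # LassoCV runs its own 5-fold cross-validation
    ])
    lassocv_piped.fit(df_X_train, df_Y_train.values.ravel())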

    Is it useful to set up a GridSearchCV to test the different alpha values?

    I haven't really been able to answer that question yet. It seems that LassoCV does it by itself.
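
    For instance, here is a sketch of letting LassoCV search over an explicit alpha grid on its own, without GridSearchCV (the grid of 50 values below is only an illustrative choice, not something from my original code):

    lassocv_piped = Pipeline([
        ('preprocessor', preprocessor),
        ('model', LassoCV(alphas=np.logspace(-4, 1, 50), cv=5))
    ])
    lassocv_piped.fit(df_X_train, df_Y_train.values.ravel())

    # Alpha selected by the internal cross-validation
    print(lassocv_piped['model'].alpha_)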

    How is it possible to extract the R^2 from our model?

    This can be done simply with the .score(X, y) method.
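
    For example, on the held-out data (assuming a df_Y_test target matching df_X_test, which does not appear in my code above):

    r2_test = lassocv_piped.score(df_X_test, df_Y_test.values.ravel())  # df_Y_test is assumed to exist; returns the R^2 on the test set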

    As for my error message, I was able to get rid of it once I deleted GridSearchCV.
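
    My guess, and it is only a guess I have not verified against the full traceback, is that GridSearchCV iterates over the values in dt_params, so 'model__alphas': np.array([0.5]) makes it hand the bare scalar 0.5 to LassoCV, whose alphas argument expects an array-like; the resulting 0-dimensional array would then raise the AxisError. If someone wants to keep GridSearchCV, wrapping each candidate in its own list should avoid that:

    dt_params = {'model__alphas': [[0.5], [1.0], [2.0]]}  # each candidate value is itself a list, so LassoCV receives an array-like of alphas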

    Here's my final code:

    numeric_features = ['AGE_2019', 'Inhabitants']
    categorical_features = ['familty_type','studying','Job_42','sex','DEGREE', 'Activity_type', 'Nom de la commune', 'city_type', 'DEP', 'INSEE', 'Nom du département', 'reg', 'Nom de la région']
        
    numeric_transformer = Pipeline(steps=[
           ('imputer', SimpleImputer(strategy='median'))
          ,('scaler', MinMaxScaler())  # Scale the numeric features
    ])
    
    categorical_transformer = Pipeline(steps=[
           ('imputer', SimpleImputer(strategy='constant', fill_value='missing'))
          ,('encoder', OneHotEncoder(handle_unknown='ignore'))  # Create binary variables for the categorical features
    ])
    
    preprocessor = ColumnTransformer(
       transformers=[
        ('numeric', numeric_transformer, numeric_features)
       ,('categorical', categorical_transformer, categorical_features)
    ]) 
    
    # Creation of the pipeline
    list_metrics_lassocv = []
    list_best_lassocv_parameters = []
    
    for i in range(1, 12):
        lassocv_piped = Pipeline([
            ('preprocessor', preprocessor),
            ('model', LassoCV(cv=5, n_alphas=i, random_state=0))
        ])
    
        # Fitting our model
        lassocv_piped.fit(df_X_train, df_Y_train.values.ravel())
    
        # Getting our metrics and predictions
        Y_pred_lassocv = lassocv_piped.predict(df_X_test)
    
        metrics_lassocv = lassocv_piped.score(df_X_train, df_Y_train.values.ravel())
        best_lassocv_parameters = lassocv_piped['model'].alpha_
    
        list_metrics_lassocv.append(metrics_lassocv)
        list_best_lassocv_parameters.append(best_lassocv_parameters)
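
    As a quick check of the loop's output, the collected scores and selected alphas can be lined up in a small table (just a sketch, reusing the lists built above):

    results_lassocv = pd.DataFrame({
        'n_alphas': list(range(1, 12)),
        'train_R2': list_metrics_lassocv,
        'best_alpha': list_best_lassocv_parameters,
    })
    print(results_lassocv)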
     
    

    Do not hesitate to correct me if you see an imprecision or an error.