
sklearn RidgeCV versus ElasticNetCV


When setting l1_ratio = 0, the elastic net reduces to Ridge regression. However, I am unable to match the results obtained from sklearn's RidgeCV and ElasticNetCV. They appear to produce very different optimal alpha values:

import numpy as np
from sklearn.linear_model import ElasticNetCV, RidgeCV
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

#data generation
np.random.seed(123)
beta = 0.35
N = 120
p = 30

X = np.random.normal(1, 2, (N, p))
y = np.random.normal(5, size=N) + beta * X[:, 0]
#lambdas to try:
l = np.exp(np.linspace(-2, 8, 80))

ridge1 = RidgeCV(alphas=l, store_cv_values=True).fit(X, y)
MSE_cv = np.mean(ridge1.cv_values_, axis=0)
y_pred = ridge1.predict(X=X)
MSE = mean_squared_error(y_true=y, y_pred=y_pred)

print(f"best alpha: {np.round(ridge1.alpha_,3)}")
print(f"MSE: {np.round(MSE,3)}")

which yields best alpha: 305.368, MSE: 0.952

While ElasticNetCV ends up with a similar MSE, its penalty parameters seem to be on a different scale (they actually agree with the R implementation):

ridge2 = ElasticNetCV(cv=10, alphas=l, random_state=0, l1_ratio=0)
ridge2.fit(X, y)
y_pred = ridge2.predict(X=X)
MSE = mean_squared_error(y_true=y, y_pred=y_pred)

print(f"best alpha: {np.round(ridge2.alpha_,3)}")
print(f"MSE: {np.round(MSE,3)}")

yielding best alpha: 2.192, MSE: 0.934

Are the penalties defined differently? Does one maybe divide by N? Or is it due to the very different cross-validation strategies?


Solution

  • Are the penalties defined differently? Does one maybe divide by N?

    Yes, that's the cause of the discrepancy. In elastic net the regularisation term of the cost function is effectively scaled up by the number of samples relative to the squared-error term (the error term is divided by n_samples); this is not the case for RidgeCV. To make the cost functions equivalent, we therefore need to divide the ElasticNetCV alphas by the size of the training fold (a quick standalone check of this scaling is included just before the full code below).

  • Is it due to the very different cross-validation strategies?

    RidgeCV uses an efficient internal leave-one-out (LOO) scheme by default, whereas ElasticNetCV defaults to 5-fold CV. We can control for the CV scheme by setting cv=LeaveOneOut() in ElasticNetCV, so that both models use LOO.

    When using cv=LeaveOneOut() with ElasticNet, the training fold size is n_samples-1, so that's what we need to scale by.

    To get easy access to the CV score for each alpha (for comparison with RidgeCV's cv_values_), I also added a 'manual' version of ElasticNetCV, combining ElasticNet with GridSearchCV using LOO.

    After applying the requisite scaling, the results line up:

    RidgeCV
      best alpha: 305.368 <---
      MSE: 0.952
      Effective dof (using n_samples - 1): 18.21
    
    ElasticNetCV using LOO
      best alpha | scaled: 2.566 | unscaled: 305.368 <---
      MSE: 0.953
      Effective dof (using n_samples): 18.16
    
    ElasticNet + GridSearchCV using LOO
      best alpha | scaled: 2.566 | unscaled: 305.368 <---
      MSE: 0.953
      Effective dof (using n_samples): 18.16
    

    [plot: CV MSE versus alpha for RidgeCV and ElasticNet(L1≈0) + LOO CV]

    The calculation of effective dof is my best guess at how it's done, referencing the supplied links (see comments in code).

    According to https://scikit-learn.org/stable/glossary.html#term-cross-validation-estimator, RidgeCV with LOO CV, unlike other CV estimators, does not refit on the entire dataset (but rather refits on n_samples-1 due to LOO). This is why its effective dof is slightly different from the other estimators that do refit on the entire dataset (n_samples).

    cross-validation estimator An estimator that has built-in cross-validation capabilities to automatically select the best hyper-parameters [...] By default, all these estimators, apart from RidgeCV with an LOO-CV, will be refitted on the full training dataset after finding the best combination of hyper-parameters.
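
    As a quick standalone check of the scaling claim (a minimal sketch on the same simulated data; the penalty value 300 is arbitrary and chosen only for illustration), Ridge(alpha=a) and ElasticNet(alpha=a / n_samples, l1_ratio≈0) fitted on the full data should return essentially the same coefficients, because ElasticNet divides its squared-error term by n_samples while Ridge does not:

    import numpy as np
    from sklearn.linear_model import Ridge, ElasticNet

    np.random.seed(123)
    N, p, beta = 120, 30, 0.35
    X = np.random.normal(1, 2, (N, p))
    y = np.random.normal(5, size=N) + beta * X[:, 0]

    a = 300.0  # arbitrary Ridge-scale penalty, for illustration only
    ridge = Ridge(alpha=a).fit(X, y)
    # ElasticNet's squared-error term is divided by n_samples, so its alpha must be too;
    # l1_ratio is a tiny value rather than exactly 0 to keep coordinate descent happy
    enet = ElasticNet(alpha=a / N, l1_ratio=1e-10, max_iter=100_000, tol=1e-10).fit(X, y)

    # should be ~0 (up to solver tolerance), confirming the alpha / n_samples correspondence
    print(np.abs(ridge.coef_ - enet.coef_).max())

    Inside LOO CV the models are fitted on n_samples - 1 points, which is why the full script below divides the alphas by len(X) - 1 rather than len(X).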


    import numpy as np
    from sklearn.linear_model import ElasticNetCV, RidgeCV
    from sklearn.metrics import mean_squared_error
    import matplotlib.pyplot as plt
    np.set_printoptions(suppress=True)
    
    #data generation
    np.random.seed(123)
    beta = 0.35
    N = 120
    p = 30
    
    X = np.random.normal(1, 2, (N, p))
    y = np.random.normal(5, size=N) + beta * X[:, 0]
    
    #lambdas to try:
    alphas = np.exp(np.linspace(-2, 8, 80))
    #
    # RidgeCV with internal efficient LOO CV
    #
    ridge1 = RidgeCV(alphas=alphas, store_cv_values=True).fit(X, y)
    MSE_cv = np.mean(ridge1.cv_values_, axis=0)
    y_pred = ridge1.predict(X=X)
    MSE = mean_squared_error(y_true=y, y_pred=y_pred)
    
    print('RidgeCV')
    print("  best alpha:", round(ridge1.alpha_, 3), '<---')
    print("  MSE:", round(MSE, 3))
    
    #My understanding of how to calculate the effective dof
    # H_ridge = X⋅(X'⋅X +λ⋅I)**-1 ⋅ X'
    #  https://tommasorigon.github.io/datamining/slides/un_C.html#shrinkage-effect-of-ridge-regression-i
    #
    #Effective dof = 1 + tr(H_ridge)
    # https://tommasorigon.github.io/datamining/slides/un_C.html#effective-degrees-of-freedom-ii
    
    #According to https://scikit-learn.org/stable/glossary.html#term-cross-validation-estimator
    # RidgeCV with LOO CV, unlike other CV estimators, does not refit on the *entire* dataset
    # (but rather refits on n_samples-1 due to LOO). This is why its effective dof
    # is marginally different from the other estimators below that do refit on the
    # entire dataset (n_samples).
    H_ridge1 = X @ np.linalg.pinv(X.T @ X + ridge1.alpha_ * np.eye(p)) @ X.T
    print('  Effective dof (using n_samples - 1):', 1 + H_ridge1.trace().round(2))
    
    #
    # ElasticNetCV with LOO CV
    #
    from sklearn.model_selection import LeaveOneOut
    
    fold_size = len(X) - 1
    alphas_scaled = alphas / fold_size #for equivalent cost func to RidgeCV
    
    ridge2 = ElasticNetCV(
        cv=LeaveOneOut(), alphas=alphas_scaled,
        l1_ratio=1e-10, random_state=0
    )
    ridge2.fit(X, y)
    y_pred = ridge2.predict(X=X)
    MSE = mean_squared_error(y_true=y, y_pred=y_pred)
    
    print('\nElasticNetCV using LOO')
    print(
        '  best alpha | scaled:', round(ridge2.alpha_, 3),
        '| unscaled:', round(ridge2.alpha_ * fold_size, 3), '<---'
    )
    print("  MSE:", round(MSE, 3))
    
    #See H_ridge1 calculation for references
    lambda2 = ridge2.alpha_ * len(X)
    H_ridge2 = X @ np.linalg.pinv(X.T @ X + lambda2 * np.eye(p)) @ X.T
    print('  Effective dof (using n_samples):', 1 + H_ridge2.trace().round(2))
    
    #
    # "Manual" version of ElasticNetCV
    # i.e. ElasticNet + GridSearchCV using LOO
    # Used for getting CV score for each alpha
    #
    from sklearn.linear_model import ElasticNet
    from sklearn.model_selection import GridSearchCV
    
    search = GridSearchCV(
        ElasticNet(l1_ratio=1e-10),
        param_grid=dict(alpha=list(alphas / fold_size)),
        scoring='neg_mean_squared_error',
        cv=LeaveOneOut(),
        n_jobs=-1
    ).fit(X, y)
    
    ridge3 = search.best_estimator_
    y_pred = ridge3.predict(X)
    MSE = mean_squared_error(y, y_pred)
    
    print('\nElasticNet + GridSearchCV using LOO')
    print(
        "  best alpha | scaled:", round(ridge3.alpha, 3),
        "| unscaled:", round(ridge3.alpha * fold_size, 3), '<---'
    )
    print("  MSE:", round(MSE, 3))
    
    #See H_ridge1 calculation for references
    lambda3 = ridge3.alpha * len(X)
    H_ridge3 = X @ np.linalg.pinv(X.T @ X + lambda3 * np.eye(p)) @ X.T
    print('  Effective dof (using n_samples):', 1 + H_ridge3.trace().round(2))
    
    #
    # Plot CV scores vs alpha
    #
    plt.scatter(
        alphas, ridge1.cv_values_.mean(axis=0),
        color='black', marker='s', s=50, label='RidgeCV'
    )
    plt.scatter(
        alphas, -search.cv_results_['mean_test_score'],
        color='gray', marker='.', label=r'ElasticNet(L1$\approx$0) + LOO CV'
    )
    plt.legend(fontsize=10)
    plt.xlabel(r'$\alpha$')
    plt.ylabel('CV MSE')
    plt.gcf().set_size_inches(8, 3)
    
    #Extra formatting
    for s in ['right', 'top']:
        plt.gca().spines[s].set_visible(False)
    plt.gca().spines['bottom'].set_bounds(0, 3000)
    plt.gca().spines['left'].set_bounds(1.35, 1.5)
    plt.show()