
sklearn RidgeCV versus ElasticNetCV


When setting l1_ratio = 0, the elastic net reduces to Ridge regression. However, I am unable to match the results obtained from sklearn's RidgeCV and ElasticNetCV. They appear to produce very different optimal alpha values:

import numpy as np
from sklearn.linear_model import ElasticNetCV, RidgeCV
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

#data generation
np.random.seed(123)
beta = 0.35
N = 120
p = 30

X = np.random.normal(1, 2, (N, p))
y = np.random.normal(5, size=N) + beta * X[:, 0]
#lambdas to try:
l = np.exp(np.linspace(-2, 8, 80))

ridge1 = RidgeCV(alphas=l, store_cv_values=True).fit(X, y)
MSE_cv = np.mean(ridge1.cv_values_, axis=0)
y_pred = ridge1.predict(X=X)
MSE = mean_squared_error(y_true=y, y_pred=y_pred)

print(f"best alpha: {np.round(ridge1.alpha_,3)}")
print(f"MSE: {np.round(MSE,3)}")

which yields best alpha: 305.368, MSE: 0.952

While ElasticNetCV ends up with a similar MSE, its penalty parameters seem to be on a different scale (they actually agree with the R implementation):

ridge2 = ElasticNetCV(cv=10, alphas=l, random_state=0, l1_ratio=0)
ridge2.fit(X, y)
y_pred = ridge2.predict(X=X)
MSE = mean_squared_error(y_true=y, y_pred=y_pred)

print(f"best alpha: {np.round(ridge2.alpha_,3)}")
print(f"MSE: {np.round(MSE,3)}")

yielding best alpha: 2.192, MSE: 0.934

Are the penalties defined differently? Does one maybe divide by N? Or is it due to the very different cross-validation strategies?


Solution

  • Are the penalties defined differently? Does one maybe divide by N?

    Yes, that's the cause of the discrepancy. In elastic net the regularisation term of the cost function is effectively scaled up by the number of samples relative to the squared-error term (the error term is divided by n_samples); this is not the case for RidgeCV. To make the cost functions equivalent, we therefore need to divide the ElasticNetCV alphas by the size of the training fold (a quick standalone check of this scaling is included just before the full code below).

  • Is it due to the very different cross-validation strategies?

    RidgeCV uses an efficient internal leave-one-out (LOO) scheme by default, whereas ElasticNetCV defaults to 5-fold CV. We can control for the CV scheme by setting cv=LeaveOneOut() in ElasticNetCV, so that both models use LOO.

    When using cv=LeaveOneOut() with ElasticNet, the training fold size is n_samples-1, so that's what we need to scale by.

    To get easy access to the CV score for each alpha (for comparison with RidgeCV's cv_values_), I also added a 'manual' version of ElasticNetCV, combining ElasticNet with GridSearchCV using LOO.

    After applying the requisite scaling, the results line up:

    RidgeCV
      best alpha: 305.368 <---
      MSE: 0.952
      Effective dof (using n_samples - 1): 18.21
    
    ElasticNetCV using LOO
      best alpha | scaled: 2.566 | unscaled: 305.368 <---
      MSE: 0.953
      Effective dof (using n_samples): 18.16
    
    ElasticNet + GridSearchCV using LOO
      best alpha | scaled: 2.566 | unscaled: 305.368 <---
      MSE: 0.953
      Effective dof (using n_samples): 18.16
    

    [plot: CV MSE versus alpha for RidgeCV and ElasticNet(L1≈0) + LOO CV]

    The calculation of effective dof is my best guess at how it's done, referencing the supplied links (see comments in code).

    According to https://scikit-learn.org/stable/glossary.html#term-cross-validation-estimator, RidgeCV with LOO CV, unlike other CV estimators, does not refit on the entire dataset (but rather refits on n_samples-1 due to LOO). This is why its effective dof is slightly different from the other estimators that do refit on the entire dataset (n_samples).

    cross-validation estimator An estimator that has built-in cross-validation capabilities to automatically select the best hyper-parameters [...] By default, all these estimators, apart from RidgeCV with an LOO-CV, will be refitted on the full training dataset after finding the best combination of hyper-parameters.
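
    As a quick standalone check of the scaling claim (a minimal sketch on the same simulated data; the penalty value 300 is arbitrary and chosen only for illustration), Ridge(alpha=a) and ElasticNet(alpha=a / n_samples, l1_ratio≈0) fitted on the full data should return essentially the same coefficients, because ElasticNet divides its squared-error term by n_samples while Ridge does not:

    import numpy as np
    from sklearn.linear_model import Ridge, ElasticNet

    np.random.seed(123)
    N, p, beta = 120, 30, 0.35
    X = np.random.normal(1, 2, (N, p))
    y = np.random.normal(5, size=N) + beta * X[:, 0]

    a = 300.0  # arbitrary Ridge-scale penalty, for illustration only
    ridge = Ridge(alpha=a).fit(X, y)
    # ElasticNet's squared-error term is divided by n_samples, so its alpha must be too;
    # l1_ratio is a tiny value rather than exactly 0 to keep coordinate descent happy
    enet = ElasticNet(alpha=a / N, l1_ratio=1e-10, max_iter=100_000, tol=1e-10).fit(X, y)

    # should be ~0 (up to solver tolerance), confirming the alpha / n_samples correspondence
    print(np.abs(ridge.coef_ - enet.coef_).max())

    Inside LOO CV the models are fitted on n_samples - 1 points, which is why the full script below divides the alphas by len(X) - 1 rather than len(X).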


    import numpy as np
    from sklearn.linear_model import ElasticNetCV, RidgeCV
    from sklearn.metrics import mean_squared_error
    import matplotlib.pyplot as plt
    np.set_printoptions(suppress=True)
    
    #data generation
    np.random.seed(123)
    beta = 0.35
    N = 120
    p = 30
    
    X = np.random.normal(1, 2, (N, p))
    y = np.random.normal(5, size=N) + beta * X[:, 0]
    
    #lambdas to try:
    alphas = np.exp(np.linspace(-2, 8, 80))
    #
    # RidgeCV with internal efficient LOO CV
    #
    ridge1 = RidgeCV(alphas=alphas, store_cv_values=True).fit(X, y)
    MSE_cv = np.mean(ridge1.cv_values_, axis=0)
    y_pred = ridge1.predict(X=X)
    MSE = mean_squared_error(y_true=y, y_pred=y_pred)
    
    print('RidgeCV')
    print("  best alpha:", round(ridge1.alpha_, 3), '<---')
    print("  MSE:", round(MSE, 3))
    
    #My understanding of how to calculate the effective dof
    # H_ridge = X⋅(X'⋅X +λ⋅I)**-1 ⋅ X'
    #  https://tommasorigon.github.io/datamining/slides/un_C.html#shrinkage-effect-of-ridge-regression-i
    #
    #Effective dof = 1 + tr(H_ridge)
    # https://tommasorigon.github.io/datamining/slides/un_C.html#effective-degrees-of-freedom-ii
    
    #According to https://scikit-learn.org/stable/glossary.html#term-cross-validation-estimator
    # RidgeCV with LOO CV, unlike other CV estimators, does not refit on the *entire* dataset
    # (but rather refits on n_samples-1 due to LOO). This is why its effective dof
    # is marginally different from the other estimators below that do refit on the
    # entire dataset (n_samples).
    H_ridge1 = X @ np.linalg.pinv(X.T @ X + ridge1.alpha_ * np.eye(p)) @ X.T
    print('  Effective dof (using n_samples - 1):', 1 + H_ridge1.trace().round(2))
    
    #
    # ElasticNetCV with LOO CV
    #
    from sklearn.model_selection import LeaveOneOut
    
    fold_size = len(X) - 1
    alphas_scaled = alphas / fold_size #for equivalent cost func to RidgeCV
    
    ridge2 = ElasticNetCV(
        cv=LeaveOneOut(), alphas=alphas_scaled,
        l1_ratio=1e-10, random_state=0
    )
    ridge2.fit(X, y)
    y_pred = ridge2.predict(X=X)
    MSE = mean_squared_error(y_true=y, y_pred=y_pred)
    
    print('\nElasticNetCV using LOO')
    print(
        '  best alpha | scaled:', round(ridge2.alpha_, 3),
        '| unscaled:', round(ridge2.alpha_ * fold_size, 3), '<---'
    )
    print("  MSE:", round(MSE, 3))
    
    #See H_ridge1 calculation for references
    lambda2 = ridge2.alpha_ * len(X)
    H_ridge2 = X @ np.linalg.pinv(X.T @ X + lambda2 * np.eye(p)) @ X.T
    print('  Effective dof (using n_samples):', 1 + H_ridge2.trace().round(2))
    
    #
    # "Manual" version of ElasticNetCV
    # i.e. ElasticNet + GridSearchCV using LOO
    # Used for getting CV score for each alpha
    #
    from sklearn.linear_model import ElasticNet
    from sklearn.model_selection import GridSearchCV
    
    search = GridSearchCV(
        ElasticNet(l1_ratio=1e-10),
        param_grid=dict(alpha=list(alphas / fold_size)),
        scoring='neg_mean_squared_error',
        cv=LeaveOneOut(),
        n_jobs=-1
    ).fit(X, y)
    
    ridge3 = search.best_estimator_
    y_pred = ridge3.predict(X)
    MSE = mean_squared_error(y, y_pred)
    
    print('\nElasticNet + GridSearchCV using LOO')
    print(
        "  best alpha | scaled:", round(ridge3.alpha, 3),
        "| unscaled:", round(ridge3.alpha * fold_size, 3), '<---'
    )
    print("  MSE:", round(MSE, 3))
    
    #See H_ridge1 calculation for references
    lambda3 = ridge3.alpha * len(X)
    H_ridge3 = X @ np.linalg.pinv(X.T @ X + lambda3 * np.eye(p)) @ X.T
    print('  Effective dof (using n_samples):', 1 + H_ridge3.trace().round(2))
    
    #
    # Plot CV scores vs alpha
    #
    plt.scatter(
        alphas, ridge1.cv_values_.mean(axis=0),
        color='black', marker='s', s=50, label='RidgeCV'
    )
    plt.scatter(
        alphas, -search.cv_results_['mean_test_score'],
        color='gray', marker='.', label=r'ElasticNet(L1$\approx$0) + LOO CV'
    )
    plt.legend(fontsize=10)
    plt.xlabel(r'$\alpha$')
    plt.ylabel('CV MSE')
    plt.gcf().set_size_inches(8, 3)
    
    #Extra formatting
    for s in ['right', 'top']:
        plt.gca().spines[s].set_visible(False)
    plt.gca().spines['bottom'].set_bounds(0, 3000)
    plt.gca().spines['left'].set_bounds(1.35, 1.5)
    plt.show()