
Calculate adjusted R2 using GridSearchCV


I am trying to use GridSearchCV with multiple scoring metrics, one of which is the adjusted R2. As far as I know, the latter is not implemented in scikit-learn. I would like to confirm whether my approach to implementing the adjusted R2 is correct.

Using the scores implemented in scikit-learn (MAE and R2 in the example below), I can do something like the following (this dummy example ignores good practices such as feature scaling and a suitable number of iterations for SVR):

import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score, mean_absolute_error

#generate input
X = np.random.normal(75, 10, (1000, 2))
y = np.random.normal(200, 20, 1000)

#perform grid search
params = {"degree": [2, 3], "max_iter": [10]}
grid = GridSearchCV(SVR(), param_grid=params,
                    scoring={"MAE": "neg_mean_absolute_error", "R2": "r2"}, refit="R2")
grid.fit(X, y)

The example above reports the MAE and R2 for each cross-validation split and refits the estimator with the parameters that gave the best R2. Following this example, I have attempted to do the same using a custom scorer:

def adj_r2(true, pred, p=2):
    '''p is the number of independent variables and n is the sample size'''
    n = true.size
    return 1 - ((1 - r2_score(true, pred)) * (n - 1))/(n-p-1)

scorer=make_scorer(adj_r2)
grid = GridSearchCV(SVR(), param_grid=params,
                    scoring={"MAE": "neg_mean_absolute_error", "adj R2": scorer}, refit="adj R2")
grid.fit(X, y)

#print(grid.cv_results_)

The code above appears to generate values for the "adj R2" scorer. I have two questions:

  1. Is the approach used above technically correct coding-wise?
  2. If the approach is correct, how can I define p (number of independent variables) in a dynamic way? As you can see, I had to force a default when defining the function, but I would like to be able to define p in GridSearchCV.

Solution

  • Firstly, an adjusted R2 score is not available in sklearn so far because scoring functions take only y_true and y_pred. They never see X, so measuring its dimensions inside such a function is out of the question.

    There is a workaround for the SearchCV classes.

    A scorer passed to them needs the signature (estimator, X, y). This is the kind of callable that make_scorer produces.

    Below is a simplified version of that idea, wrapping r2_score directly.

    def adj_r2(estimator, X, y_true):
        # n = number of samples, p = number of independent variables,
        # both read from the X that the search passes to the scorer
        n, p = X.shape
        pred = estimator.predict(X)
        return 1 - ((1 - r2_score(y_true, pred)) * (n - 1)) / (n - p - 1)

    grid = GridSearchCV(SVR(), param_grid=params,
                        scoring={"MAE": "neg_mean_absolute_error",
                                 "adj R2": adj_r2}, refit="adj R2")
    grid.fit(X, y)
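
If you would rather keep the (y_true, y_pred) signature from the question, one possible alternative is to rely on make_scorer forwarding extra keyword arguments to the score function, so p can be read from X.shape[1] when the scorer is built instead of being hard-coded as a default. The sketch below is only an illustration: the smaller sample, the seeded generator and the "C" grid are assumptions chosen to keep it quick, not part of the original question. Note that in either approach n is the size of the validation fold being scored, not of the whole dataset.

```python
import numpy as np
from sklearn.metrics import make_scorer, r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR


def adj_r2(y_true, y_pred, p):
    """Adjusted R2 for n samples and p independent variables."""
    n = len(y_true)  # number of samples in the fold being scored
    return 1 - (1 - r2_score(y_true, y_pred)) * (n - 1) / (n - p - 1)


# Illustrative data (assumption: same distributions as the question, fewer rows)
rng = np.random.default_rng(0)
X = rng.normal(75, 10, (200, 2))
y = rng.normal(200, 20, 200)

# p is taken from the data here; make_scorer passes it on to adj_r2
scorer = make_scorer(adj_r2, p=X.shape[1])

grid = GridSearchCV(
    SVR(),
    param_grid={"C": [0.1, 1.0]},  # illustrative grid, not from the question
    scoring={"MAE": "neg_mean_absolute_error", "adj R2": scorer},
    refit="adj R2",
)
grid.fit(X, y)

# Multi-metric results are keyed by the names used in the scoring dict
adj_r2_means = grid.cv_results_["mean_test_adj R2"]
```

The trade-off is that p is fixed when the scorer is created, so this only suits pipelines where the number of features does not change between steps; the (estimator, X, y) version above reads p from whatever X the search passes in.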