python, scikit-learn

`sklearn.metrics.r2_score` is giving wrong R2 value?


I noticed that `sklearn.metrics.r2_score` seems to give a wrong R2 value.

from sklearn.metrics import r2_score

r2_score([2, 4, 3, 34, 23], [21, 12, 3, 11, 17])   # -0.17
r2_score([21, 12, 3, 11, 17], [2, 4, 3, 34, 23])   # -4.36

However, the true R2 value should be 0.002 according to the RSQ function in Excel. R2 should be between 0 and 1, and switching the order of y_true and y_pred should not affect the final result. How can I fix this issue?

Also,

In simple linear regression (one predictor), the coefficient of determination is numerically equal to the square of the Pearson correlation coefficient.

I wonder why sklearn.metrics.r2_score differs from the squared Pearson correlation coefficient in this case.


Solution

  • Excel's RSQ function gives you a different r-squared. What it computes is the square of Pearson's correlation coefficient (r), which is also a commonly used metric.

    The actual R2 as in coefficient of determination is calculated as

    1 - (SS_res / SS_tot)

    where SS_res is the sum of squared residuals and SS_tot is the total sum of squares.

    Wikipedia: Coefficient of determination (https://en.wikipedia.org/wiki/Coefficient_of_determination)

    You can recreate the calculation and confirm that sklearn is correct. Excel is not wrong either; it simply reports a different metric altogether:

    import numpy as np
    
    y_true = np.array([2, 4, 3, 34, 23])
    y_pred = np.array([21, 12, 3, 11, 17])
    
    ss_res = np.sum((y_true - y_pred) ** 2)           # sum of squared residuals
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    
    r2 = 1 - (ss_res / ss_tot)  # Out: -0.174655908875178
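
As a quick sanity check (a minimal sketch, assuming scikit-learn is installed), the manual result can be compared directly against `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([2, 4, 3, 34, 23])
y_pred = np.array([21, 12, 3, 11, 17])

# manual coefficient of determination: 1 - SS_res / SS_tot
manual = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - np.mean(y_true)) ** 2)

# sklearn reproduces the manual formula
print(np.isclose(manual, r2_score(y_true, y_pred)))  # True
```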
    

    For the second question, about why the results differ when the arguments are switched: in sklearn.metrics.r2_score the two arguments are not interchangeable. When passed positionally, the first array is treated as y_true and the second as y_pred, and R2 is not symmetric in those two roles.

    https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html

    If you use the snippet above to run the calculation with the values flipped, you will see that sklearn is correct again: SS_res does not change, but SS_tot does, because it is computed only from y_true and its mean.
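
To see the asymmetry concretely, here is a numpy-only sketch (variable names `a` and `b` are illustrative): SS_res is the same under the swap, but SS_tot is computed from whichever array plays the y_true role, so the resulting R2 differs.

```python
import numpy as np

a = np.array([2, 4, 3, 34, 23])    # the original y_true
b = np.array([21, 12, 3, 11, 17])  # the original y_pred

# SS_res is symmetric: (a - b)**2 == (b - a)**2
ss_res = np.sum((a - b) ** 2)  # 990 either way

# SS_tot depends only on the array in the y_true position
ss_tot_a = np.sum((a - a.mean()) ** 2)  # 842.8
ss_tot_b = np.sum((b - b.mean()) ** 2)  # 184.8

r2_ab = 1 - ss_res / ss_tot_a  # -0.1747, with a as y_true
r2_ba = 1 - ss_res / ss_tot_b  # -4.3571, with b as y_true
```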

    UPDATE: In order to get the squared correlation coefficient that Excel reports (as per the discussion in the comments), you can use scipy instead:

    from scipy.stats import pearsonr
    
    r2 = pearsonr([2, 4, 3, 34, 23], [21, 12, 3, 11, 17])[0] ** 2  # Out: 0.002366878494073563
    

    https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html
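
On the second question more generally: R2 and the squared Pearson r do coincide, but only when y_pred comes from a least-squares simple linear regression of y on x fitted with an intercept; for an arbitrary pair of vectors, as in the question, there is no such identity. A numpy-only sketch of the equality:

```python
import numpy as np

x = np.array([2, 4, 3, 34, 23], dtype=float)
y = np.array([21, 12, 3, 11, 17], dtype=float)

# fit y = slope * x + intercept by ordinary least squares
slope, intercept = np.polyfit(x, y, 1)
y_fit = slope * x + intercept

# coefficient of determination of the *fitted* predictions
r2 = 1 - np.sum((y - y_fit) ** 2) / np.sum((y - y.mean()) ** 2)

# squared Pearson correlation between x and y
r_sq = np.corrcoef(x, y)[0, 1] ** 2

# for a fitted simple linear regression the two coincide
print(np.isclose(r2, r_sq))  # True
```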