Tags: scikit-learn, statistics, pearson-correlation, multicollinearity, coefficient-of-determination

Understanding Coefficient of Determination


I was going through the documentation to understand the coefficient of determination, and my understanding from it is that the coefficient of determination is simply R x R, the square of the correlation coefficient.

So I took the housing price dataset from kaggle.com and tried it out to understand this better. Here is my code.

First, I computed the correlation coefficient:

import pandas as pd

test_data = pd.read_csv(r'\house_price\test.csv')
# keep the two columns of interest and replace missing values with 0
_d = test_data[['MSSubClass', 'LotFrontage']].fillna(0)
_d.corr()

[image: output of _d.corr()]

Then I computed the coefficient of determination like this:

from sklearn.metrics import r2_score
r2_score(_d['MSSubClass'],_d['LotFrontage'])

for which I got the value -0.9413195412943647.

Shouldn't it be 0.060531252961, since -0.246031 x -0.246031 = 0.060531252961?


Solution

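  • Following the docs: https://scikit-learn.org/stable/modules/model_evaluation.html#r2-score

    r2_score(y_true, y_pred) is defined as:

        R^2 = 1 - sum_i (y_i - yhat_i)^2 / sum_i (y_i - ybar)^2

    where y_i are the true values (first argument), yhat_i are the predicted values (second argument), and ybar is the mean of the true values.

    Whereas the df.corr method (with Pearson correlation) computes:

        r = (n*sum(x*y) - sum(x)*sum(y)) / sqrt((n*sum(x^2) - sum(x)^2) * (n*sum(y^2) - sum(y)^2))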

    So let's build an example:

    x   y
    1   1
    1   0
    0   0
    1   1
    

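    Correlation: n = 4, sum(x*y) = 1+0+0+1 = 2, sum(x) = 3, sum(y) = 2, sum(x^2) = 3, sum(y^2) = 2, so

        r = (4*2 - 3*2) / sqrt((4*3 - 3^2) * (4*2 - 2^2)) = (8 - 6) / sqrt(3 * 4) = 2 / sqrt(12) ≈ 0.577, and r^2 = 1/3 ≈ 0.333

    whereas R2, with x as the true values and y as the predictions, is:

        R2 = 1 - ((0)^2 + (1)^2 + (0)^2 + (0)^2) / ((1-0.75)^2 + (1-0.75)^2 + (0-0.75)^2 + (1-0.75)^2) = 1 - 1/0.75 ≈ -0.333

    The two are different quantities: r2_score treats its second argument as predictions of the first and does not fit a line. It matches r^2 when the predictions come from a least-squares linear fit (with intercept), not when it is given two arbitrary columns.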
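As a quick check of the small x/y table above, here is a minimal sketch in plain Python that evaluates both quantities by hand. The helper names pearson_r and r2 are made up for this illustration (they are not pandas or scikit-learn functions); r2 is meant to follow the documented r2_score definition, with the first argument as the true values.

```python
import math

def pearson_r(x, y):
    """Pearson correlation using the raw-sums form of the formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

x = [1, 1, 0, 1]
y = [1, 0, 0, 1]

r = pearson_r(x, y)
r_squared = r ** 2
r2_xy = r2(x, y)  # x plays the role of y_true, y the role of y_pred

print("r   =", r)          # about 0.577
print("r^2 =", r_squared)  # about 0.333
print("R2  =", r2_xy)      # about -0.333
```

The squared correlation is positive while R2 is negative, which is the same kind of mismatch seen in the question.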

    Hope that helps
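To see when the two numbers do agree, the sketch below fits a least-squares line of y on x and scores the fitted values instead of the raw column. It is written in plain Python so it runs without dependencies; fit_line, pearson_r and r2 are hypothetical helper names for this illustration only. With scikit-learn one would more typically fit LinearRegression and pass its predictions to r2_score.

```python
import math

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def pearson_r(x, y):
    """Pearson correlation using centered sums."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

x = [1, 1, 0, 1]
y = [1, 0, 0, 1]

slope, intercept = fit_line(x, y)
y_pred = [slope * a + intercept for a in x]  # fitted values from the line

r_squared = pearson_r(x, y) ** 2
r2_fitted = r2(y, y_pred)  # scoring the fitted line: agrees with r^2
r2_raw = r2(x, y)          # scoring one raw column against the other: does not

print("r^2           =", r_squared)
print("R2 (fitted)   =", r2_fitted)
print("R2 (raw cols) =", r2_raw)
```

So to get 0.0605 on the housing data, the second argument would need to be the predictions of a linear model fitted on one column to predict the other, rather than the other column itself.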