python statsmodels

Why would R-squared decrease when I add an exogenous variable in OLS using Python statsmodels?


If I understand the OLS model correctly, this should never happen.

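from statsmodels.regression.linear_model import OLS

# trades is an existing pandas DataFrame with the columns used below
trades['const'] = 1
Y = trades['ret'] + trades['comms']
# X = trades[['potential', 'pVal', 'startVal', 'const']]  # with the constant
X = trades[['potential', 'pVal', 'startVal']]  # without the constant

ols = OLS(Y, X)
res = ols.fit()
print(res.summary())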

With the constant included I get an R-squared of 0.22, and without it I get 0.43. How is that even possible?


Solution

  • See the answer to "Statsmodels: Calculate fitted values and R squared".

    R-squared follows a different definition depending on whether the model contains a constant.

    In a linear model with a constant, R-squared is the standard definition, which uses a mean-only model as the reference. The total sum of squares is demeaned (centered).

    In a linear model without a constant, R-squared uses a model with no regressors at all as the reference, that is, one in which the constant is fixed at zero. In this case the total sum of squares is not demeaned (uncentered).

    Since the definition changes when we add or drop a constant, R-squared can go either way. Under a fixed definition of the total sum of squares, the explained sum of squares never decreases when we add explanatory variables: it increases, or stays unchanged if the new variable doesn't contribute anything.

    A late addition:

    For this to hold, the constant in the design matrix can be explicit or implicit. An implicit constant is present when some linear combination of the explanatory variables is constant.
    For example, when we include all dummies for a one-way categorical variable, the dummies in each row add up to one. So the model includes a constant even though there is no explicit constant or intercept term.