
What does the high VIF for the constant term (intercept) indicate?


I am building a linear regression model on a car dataset using the RFE technique and the statsmodels library. My final model has p-values well below 5% and a high F-statistic. The VIF values for the predictor variables are well below 5, but the VIF for the constant term (intercept) is 8.18. I used the add_constant method to add the constant to the model. My questions are:

  1. What does a high VIF for the constant indicate?
  2. Should I ignore the constant term when calculating VIF?

These are my results:

[Image: summary of my final model]

[Image: VIF results for the model]

I am new to machine learning and this is my first question on this site. Please let me know if any more information is needed to answer it.


Solution

  • Statistical questions are better asked on stats.stackexchange. However, I just went through this for statsmodels; see, e.g., https://github.com/statsmodels/statsmodels/issues/2376

    First, there is no multicollinearity problem in your model and data. The p-values are low and the confidence intervals are fairly narrow, so the parameters in the model should be good estimates. A VIF of 8 is not large.

    A large VIF for the constant indicates that the (slope) explanatory variables also have a large constant component. An example would be a variable that has a large mean but only a small variance. An example of perfect collinearity with the constant, and hence rank deficiency of the design matrix, is the dummy variable trap: if we do not remove one of the levels of a categorical variable in dummy encoding, the dummies sum to 1 and therefore replicate a constant.

    The purpose of including the constant in the VIF computation is to discover these kinds of problems with the design matrix exog provided by the user. They would not show up if we computed VIF on demeaned or standardized explanatory variables.
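
    A rough way to see why demeaning hides the issue is to compute the constant's VIF by hand. The helper constant_vif below is hypothetical, written only for illustration with NumPy; it assumes the VIF of the constant is 1 / (1 - R²) with an uncentered R² from regressing a column of ones on the slope columns.

```python
import numpy as np

def constant_vif(slopes):
    """VIF of an intercept column, given only the slope columns.

    Regresses a column of ones on the slope columns (no extra intercept)
    and uses the uncentered R^2 of that regression.
    """
    ones = np.ones(slopes.shape[0])
    coef, *_ = np.linalg.lstsq(slopes, ones, rcond=None)
    resid = ones - slopes @ coef
    r2_uncentered = 1.0 - (resid @ resid) / (ones @ ones)
    return 1.0 / (1.0 - r2_uncentered)

rng = np.random.default_rng(1)
# Synthetic slope variables with large means and small spreads.
raw = np.column_stack([
    rng.normal(50.0, 2.0, 300),
    rng.normal(10.0, 1.0, 300),
])
demeaned = raw - raw.mean(axis=0)

vif_raw = constant_vif(raw)
vif_demeaned = constant_vif(demeaned)
print(vif_raw, vif_demeaned)  # large for raw data, essentially 1 after demeaning
```

    Demeaned columns are orthogonal to the column of ones, so the regression explains nothing and the constant's VIF collapses to 1 regardless of how the raw data looked.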

    There has been a long-standing debate in statistics and econometrics about whether multicollinearity measures should include a constant or work only with demeaned explanatory variables.

    I am currently preparing an extension to statsmodels that gives users the option to compute both versions, with and without the constant. In some cases reparameterization, such as demeaning and scaling, can improve numerical precision and prediction. So we want measures that check the actual design matrix provided by users, but that also check a standardized version of the data to see whether demeaning and scaling might improve numerical precision.