Tags: r, statistics, linear-regression, scale, coefficients

Regression coefficients: changing the scale of the independent variables?


I'm running a weighted regression model, but I'm not sure how to handle one of the variables I need to include.

My dependent variable has values on a scale of thousands, while my independent variables are on scales of tens or hundreds, or are categorical.

I usually run the regression with the log of the dependent variable (this way I can interpret the estimated coefficients as percentage increases).

Here is an example:

[regression output]

How should I instead handle a regressor that has a scale of millions?

For example, when I include in my regression the variable occ_tot, expressed in millions, this is what happens:

[regression 2 output]

How should I interpret these coefficients? Is there a nice way to include an independent variable on a larger scale than the dependent one?

I'm new to this kind of thing...


Solution

  • We can scale a predictor as desired, and normally the coefficients will simply compensate: if we multiply a predictor by 100, the corresponding coefficient gets divided by 100, and vice versa, while the other coefficients are not affected.

    If some predictors are close to linearly dependent, one can run into problems, but that is the case even without scaling, so it is really a separate issue. Look at findCorrelation in the caret package to eliminate highly correlated predictors, and try the regression with and without eliminating such predictors to see whether it matters in your case.
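    As a quick base-R sketch of the idea (caret's findCorrelation automates a refined version of this), we can flag predictor pairs whose absolute correlation exceeds a cutoff; the 0.85 cutoff below is an arbitrary choice for illustration:

    ```r
    # Flag highly correlated predictor pairs among some mtcars columns.
    X <- mtcars[, c("cyl", "disp", "hp", "wt")]
    cm <- abs(cor(X))
    cm[upper.tri(cm, diag = TRUE)] <- 0        # keep each pair only once
    pairs <- which(cm > 0.85, arr.ind = TRUE)  # cutoff of 0.85 is arbitrary
    data.frame(var1 = rownames(cm)[pairs[, "row"]],
               var2 = colnames(cm)[pairs[, "col"]],
               r    = cm[pairs])
    ```

    With caret installed, `findCorrelation(cor(X), cutoff = 0.85)` would instead return the indices of columns suggested for removal.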

    The first lm below is the original regression. In the second lm we multiply the wt predictor by 100, and we see that its coefficient is simply divided by 100 while the other coefficients stay the same. A similar thing happens in the third lm, where we divide wt by 100 and the coefficient again compensates, with the other coefficients unchanged.

    Also note that all three of these are really the same model, differing only in parameterization, so they produce the same fitted values, the same residuals, and the same residual sum of squares.

    coef(lm(mpg ~ cyl + wt, mtcars))  # original lm
    ## (Intercept)         cyl          wt 
    ##   39.686261   -1.507795   -3.190972 
    
    coef(lm(mpg ~ cyl + I(100 * wt), mtcars))
    ## (Intercept)         cyl I(100 * wt) 
    ## 39.68626148 -1.50779497 -0.03190972
    
    coef(lm(mpg ~ cyl + I(wt/100), mtcars)) 
    ## (Intercept)         cyl   I(wt/100) 
    ##   39.686261   -1.507795 -319.097214 
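
    We can verify that the three parameterizations are the same model by refitting them and comparing fitted values and residual sums of squares directly (a sketch using the same mtcars fits as above):

    ```r
    fit1 <- lm(mpg ~ cyl + wt, mtcars)
    fit2 <- lm(mpg ~ cyl + I(100 * wt), mtcars)
    fit3 <- lm(mpg ~ cyl + I(wt / 100), mtcars)

    # Fitted values agree across parameterizations...
    stopifnot(all.equal(fitted(fit1), fitted(fit2)),
              all.equal(fitted(fit1), fitted(fit3)))

    # ...and so do the residual sums of squares.
    c(rss1 = sum(resid(fit1)^2),
      rss2 = sum(resid(fit2)^2),
      rss3 = sum(resid(fit3)^2))
    ```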
    

    If we take the log of a predictor, then changing wt to 100 * wt only affects the intercept, because log(100 * wt) = log(100) + log(wt). In the first lm below we use log(wt); in the second we use log(100 * wt).

    coef(lm(mpg ~ cyl + log(wt), mtcars))
    ## (Intercept)         cyl     log(wt) 
    ##   40.649777   -1.233526  -11.523812 
     
    coef(lm(mpg ~ cyl + log(100 * wt), mtcars))
    ## (Intercept)           cyl log(100 * wt) 
    ##   93.718894     -1.233526    -11.523812
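
    The intercept shift can be checked numerically: since log(100 * wt) = log(100) + log(wt), the new intercept should equal the old intercept minus the log(wt) slope times log(100) (a sketch reusing the mtcars fits shown above):

    ```r
    f1 <- lm(mpg ~ cyl + log(wt), mtcars)
    f2 <- lm(mpg ~ cyl + log(100 * wt), mtcars)

    # Only the intercept moves: new intercept = old intercept - slope * log(100).
    shifted <- coef(f1)[["(Intercept)"]] - coef(f1)[["log(wt)"]] * log(100)
    stopifnot(all.equal(shifted, coef(f2)[["(Intercept)"]]))

    # The slope itself is unchanged.
    stopifnot(all.equal(coef(f1)[["log(wt)"]], coef(f2)[["log(100 * wt)"]]))
    ```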