rregressioncorrelationglmlm

How to extract Correlation of Coefficients table from models like glm?


After running a statistical model in R (e.g., glm, lm, lme4::lmer, etc.), I run the summary() command with corr=TRUE to get the Correlation of Coefficients table. It features a matrix of correlations for the intercept and the independent variables in the given model. What I would like to do is extract the columns that make those correlations possible. For instance, if the model is x ~ a + b, and the table looks like this...

 (Intercept)   a   
a    0.##      
b    0.##     0.##

I want to extract the columns that allow for the correlation between a and b.

I have found various commands to extract fitted values and residuals, but nothing to yet that would give me values for a and b that correspond to their correlations in that table.


Solution

  • What you want is the summary(.)$correlation.

    If you want to extract something from an object, look into the structure str(object) to find the desired element. (If you are working in RStudio, perhaps start with str(object, max.levels=1) as it tends to hang up or crash if the output is too large.)

    Looking into the source code of fitted() methods such as stats:::fitted.lm reveals, that they essentially extract the $fitted.values element of the respective object (similar applies to resid()).

    Same way we could write a function ex_corr_lm(). To obtain the RHS variables automatically we may use all.vars() on the formula element.

    The object also contains the model data which is the part of data from the original data used to fit the model. If there were NAs, they might be removed. We can use it to reconstruct the columns that "make those correlations possible".

    > ex_corr_lm <- \(object, data=FALSE, vars) {
    +   if (missing(vars)) {
    +     vars <- all.vars(s$call$formula)[-1]
    +   } 
    +   if (inherits(object, c("lm", "glm"))) {
    +     s <- summary(object, corr=TRUE)
    +     corr <- s$correlation[vars, vars]
    +     corr_data <- object$model[, vars]
    +   } else if (inherits(object, "lmerMod")) {
    +     s <- summary(object, corr=TRUE)
    +     corr <- as.matrix(s$vcov@factors$correlation)[vars, vars]
    +     corr_data <- object@frame[, vars]
    +   } else {
    +     stop('not implemented.')
    +   }
    +   if (data) {
    +     list(corr=corr, corr_data=corr_data)
    +   } else {
    +     corr
    +   }
    + }
    

    > ex_corr_lm(f_lm)
               a          b
    a  1.0000000 -0.4798395
    b -0.4798395  1.0000000
    > 
    > ex_corr_lm(f_lm, vars=c('a', 'b'))  ## explicitly specify vars
               a          b
    a  1.0000000 -0.4798395
    b -0.4798395  1.0000000
    > 
    > ex_corr_lm(f_glm)
               a          b
    a  1.0000000 -0.4798395
    b -0.4798395  1.0000000
    > 
    > ex_corr_lm(f_lmer)
               a          b
    a  1.0000000 -0.4798395
    b -0.4798395  1.0000000
    

    If we set data=TRUE, we can obtain the respective columns. ex_corr_lm() throws a list with correlations and data columns in this case.

    > ex_corr_lm(f_lm, TRUE)
    $corr
               a          b
    a  1.0000000 -0.4798395
    b -0.4798395  1.0000000
    
    $corr_data
       a   b
    1  1   1
    2  2   4
    3  3   9
    4  4  16
    5  5  25
    6  1  36
    7  2  49
    8  3  64
    9  4  81
    10 5 100
    

    We can use cor() on them. Note, that correlations of the estimated coefficients and the correlation between the variables themselves aren't necessarily the same.

    > ex_corr_lm(f_lm, TRUE)$corr_data |> cor()
              a         b
    a 1.0000000 0.4798395
    b 0.4798395 1.0000000
    

    Data:

    > set.seed(42)
    > d <- data.frame(a=rep(1:5, 2), b=(1:10)^2);d$x <- .5 + 2*d$a + .1*d$b + rnorm(10,,.5)
    > f_lm <- lm(x ~ a + b, d,)
    > f_glm <- glm(x ~ a + b, d, fam=gaussian())
    > f_lmer <- lme4::lmer(x ~ a + b + (1|a), d)