rglm

Extracting model coefficients for formula terms


If I fit an lm or glm with factors as explanatory variables, I get a model with estimates and SEs for each level of the factor (minus the baseline level). How can I reliably extract that info for estimates relating to each term in the formula?

Example:

> hec = reshape2::melt(HairEyeColor)
> head(hec)
   Hair   Eye  Sex value
1 Black Brown Male    32
2 Brown Brown Male    53
3   Red Brown Male    10

Then fit a model with two factors:

m = glm(value ~ -1 + Hair + Eye, data=hec)

The fitted coefficients are available in a named vector (and as a full table with these rownames via summary(m)$coef):

> coef(m)
HairBlack HairBrown   HairRed HairBlond   EyeBlue  EyeHazel  EyeGreen 
   22.500    44.750    17.875    24.875    -0.625   -15.875   -19.500 

but now I want to get the first four out since they relate to Hair, and the next three as they relate to Eye.

I could do this using string pattern matching (which I have been doing for specific examples, but I'm doing this often enough I want to get past this) but I am hoping there's a package or a function or an easy way to do this without recourse to string matching, which can possibly break if you construct your data perversely enough and get identical matching coefficient names...

(In the following example, both foo and foob terms generate a coefficient named "foobaz".)

> d = data.frame(foo=sample(c("bar","baz","qux","fnord"),100,TRUE), foob=sample(c("ar","az","sqq","moo"),100,TRUE), y=rnorm(100))
> coef(glm(y~foo+foob, data=d))
(Intercept)      foobaz    foofnord      fooqux      foobaz     foobmoo 
-0.37880701  0.40920067  0.10645248  0.19536840  0.21929754 -0.05912854 
    foobsqq 
 0.37137127 

This is also fragile for interaction terms, where the order of interacted factors depends on the order the terms appear in the formula. You could have X:Y in your formula and get a term out with Y:X if Y has already appeared in the formula. Great.

I've looked at the fitted model object for clues as to how outputs relate to formula terms, and I think something could be done using the $terms component, then using that to see how many levels are in each factor, and getting that many elements from the coefficients at a time, but also that requires compensating for intercept terms, and then interaction terms add another level of complexity.

So before I code this all up, has it been done?


Solution

  • Example:

    d <- data.frame(foo = sample(c("bar","baz","qux","fnord"), 100, TRUE),
                    foob = sample(c("ar","az","sqq","moo"), 100, TRUE),
                    y = rnorm(100))
    mod <- glm(y ~ foo + foob, data = d)
    

    model.matrix() is a function that generates a design matrix for a model. Apart from this, it also contains an attribute "assign", an integer vector with an entry for each column in the matrix giving the term in the formula which gave rise to the column.

    attr(model.matrix(mod), "assign")
    # [1] 0 1 1 1 2 2 2
    

    Value 0 corresponds to the intercept (if any), and positive values to terms in the order given by the "term.labels" attribute of the terms structure corresponding to the model.

    attr(terms(mod),  "term.labels") # or labels(terms(mod))
    # [1] "foo"  "foob"
    

    By combining the above, we can write a function that extracts the coefficients of specific term(s) from a model.

    getCoef <- function(term, model) {
      ind <- attr(model.matrix(model), "assign")
      lab <- labels(terms(model))
      coef(model)[ind %in% match(term, lab)]
    }
    
    getCoef("foo", mod)
    #      foobaz    foofnord      fooqux
    #  0.12634573  0.27871612 -0.04792015
    
    getCoef("foob", mod)
    #     foobaz    foobmoo    foobsqq
    # -0.1596896 -0.1452948 -0.0532562
    
    getCoef(c("foo", "foob"), mod)
    #      foobaz    foofnord      fooqux      foobaz     foobmoo     foobsqq
    #  0.12634573  0.27871612 -0.04792015 -0.15968959 -0.14529477 -0.05325620
    # (the two foobaz's from `foo` and `foob` are all extracted)
    

    If a nonexistent term is provided, ...

    getCoef("bar", mod)
    # named numeric(0)