Tags: r, glm, dummy-variable

How to fix singularities in GLM variables in R


I have an issue with adding one of my (dummy) variables to my Poisson model. When I run the model with just this variable as the explanatory variable (it expands into six dummy variables) and no intercept, it works perfectly fine. But when I try to add it on top of an existing model with more explanatory variables, the final dummy is not computed because of collinearity. That confuses me, as it works normally when just the dummies are in the model.

I have already tried changing the order of the explanatory variables, but as soon as I add 'location' to any setup, the final dummy is no longer estimated.

As the final variable is relevant for my results, even if it is directly collinear with another variable, I am at a loss for what to do.

Thanks in advance for your help!

The model with just the variable location (R expands the factor into dummies automatically):

model = glm(formula = taeni_number ~ 0 + location,
                         data=data, family=quasipoisson(link="log"))

The model with the other explanatory variables first:

model = glm(formula = taeni_number ~ 0 + grass + gravel + multi + location,
                         data=data, family=quasipoisson(link="log"))

The dataset used is:

structure(list(grass = c(0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 
0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 
1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 
0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 
0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 
1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 
0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 
0L, 1L, 0L), gravel = c(1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 
0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 
0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 
1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 
0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 
0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 
1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 
0L, 0L), multi = c(0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 
1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 
0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 
0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 
1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 
0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 
0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 
1L), taeni_number = c(0L, 0L, 0L, 0L, 1L, 1L, 1L, 2L, 0L, 0L, 
2L, 0L, 1L, 3L, 0L, 0L, 0L, 1L, 8L, 34L, 5L, 11L, 19L, 7L, 0L, 
1L, 2L, 0L, 1L, 0L, 14L, 19L, 10L, 15L, 8L, 3L, 0L, 1L, 0L, 0L, 
0L, 0L, 0L, 3L, 1L, 1L, 2L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 4L, 9L, 
5L, 8L, 8L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 6L, 15L, 2L, 2L, 9L, 
3L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 2L, 0L, 1L, 2L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 
6L, 1L, 1L, 4L, 1L), location = structure(c(1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 
4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 
3L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 
6L, 6L, 6L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 
3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 
5L, 6L, 6L, 6L, 6L, 6L, 6L), levels = c("0", "1", "2", "3", "4", 
"5"), class = "factor")), class = "data.frame", row.names = c(NA, 
-108L))

Solution

  • The problem is actually that grass, gravel and multi are perfectly collinear as well: exactly one of them is 1 in every row, so the three columns sum to 1, just as the six location dummies do. Together the two sets of columns are linearly dependent. Since you've taken the intercept out, R can estimate all of the grass, gravel and multi coefficients in your ordering, but one of the location dummies then has to be dropped. If you put location first instead, all six location coefficients are kept and the last of the remaining collinear variables (multi in this case) is dropped, as in the output below; a quick check of the dependency follows the output.

    glm(formula = taeni_number ~ 0 + location + gravel + grass + multi,
        data = d, family = quasipoisson(link = "log"))
    
    # Call:  glm(formula = taeni_number ~ 0 + location + gravel + grass + 
    #     multi, family = quasipoisson(link = "log"), data = d)
    # 
    # Coefficients:
    # location0  location1  location2  location3  location4  location5     gravel      grass      multi  
    #   -2.4634    -1.0771    -1.9526     1.2583    -1.0771     1.2255     0.5193     1.1605         NA  
    # 
    # Degrees of Freedom: 108 Total (i.e. Null);  100 Residual
    # Null Deviance:        850.9 
    # Residual Deviance: 236.5  AIC: NA
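
    You can verify the linear dependency directly. A minimal check in base R (assuming the question's data frame has been assigned to d, as in the call above):

    # exactly one of grass, gravel and multi is 1 in every row, so the three columns sum to 1
    with(d, table(grass + gravel + multi))

    # the six location dummies also sum to 1 in every row, so the combined design
    # matrix has 9 columns but only rank 8 -- one coefficient must be dropped (the NA)
    X <- model.matrix(~ 0 + location + gravel + grass + multi, data = d)
    ncol(X)      # 9
    qr(X)$rank   # 8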
    

    Edit: Adding Requested Tests

    The tests suggested in the comments do not require including dummies for every level in the model. You can do this the conventional way - leaving a reference level for each categorical variable and doing the tests after the fact. One option is a general linear hypothesis test, which lets you test the difference between every pair of levels. First, we recode the three substrate dummies into a single factor and re-estimate the model with both categorical variables:

    library(dplyr)
    library(tidyr)
    library(multcomp)
     
    d <- structure(list(grass = c(0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 
    0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 
    1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 
    0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 
    0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 
    1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 
    0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 
    0L, 1L, 0L), gravel = c(1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 
    0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 
    0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 
    1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 
    0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 
    0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 
    1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 
    0L, 0L), multi = c(0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 
    1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 
    0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 
    0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 
    1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 
    0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 
    0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 
    1L), taeni_number = c(0L, 0L, 0L, 0L, 1L, 1L, 1L, 2L, 0L, 0L, 
    2L, 0L, 1L, 3L, 0L, 0L, 0L, 1L, 8L, 34L, 5L, 11L, 19L, 7L, 0L, 
    1L, 2L, 0L, 1L, 0L, 14L, 19L, 10L, 15L, 8L, 3L, 0L, 1L, 0L, 0L, 
    0L, 0L, 0L, 3L, 1L, 1L, 2L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 4L, 9L, 
    5L, 8L, 8L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 6L, 15L, 2L, 2L, 9L, 
    3L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 2L, 0L, 1L, 2L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 
    6L, 1L, 1L, 4L, 1L), location = structure(c(1L, 1L, 1L, 1L, 1L, 
    1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 
    4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 1L, 
    1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 
    3L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 
    6L, 6L, 6L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 
    3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 
    5L, 6L, 6L, 6L, 6L, 6L, 6L), levels = c("0", "1", "2", "3", "4", 
    "5"), class = "factor")), class = "data.frame", row.names = c(NA, 
    -108L))
    
    d <- d %>% 
      mutate(substrate = multi + 2*gravel + 3*grass, 
             substrate = factor(substrate, labels=c("Multiple", "Gravel", "Grass")))
    
    model <- glm(formula = taeni_number ~ location + substrate,
                data=d, family=quasipoisson(link="log"))
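
    This parameterization spans the same column space as the no-intercept fit shown above (an intercept plus five location dummies plus two substrate dummies), so the fitted values are identical; it is just a different coding of the same model. As a quick sanity check (m_no_int is only an illustrative name for a re-fit of the earlier specification):

    # both codings should give the same residual deviance (about 236.5, as above)
    m_no_int <- glm(taeni_number ~ 0 + location + gravel + grass + multi,
                    data = d, family = quasipoisson(link = "log"))
    c(deviance(m_no_int), deviance(model))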
    
    

    You can use glht() from multcomp and its summary method to test all pairwise comparisons. Note that I have requested no multiplicity correction for the tests, but that can be changed via the type argument of the adjusted() function below:

    pwc_substrate <- summary(glht(model, linfct=mcp(substrate = "Tukey")), 
                             test=adjusted(type="none"))
    
    pwc_substrate
    #> 
    #>   Simultaneous Tests for General Linear Hypotheses
    #> 
    #> Multiple Comparisons of Means: Tukey Contrasts
    #> 
    #> 
    #> Fit: glm(formula = taeni_number ~ location + substrate, family = quasipoisson(link = "log"), 
    #>     data = d)
    #> 
    #> Linear Hypotheses:
    #>                        Estimate Std. Error z value Pr(>|z|)    
    #> Gravel - Multiple == 0   0.5193     0.2869   1.810  0.07031 .  
    #> Grass - Multiple == 0    1.1605     0.2604   4.457  8.3e-06 ***
    #> Grass - Gravel == 0      0.6412     0.2165   2.961  0.00306 ** 
    #> ---
    #> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    #> (Adjusted p values reported -- none method)
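
    If you do want a multiplicity correction, only the type argument needs to change. For example, Holm-adjusted p-values (one of the adjustment methods adjusted() accepts) would be requested like this:

    # same comparisons as above, but with Holm-adjusted p-values instead of "none"
    summary(glht(model, linfct = mcp(substrate = "Tukey")),
            test = adjusted(type = "holm"))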
    

    The pwc_substrate output tells you that Grass has significantly higher predicted counts than either Multiple or Gravel, but that Gravel and Multiple are not significantly different from each other. The same pairwise comparisons for location are below:

    pwc_location <- summary(glht(model, linfct=mcp(location = "Tukey")), 
                            test=adjusted(type="none"))
    
    pwc_location
    #> 
    #>   Simultaneous Tests for General Linear Hypotheses
    #> 
    #> Multiple Comparisons of Means: Tukey Contrasts
    #> 
    #> 
    #> Fit: glm(formula = taeni_number ~ location + substrate, family = quasipoisson(link = "log"), 
    #>     data = d)
    #> 
    #> Linear Hypotheses:
    #>              Estimate Std. Error z value Pr(>|z|)    
    #> 1 - 0 == 0  1.386e+00  1.005e+00   1.379    0.168    
    #> 2 - 0 == 0  5.108e-01  1.137e+00   0.449    0.653    
    #> 3 - 0 == 0  3.722e+00  9.101e-01   4.089 4.32e-05 ***
    #> 4 - 0 == 0  1.386e+00  1.005e+00   1.379    0.168    
    #> 5 - 0 == 0  3.689e+00  9.104e-01   4.052 5.08e-05 ***
    #> 2 - 1 == 0 -8.755e-01  8.291e-01  -1.056    0.291    
    #> 3 - 1 == 0  2.335e+00  4.709e-01   4.960 7.06e-07 ***
    #> 4 - 1 == 0  1.110e-15  6.359e-01   0.000    1.000    
    #> 5 - 1 == 0  2.303e+00  4.716e-01   4.883 1.05e-06 ***
    #> 3 - 2 == 0  3.211e+00  7.105e-01   4.519 6.20e-06 ***
    #> 4 - 2 == 0  8.755e-01  8.291e-01   1.056    0.291    
    #> 5 - 2 == 0  3.178e+00  7.109e-01   4.470 7.81e-06 ***
    #> 4 - 3 == 0 -2.335e+00  4.709e-01  -4.960 7.06e-07 ***
    #> 5 - 3 == 0 -3.279e-02  1.995e-01  -0.164    0.869    
    #> 5 - 4 == 0  2.303e+00  4.716e-01   4.883 1.05e-06 ***
    #> ---
    #> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    #> (Adjusted p values reported -- none method)
    

    It's a bit difficult to see what's going on here, but if you re-order the levels of the location factor to be increasing in the average number of animals collected, the results become a bit clearer (though note that they are exactly the same, just reordered).

    d$location <- reorder(d$location, d$taeni_number, mean)
    model <- glm(formula = taeni_number ~ location + substrate,
                 data=d, family=quasipoisson(link="log"))
    
    pwc_location <- summary(glht(model, linfct=mcp(location = "Tukey")), 
                            test=adjusted(type="none"))
    
    pwc_location
    #> 
    #>   Simultaneous Tests for General Linear Hypotheses
    #> 
    #> Multiple Comparisons of Means: Tukey Contrasts
    #> 
    #> 
    #> Fit: glm(formula = taeni_number ~ location + substrate, family = quasipoisson(link = "log"), 
    #>     data = d)
    #> 
    #> Linear Hypotheses:
    #>             Estimate Std. Error z value Pr(>|z|)    
    #> 2 - 0 == 0 5.108e-01  1.137e+00   0.449    0.653    
    #> 1 - 0 == 0 1.386e+00  1.005e+00   1.379    0.168    
    #> 4 - 0 == 0 1.386e+00  1.005e+00   1.379    0.168    
    #> 5 - 0 == 0 3.689e+00  9.104e-01   4.052 5.08e-05 ***
    #> 3 - 0 == 0 3.722e+00  9.101e-01   4.089 4.32e-05 ***
    #> 1 - 2 == 0 8.755e-01  8.291e-01   1.056    0.291    
    #> 4 - 2 == 0 8.755e-01  8.291e-01   1.056    0.291    
    #> 5 - 2 == 0 3.178e+00  7.109e-01   4.470 7.81e-06 ***
    #> 3 - 2 == 0 3.211e+00  7.105e-01   4.519 6.20e-06 ***
    #> 4 - 1 == 0 1.110e-15  6.359e-01   0.000    1.000    
    #> 5 - 1 == 0 2.303e+00  4.716e-01   4.883 1.05e-06 ***
    #> 3 - 1 == 0 2.335e+00  4.709e-01   4.960 7.06e-07 ***
    #> 5 - 4 == 0 2.303e+00  4.716e-01   4.883 1.05e-06 ***
    #> 3 - 4 == 0 2.335e+00  4.709e-01   4.960 7.06e-07 ***
    #> 3 - 5 == 0 3.279e-02  1.995e-01   0.164    0.869    
    #> ---
    #> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    #> (Adjusted p values reported -- none method)
    

    Here you can see that locations 5 and 3 are significantly bigger than all of the other levels but, according to the last row of the output, not significantly different from each other. Further, there are no significant differences among locations 0, 1, 2, and 4. This may be all you need, but there are also ways of visualizing these results. I have one in the psre package, which you can install with

    install.packages("remotes")
    remotes::install_github("davidaarmstrong/psre")
    

    In addition to what you've done above, you'll need to create the effects for each of the variables using ggpredict():

    library(psre)
    library(ggeffects)
    library(ggplot2)
    
    eff_substrate <- ggpredict(model, 
                     "substrate", 
                     ci.lvl = .95)
    
    eff_location <- ggpredict(model, 
                          "location", 
                          ci.lvl = .95)
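
    A small caveat: newer releases of ggeffects renamed the confidence-level argument (from ci.lvl to ci_level, if I recall correctly), and 0.95 is the default in any case. If the calls above complain about an unknown or deprecated argument, the equivalent is roughly:

    # assumes a newer ggeffects where the argument is spelled ci_level
    eff_substrate <- ggpredict(model, "substrate", ci_level = 0.95)
    eff_location <- ggpredict(model, "location", ci_level = 0.95)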
    

    Then, you can extract the compact letter display information from the pairwise comparisons generated above:

    cld_substrate <- cld(pwc_substrate)
    cld_location <- cld(pwc_location)
    substrate_lmat <- cld_substrate$mcletters$LetterMatrix
    location_lmat <- cld_location$mcletters$LetterMatrix
    

    Next, reorder the factor levels by predicted number of animals caught:

    eff_substrate$x <- reorder(eff_substrate$x, eff_substrate$predicted, mean)
    eff_location$x <- reorder(eff_location$x, eff_location$predicted, mean)
    

    Finally, pass the effects and letter information to the letter_plot() function. The points with lines through them are predicted values with 95% confidence intervals. The points in the right-hand margin of the plot represent the compact letter display: groups that share a letter are not significantly different from each other, while groups with no letter in common are significantly different.

    letter_plot(eff_substrate, substrate_lmat) + 
      labs(x="Predicted # Animals Caught\n(95% Confidence Interval)") + 
      ggtitle("Effect of Substrate")
    #> Joining with `by = join_by(x)`
    #> Warning: Removed 1 rows containing missing values (`geom_point()`).
    #> Warning: Removed 2 rows containing missing values (`geom_point()`).
    

    
    letter_plot(eff_location, location_lmat) + 
      labs(x="Predicted # Animals Caught\n(95% Confidence Interval)") + 
      ggtitle("Effect of Location")
    #> Joining with `by = join_by(x)`
    #> Warning: Removed 2 rows containing missing values (`geom_point()`).
    #> Warning: Removed 4 rows containing missing values (`geom_point()`).
    

    Created on 2023-05-26 with reprex v2.0.2