rpredictionglmbinomial-coefficients

Generating predictions from an aggregated binomial regression


Assessing model accuracy is reasonably easy with Bernoulli outcomes, but I am unsure how to generate meaningful predictions from an aggregated binomial regression.

Take this example. We want to model the number of drug counselling sessions (variable numCouns) a client attends over a twelve-week period based on: (1) how many years they had been using cannabis regularly prior to starting treatment (variable durationRegUse) and (2) the number of grams of cannabis they used on an average day (variable gms). The maximum number of counselling sessions each client can attend is six.

Here is the data

df <- data.frame(durationRegUse = c(19, 9, 13, 19, 10, 13, 2, 14, 11, 12, 7, 6, 3, 18, 17, 9, 9, 10, 0, 20, 4, 4, 8, 5, 4, 19, 25, 10, 27, 1, 10, 25, 8, 24, 8, 18, 15, 10, 6, 14, 16, 13, 4, 4, 5, 17, 13, 21, 8, 7, 10, 17, 13, 12, 28, 38, 23, 19, 36, 3, 14, 14, 22, 11, 26, 17, 4, 8, 25, 35, 14, 28, 32, 29, 22, 21, 2, 23, 35, 34, 31, 34, 15, 14, 26, 6, 3, 25, 24, 31, 31, 27, 30, 14.5, 12, 9, 3, 13, 5, 6, 23, 21, 27, 7, 36, 19, 22, 15, 11, 17, 11, 26, 21, 15),
                 gms = c(3.5, 2, 0.5, 10, 3, 3, 4, 4, 2, 2, 2, 2, 2, 2, 1, 1.75, 4, 1.75, 0.33, 5, 2.5, 1.25, 1, 0.5, 3, 2, 5, 3, 3, 0.571, 1, 0.5, 2, 4, 2.5, 1.25, 1.5, 1, 2.5, 2, 1, 2, 1.5, 2, 0.2, 1, 1, 2, 14, 2, 3.5, 3, 2, 1.75, 2, 0.55, 1, 2, 6, 0.5, 0.5, 0.5, 3, 1, 2.75, 4.5, 3, 3, 3, 2, 2, 1, 2.5, 1.75, 1, 1.5, 2, 0.7, 7, 0.5, 2, 1.2, 0.4, 3, 0.8, 1.3, 1.2, 2, 1.5, 3, 2, 2, 4, 3, 1, 6, 1, 0.5, 1.5, 2.5, 1, 2.5, 1.5, 1, 1.5, 2.5, 1.5, 2.5, 10, 1.5, 1.5, 0.5, 5, 1.5),
                 numCouns = c(6, 1, 2, 6, 0, 6, 0, 0, 2, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 2, 5, 6, 0, 0, 6, 0, 6, 3, 6, 0, 0, 0, 4, 5, 0, 0, 4, 0, 4, 3, 0, 1, 2, 6, 4, 2, 4, 3, 1, 0, 2, 2, 5, 2, 0, 1, 3, 0, 3, 2, 1, 6, 0, 0, 1, 0, 1, 2, 0, 0, 5, 1, 1, 1, 5, 3, 5, 6, 6, 5, 3, 6, 2, 4, 3, 4, 6, 1, 0, 6, 4, 3, 3, 1, 5, 0, 1, 1, 6, 6, 6, 3, 3, 2, 0, 0, 5, 1, 6, 3, 0, 0))

To model it as an aggregated binomial regression we need to create a coverage variable (the max number of sessions.)

df$coverage <- 6

Now we can create the aggregated binomial regression model

aggBinMod <- glm(
             formula = cbind(numCouns, coverage - numCouns) ~ durationRegUse + gms,
                 data = df,
                 family = binomial(link = "logit"))

And here is the output

summary(aggBinMod)

#output
# Coefficients:
#                 Estimate Std. Error z value Pr(>|z|)    
# (Intercept)    -1.157570   0.183116  -6.322 2.59e-10 ***
# durationRegUse  0.035975   0.008455   4.255 2.09e-05 ***
# gms             0.075838   0.039273   1.931   0.0535 .

Now is the part I am unsure of: How to generate predictions with which to assess model accuracy. Now, as I understand it if we use the predict() function, selecting "response" as the type we get a predicted per-trial probability of drawing a 1 from a Bernoulli response scale (i.e. [0,1]).

predBin <- predict(aggBinMod, type = "response")
predBin
# (predicted bernoulli probability for first 16 participants)
# 1         2         3         4         5         6         7         8 
# 0.4480346 0.3357882 0.3425441 0.5706073 0.3611657 0.3864206 0.3138308 0.4132440 
# 9        10        11        12        13        14        15        16 
# 0.3520203 0.3602692 0.3199350 0.3121589 0.2894678 0.4113600 0.3845787 0.3315728

So, following that logic, in order to generate predictions for the number of sessions for each client from our aggregated binomial regression model, we should be able to simply multiply this value by the number of trials we wish to predict, in our case 6. So to generate the predictions we would run

predBin6 <- predict(aggBinMod, type = "response")*6
predBin6
# predicted number of sessions, out of a possible 6), for first 18 clients
# 1        2        3        4        5        6        7        8        9 
# 2.688208 2.014729 2.055265 3.423644 2.166994 2.318524 1.882985 2.479464 2.112122 
# 10       11       12       13       14       15       16       17       18 
# 2.161615 1.919610 1.872954 1.736807 2.468160 2.307472 1.989437 2.222478 2.037563 

And from there it is straightforward to assess model accuracy via the mean squared error

error <- predBin6 - df$numCouns
mse <- mean(error^2)
mse

# output
# [1] 4.871892

So my question is is this the correct way to generate predictions from an aggregated binomial regression?


Solution

  • More or less, yes.

    Instead of hard-coding the fact that there are 6 trials per observation (in some applications the number of trials differs from observation to observation), I would recommend

    predBin6 <- predict(aggBinMod, type = "response")*weights(aggBinMod)
    

    (which should give the same answer in your case).

    I would also say that MSE is reasonable, but not necessarily the best measure of predictive accuracy for a binomial model (it doesn't take the dependence of the variance on the mean into account). (I don't have a particular alternative recommendation, but the deviance (deviance(aggBinMod)) or something similar might be appropriate.)