rpredictgamlss

r gamlss: predicting standard deviation and calculating z-scores


I want to estimate predicted values for the mean (mu) and standard deviation (sigma) based on a gamlss model. However, it is not clear to me how to extract a standard deviation for given values of x

The data frame I am using looks like this:

#> head(abdom)
#  y     x
# 59 12.29
# 64 12.29
# 56 12.29

Here is the code to fit a gamlss model:

library(gamlss)
fit = gamlss(y ~ cs(x), sigma.formula = ~ cs(x), data = abdom, family = BCPE)

I want to calculate z-scores based on this model using the following approach: z = (y - mu)/sigma . Therefore, I use this code to calculate mu and sigma for each value of y and calculate the z scores. 95% of the z scores should lie between -2 and 2.

using predict function

mu = predict(fit, newdata = abdom, type = "response", what = "mu")
si = predict(fit, newdata = abdom, type = "response", what = "sigma")
z_score1 = (abdom$y - mu) / si
hist(z_score1)

using centiles.pred function

z_score2 = centiles.pred(fit, xname = "x", xvalues = abdom$x, yval = abdom$y, type = "z-scores")
hist(z_score2)

This leads to the following plots:

enter image description here

for z_score1, most scores are not even close to lie between -2 and 2.

Another way to approach this is by plotting the mean and standard deviation:

# calculating mu +/- 2*sigma
pred_dat = data.frame(x = 10:45)
mu = predict(fit, newdata = pred_dat, type = "response", what = "mu")
si = predict(fit, newdata = pred_dat, type = "response", what = "sigma")
hi = mu + (2 * si)
lo = mu - (2 * si)

pred_dat$mu = mu
pred_dat$hi = hi
pred_dat$lo = lo

# plotting
ggplot(data = pred_dat, aes(x = x)) +
  geom_point(data = abdom, aes(x = x, y = y)) +
  geom_line(aes(y = mu), colour = "red") +
  geom_line(aes(y = hi), colour = "blue") +
  geom_line(aes(y = lo), colour = "blue")

yielding the following plot:

enter image description here

Again, 95% of the values should lie between the two blue lines (hi and lo). But the values of the standard deviations are so low that there seems to be only one line.

So my questions are: first question: what do the values derived from predict represent if not the standard deviation conditional to x?

second question: how can I predict the standard deviation for a given x-value?


Solution

  • The gamlss package provides distribution functions for the BCPE distribution, including qBCPE. If you plug the coefficients from your model into this function at pnorm(1), then you will get the predicted value of y at 1 standard deviation above the predicted mean. Since you can get the predicted mean with predict(fit), then you can easily get the standard deviation. The difficult part is getting the parameters from your model into qBCPE. Here's a reprex:

    library(gamlss)
    library(ggplot2)
    
    fit <- gamlss(y ~ cs(x), sigma.formula = ~ cs(x), data = abdom, family = BCPE)
    
    Q <- qBCPE(pnorm(1), 
               mu    = predict(fit),
               sigma = exp(fit$sigma.coefficients[1] + 
                       fit$sigma.coefficients[2] * cs(abdom$x)),
               nu =    fit$nu.coefficients, 
               tau =   exp(fit$tau.coefficients)) 
    
    SD <- c(Q - predict(fit))
    

    Here, SD gives the vector of standard deviations at each value of x:

    head(SD)
    #> [1] 4.092467 4.092467 4.092467 4.203738 4.425361 4.425361
    

    To show this is correct, let's plot 1.96 standard deviations on either side of the prediction line:

    ggplot(data = data.frame(x = abdom$x, y = predict(fit),
                             upper = predict(fit) + 1.96 * SD,
                             lower = predict(fit) - 1.96 * SD), aes(x, y)) +
      geom_point(data = abdom) +
      geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.3) +
      geom_line(color = "blue", linewidth = 1)
    

    This looks good. Let's confirm that about 5% of observations lie outside 1.96 standard deviations of the mean:

    (sum(abdom$y > predict(fit) + 1.96 * SD) +
      sum(abdom$y < predict(fit) - 1.96 * SD)) / nrow(abdom)
    #> [1] 0.0557377
    

    And let's show that the calculated Z scores follow a standard normal distribution:

    Z <- (abdom$y - predict(fit))/SD
    hist(Z, breaks = 20, freq = FALSE)
    lines(seq(-4, 4, 0.1), dnorm(seq(-4, 4, 0.1)))
    

    enter image description here

    This looks pretty good.

    Created on 2023-01-09 with reprex v2.0.2