I want to estimate predicted values for the mean (mu) and standard deviation (sigma) based on a gamlss model. However, it is not clear to me how to extract a standard deviation for given values of x
The data frame I am using looks like this:
#> head(abdom)
# y x
# 59 12.29
# 64 12.29
# 56 12.29
Here is the code to fit a gamlss model:
library(gamlss)
fit = gamlss(y ~ cs(x), sigma.formula = ~ cs(x), data = abdom, family = BCPE)
I want to calculate z-scores based on this model using the following approach: z = (y - mu)/sigma . Therefore, I use this code to calculate mu and sigma for each value of y and calculate the z scores. 95% of the z scores should lie between -2 and 2.
using predict function
mu = predict(fit, newdata = abdom, type = "response", what = "mu")
si = predict(fit, newdata = abdom, type = "response", what = "sigma")
z_score1 = (abdom$y - mu) / si
hist(z_score1)
using centiles.pred function
z_score2 = centiles.pred(fit, xname = "x", xvalues = abdom$x, yval = abdom$y, type = "z-scores")
hist(z_score2)
This leads to the following plots:
for z_score1, most scores are not even close to lie between -2 and 2.
Another way to approach this is by plotting the mean and standard deviation:
# calculating mu +/- 2*sigma
pred_dat = data.frame(x = 10:45)
mu = predict(fit, newdata = pred_dat, type = "response", what = "mu")
si = predict(fit, newdata = pred_dat, type = "response", what = "sigma")
hi = mu + (2 * si)
lo = mu - (2 * si)
pred_dat$mu = mu
pred_dat$hi = hi
pred_dat$lo = lo
# plotting
ggplot(data = pred_dat, aes(x = x)) +
geom_point(data = abdom, aes(x = x, y = y)) +
geom_line(aes(y = mu), colour = "red") +
geom_line(aes(y = hi), colour = "blue") +
geom_line(aes(y = lo), colour = "blue")
yielding the following plot:
Again, 95% of the values should lie between the two blue lines (hi and lo). But the values of the standard deviations are so low that there seems to be only one line.
So my questions are: first question: what do the values derived from predict represent if not the standard deviation conditional to x?
second question: how can I predict the standard deviation for a given x-value?
The gamlss package provides distribution functions for the BCPE
distribution, including qBCPE
. If you plug the coefficients from your model into this function at pnorm(1)
, then you will get the predicted value of y
at 1 standard deviation above the predicted mean. Since you can get the predicted mean with predict(fit)
, then you can easily get the standard deviation. The difficult part is getting the parameters from your model into qBCPE
. Here's a reprex:
library(gamlss)
library(ggplot2)
fit <- gamlss(y ~ cs(x), sigma.formula = ~ cs(x), data = abdom, family = BCPE)
Q <- qBCPE(pnorm(1),
mu = predict(fit),
sigma = exp(fit$sigma.coefficients[1] +
fit$sigma.coefficients[2] * cs(abdom$x)),
nu = fit$nu.coefficients,
tau = exp(fit$tau.coefficients))
SD <- c(Q - predict(fit))
Here, SD
gives the vector of standard deviations at each value of x
:
head(SD)
#> [1] 4.092467 4.092467 4.092467 4.203738 4.425361 4.425361
To show this is correct, let's plot 1.96 standard deviations on either side of the prediction line:
ggplot(data = data.frame(x = abdom$x, y = predict(fit),
upper = predict(fit) + 1.96 * SD,
lower = predict(fit) - 1.96 * SD), aes(x, y)) +
geom_point(data = abdom) +
geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.3) +
geom_line(color = "blue", linewidth = 1)
This looks good. Let's confirm that about 5% of observations lie outside 1.96 standard deviations of the mean:
(sum(abdom$y > predict(fit) + 1.96 * SD) +
sum(abdom$y < predict(fit) - 1.96 * SD)) / nrow(abdom)
#> [1] 0.0557377
And let's show that the calculated Z scores follow a standard normal distribution:
Z <- (abdom$y - predict(fit))/SD
hist(Z, breaks = 20, freq = FALSE)
lines(seq(-4, 4, 0.1), dnorm(seq(-4, 4, 0.1)))
This looks pretty good.
Created on 2023-01-09 with reprex v2.0.2