Find the best fitted distribution for data and then generate random numbers from this distribution.
Using the gamlss
package in R, I found that the best fit distribution is "Skew exponential power (Azzalini type 1)"
:
library(gamlss)
library(gamlss.dist)
library(gamlss.add)
m1 <- fitDist(mtcars$wt, k = 2, type = "realAll", trace = FALSE, try.gamlss = TRUE)
summary(m1)
# *******************************************************************
# Family: c("SEP1", "Skew exponential power (Azzalini type 1)")
#
# Call: gamlssML(formula = y, family = DIST[i])
#
# Fitting method: "nlminb"
#
#
# Coefficient(s):
# Estimate Std. Error t value Pr(>|t|)
# eta.mu 3.440000000 0.000149516 23007.56651 < 2e-16
# eta.sigma -1.856665040 0.861826733 -2.15434 0.031214
# eta.nu 0.150728244 NA NA NA
# eta.tau -3.524272086 NA NA NA
#
# eta.mu ***
# eta.sigma *
# eta.nu
# eta.tau
# ---
# Signif. codes:
# 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Degrees of Freedom for the fit: 4 Residual Deg. of Freedom 28
# Global Deviance: -27.4747
# AIC: -19.4747
# SBC: -13.6117
# Warning message:
# In sqrt(diag(object$vcov)) : NaNs produced
But the values of sigma
and tau
are negative. When I provide these values to rSEP1()
to generate a random number, it throws the following error:
rSEP1(1, mu = 3.44, sigma = -1.8, nu = 0.15, tau = -3.5)
# Error in rSEP1(1, mu = 3.44, sigma = -1.8, nu = 0.15, tau = -3.5) :
# sigma must be positive
Are these values transformed? How can I provide correct inputs to rSEP1()
?
If you look at the link functions for the parameters, you will see:
SEP1()
#> GAMLSS Family: SEP1 Skew exponential power (Azzalini type 1)
#> Link function for mu : identity
#> Link function for sigma: log
#> Link function for nu : identity
#> Link function for tau : log
So the values for sigma and tau that fitDist
returns are the log of the numbers you would plug into rSEP1
. In other words, the correct way to run rSEP1
from your model is:
rSEP1(1, mu = 3.44, sigma = exp(-1.8), nu = 0.15, tau = exp(-3.5))
#> [1] 3.460991
To show that this is the case, let us create a set of random numbers from a defined rSEP1
where sigma = 0.5, and tau = 0.5. If we then find the best-fitting distribution, we should get results close to log(0.5) for both sigma and tau. Since log(0.5)
is about -0.69, we would expect values of about -0.69 for both these parameters:
set.seed(1)
testvec <- rSEP1(10000, mu = 3.5, sigma = 0.5, nu = 2.5, tau = 0.5)
m2 <- fitDist(testvec, k = 2, type = "realAll", trace = FALSE, try.gamlss = TRUE)
m2
#> Family: c("SEP1", "Skew exponential power (Azzalini type 1)")
#> Fitting method: "nlminb"
#>
#> Call: gamlssML(formula = y, family = DIST[i])
#>
#> Mu Coefficients:
#> [1] 3.5
#> Sigma Coefficients:
#> [1] -0.6828
#> Nu Coefficients:
#> [1] 2.421
#> Tau Coefficients:
#> [1] -0.6952
And plugging the exponents of sigma and tau into dSEP1
gives us a near-perfect fit to the density of our test vector:
d <- density(testvec)
plot(d$x, d$y)
lines(d$x, dSEP1(d$x, mu = 3.5, sigma = exp(-0.68), nu = 2.42, tau = exp(-0.69)),
col = "red", lwd = 2, lty = 2)
It's worth pointing out that the actual fit obtained for mtcars$wt
in the example is pretty terrible. The small value of sigma jyst means that most random values drawn from the distribution will be close to the mean of mtcars$wt
. There are only 32 points in the data set, which makes it very difficult to automatically fit a parametric distribution accurately.