I have a data set of household incomes. I want to fit a log-normal distribution to the data, then generate random numbers from that distribution. The approach below using the MASS package seems to give incorrect results: when I create a large sample from a log-normal distribution with the parameters from fitdistr, the mean and standard deviation are both much higher than in the original data.
library(MASS)
mean(hhi) # = 124,027.8
sd(hhi) # = 127,492.1
params <- fitdistr(hhi,'lognormal') # mu = 11.31322, sigma = 1.016148
random_hhi <- rlnorm(1000000, params$estimate['meanlog'], params$estimate['sdlog'])
mean(random_hhi) # returns 137,131.5
sd(random_hhi) # returns 182,139
This post, suggested in a comment on a similar question, offers a different approach based on the closed-form expressions for the mean and variance of the log-normal distribution. This approach seems to work better in my case: notice that the simulated mean and SD are much closer to those in the data.
m <- mean(hhi)
s <- sd(hhi)
location <- log(m^2 / sqrt(s^2 + m^2)) # returns 11.36772
shape <- sqrt(log(1 + (s^2 / m^2))) # returns 0.8491615
random_hhi <- rlnorm(n=1000000, location, shape)
mean(random_hhi) # returns 124,024.1
sd(random_hhi) # returns 127,291.8
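As a check, plugging location and shape back into the closed-form log-normal moments recovers the sample mean and SD exactly:
exp(location + shape^2 / 2)                          # = m = 124,027.8
exp(location + shape^2 / 2) * sqrt(exp(shape^2) - 1) # = s = 127,492.1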
So, my question is: why does the first approach fail and the second succeed? I would expect fitdistr to return the parameters mu and sigma that best fit the data, so that numbers generated from those parameters with rlnorm would follow the same distribution. What am I missing?
The two ways you obtain these parameters are not equal. MASS::fitdistr performs maximum likelihood estimation, whereas in the second case you have plugged your sample moments into theoretical identities, treating them as exact for your sample (a method-of-moments estimate).
To be clear, rlnorm (or any other stats random sampling function) returns exactly the distribution you asked for. By inverting the identities you can verify that a log-normal with meanlog = 11.313 and sdlog = 1.016 will indeed approximately produce the mean and SD you saw in the simulation (unfortunately the log-normal is quite heavily skewed, which makes the original-scale SD a poor summary measure). The issue lies in how you arrived at those distributional parameters.
A small example:
set.seed(123)
## Doesn't really follow any univariate distribution
x <- c(rnorm(15, 15, 5), rnorm(5, 150, 30))
## What is the lognormal distribution that is most likely to have produced this sample?
MASS::fitdistr(x, "lognormal")
#> logmean=3.292, logsd=1.020
## What is the mean & SD of the lognormal distribution following the observed mean & SD?
c(logmean = log(mean(x)^2 / sqrt(sd(x)^2 + mean(x)^2)),
  logsd = sqrt(log(1 + (sd(x)^2 / mean(x)^2))))
#> logmean=3.429, logsd=0.985
Again, these are both 'correct' but answers to different questions. From this it should be clear why the second approach approximates your observed parameters much more closely. The same would happen using the normal distribution:
MASS::fitdistr(x, "normal")
#> mean=50.142, sd=62.584
c(mean=mean(x), sd=sd(x))
#> mean=50.142, sd=64.210
Simulating from these sets of parameters will clearly give you different SDs (that's what you're providing as input), and the second one will reproduce the observed SD whereas the first one won't.
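To illustrate, a quick simulation sketch with the two parameter sets from the toy example (values rounded from the output above):
sim_mle <- rlnorm(1e6, 3.292, 1.020)  # maximum likelihood parameters
sim_mom <- rlnorm(1e6, 3.429, 0.985)  # moment-based parameters
c(mean = mean(sim_mle), sd = sd(sim_mle))  # matches neither mean(x) nor sd(x)
c(mean = mean(sim_mom), sd = sd(sim_mom))  # ~ 50.1 and ~ 64.2, by construction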
I cannot tell you which of these two is the one you want since you didn't provide more details, but it seems highly unlikely that you just want to reproduce your observed summary statistics exactly (you already have these). If anything, the mismatch between the observed summaries and the MLE suggests that the log-normal is not an exact fit for your data.
As an addendum, here's code that will let you reproduce what MASS::fitdistr does, i.e. how the maximum likelihood parameter estimates are calculated:
mle_normal <- function(x) {
  mu <- mean(x)
  sig <- sqrt(mean((x - mu)^2))  # note: n in the denominator, not n - 1
  c(mu = mu, sig = sig)
}
mle_lognormal <- function(x) {
  stopifnot(all(x > 0))
  # the log-normal MLE is simply the normal MLE on the log scale
  est <- mle_normal(log(x))
  c(logmu = unname(est["mu"]), logsig = unname(est["sig"]))
}
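Applied to the toy sample from earlier, this reproduces the fitdistr estimates:
mle_lognormal(x)
#> logmu=3.292, logsig=1.020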
And finally: in the normal case the only difference between the MLE and the usual summary statistics is the Bessel correction (variance with n - 1 in the denominator instead of n). In the log-normal case even that simple correspondence breaks down, because the original-scale mean and variance each depend on both log-scale parameters, so matching moments on the log scale does not match them on the original scale.
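A quick check of the Bessel relationship on the toy sample (reusing x and the helpers above):
n <- length(x)
c(sd(x) * sqrt((n - 1) / n), mle_normal(x)["sig"])            # identical
c(sd(log(x)) * sqrt((n - 1) / n), mle_lognormal(x)["logsig"]) # identical
No analogous rescaling links sd(x) to the original-scale SD implied by the log-normal MLE.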