I have a vector of 1096 numbers, the daily average concentration of NOx measured in 3 years in a measurement station. You can observe the type of distribution in the image:
I used these commands to do the histogram:
NOxV<-scan("NOx_Vt15-17.txt")
hist.NOxVt<-hist(NOxV, plot = FALSE, breaks = 24)
plot(hist.NOxVt, xlab = "[NOx]", ylab = "Frequenze assolute", main = "Istogramma freq. ass. NOx 15-17 Viterbo")
points(hist.NOxVt$mids, hist.NOxVt$counts, col= "red")
My professor suggested that I fit the histogram with a Poisson distribution - paying attention to the transition: discrete -> continuous (I don't know what that means)- or with a "Lognormal" distribution.
I tried to do the Poisson fit, with some command lines that she gave us at lesson, but R gave me an error after having executed the last code line of the following:
my_poisson = function(params, x){
exp(-params)*params^x/factorial(x)
}
y<-hist.NOxVt$counts/1096;
x<-hist.NOxVt$mids;
z <- nls( y ~ exp(-a)*a^x/factorial(x), start=list(a=1) )
Error in numericDeriv(form[[3L]], names(ind), env) : Missing value or an infinity produced when evaluating the model In addition: There were 50 or more warnings (use warnings() to see the first 50)"
After this problem I couldn't solve (even searching similar problems on the internet) I decided to fit the distribution with a Lognormal, but I have no idea how to do it, because the professor did not explain it to us, and I still don't have enough R experience to figure it out on my own.
There's a built-in function fitdistr
in the MASS
package that comes with R:
Generating a data example to look at (eyeballing parameters to get something similar to your picture):
set.seed(101)
z <- rlnorm(1096,meanlog=4.5,sdlog=0.8)
Fit (on statistical grounds I wouldn't recommend a Poisson fit - it might be possible to adapt a discrete distribution such as Poisson (or better, negative binomial) to fit such continuous data, but log-Normal or Gamma distributions are more natural choices.
library(MASS)
f1 <- fitdistr(z,"lognormal")
f2 <- fitdistr(z,"Gamma")
The f1
and f2
objects, when printed, give the estimated coefficients (meanlog
and sdlog
for log-Normal, shape
and rate
for Gamma) and standard errors of the coefficients.
Draw a picture (on the density scale, not the count scale): red is log-Normal, blue is Gamma (in this case log-Normal fits better because that's how I generated the "data" in the first place). [The with(as.list(coef(...))
stuff is some R fanciness to allow the use of the names of the coefficients (meanlog
, sdlog
etc.) in the subsequent R code.]
hist(z,col="gray",breaks=50,freq=FALSE)
with(as.list(coef(f1)),
curve(dlnorm(x,meanlog,sdlog),
add=TRUE,col="red",lwd=2))
with(as.list(coef(f2)),
curve(dgamma(x,shape=shape,rate=rate),
add=TRUE,col="blue",lwd=2))