r statistics normal-distribution poisson model-fitting

Fitting a lognormal or poisson distribution

I have a vector of 1096 numbers, the daily average concentration of NOx measured in 3 years in a measurement station. You can observe the type of distribution in the image:

Histogram of NOx concentration

I used these commands to do the histogram:

NOxV<-scan("NOx_Vt15-17.txt")
hist.NOxVt<-hist(NOxV, plot = FALSE, breaks = 24) 
plot(hist.NOxVt, xlab = "[NOx]", ylab = "Frequenze assolute", main = "Istogramma freq. ass. NOx 15-17 Viterbo")
points(hist.NOxVt$mids, hist.NOxVt$counts, col= "red")

My professor suggested that I fit the histogram with a Poisson distribution - paying attention to the transition: discrete -> continuous (I don't know what that means)- or with a "Lognormal" distribution.

I tried to do the Poisson fit, with some command lines that she gave us at lesson, but R gave me an error after having executed the last code line of the following:

  my_poisson = function(params, x){
      exp(-params)*params^x/factorial(x)
  }
    
  y<-hist.NOxVt$counts/1096;
  x<-hist.NOxVt$mids;
  z <- nls( y ~ exp(-a)*a^x/factorial(x), start=list(a=1) )

Error in numericDeriv(form[[3L]], names(ind), env) : Missing value or an infinity produced when evaluating the model In addition: There were 50 or more warnings (use warnings() to see the first 50)"

After this problem I couldn't solve (even searching similar problems on the internet) I decided to fit the distribution with a Lognormal, but I have no idea how to do it, because the professor did not explain it to us, and I still don't have enough R experience to figure it out on my own.

Solution

There's a built-in function fitdistr in the MASS package that comes with R:

Generating a data example to look at (eyeballing parameters to get something similar to your picture):

set.seed(101)
z <- rlnorm(1096,meanlog=4.5,sdlog=0.8)

Fit (on statistical grounds I wouldn't recommend a Poisson fit - it might be possible to adapt a discrete distribution such as Poisson (or better, negative binomial) to fit such continuous data, but log-Normal or Gamma distributions are more natural choices.

library(MASS)
f1 <- fitdistr(z,"lognormal")
f2 <- fitdistr(z,"Gamma")

The f1 and f2 objects, when printed, give the estimated coefficients (meanlog and sdlog for log-Normal, shape and rate for Gamma) and standard errors of the coefficients.

Draw a picture (on the density scale, not the count scale): red is log-Normal, blue is Gamma (in this case log-Normal fits better because that's how I generated the "data" in the first place). [The with(as.list(coef(...)) stuff is some R fanciness to allow the use of the names of the coefficients (meanlog, sdlog etc.) in the subsequent R code.]

hist(z,col="gray",breaks=50,freq=FALSE)
with(as.list(coef(f1)),
     curve(dlnorm(x,meanlog,sdlog),
           add=TRUE,col="red",lwd=2))
with(as.list(coef(f2)),
     curve(dgamma(x,shape=shape,rate=rate),
           add=TRUE,col="blue",lwd=2))