Tags: r, distribution, data-fitting, power-law

Difference in likelihood functions for continuous vs discrete lognormal distributions in R's poweRlaw package


I'm trying to fit a lognormal distribution to some count data using Colin Gillespie's poweRlaw package in R. I'm aware that the lognormal distribution is continuous and count data is discrete; however, the package contains classes and methods for both continuous and discrete versions of the lognormal distribution.

When I estimate xmin (the threshold below which count values are disregarded) along with the log-mean and log-sd parameters, and then bootstrap the results to get a p-value, I get a vector memory exhaustion error. I found that this happens when the package-internal function sample_p_helper tries to generate random numbers from the fitted distribution. The fitted log-mean and log-sd parameters are so low that the rejection sampling algorithm has to generate literally billions of numbers to get anything above xmin, hence the memory issue.
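
To give a sense of the scale involved, here is a rough sketch; the meanlog, sdlog and xmin values below are made up purely for illustration, not the values actually fitted to my data. When almost all of the fitted distribution's mass sits below xmin, the probability of a single draw landing above xmin is tiny, and rejection sampling needs roughly 1/p draws per accepted value:

# Hypothetical parameter values chosen only to illustrate the scale of the
# problem -- not the values estimate_xmin() actually returned for my data
meanlog <- 2
sdlog   <- 0.6
xmin    <- 300

# Probability that a single draw from the fitted lognormal lands above xmin
p_tail <- plnorm(xmin - 0.5, meanlog, sdlog, lower.tail = FALSE)
p_tail      # roughly 3e-10

# Expected number of draws per accepted sample under rejection sampling
1 / p_tail  # roughly 3 billion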

Input:

library(poweRlaw)

counts <- c(54, 64, 126, 161, 162, 278, 281, 293, 296, 302, 322, 348, 418, 511, 696, 793, 1894)

dist <- dislnorm$new(counts)      # Create discrete lnorm object
dist$setXmin(estimate_xmin(dist)) # Get xmin and parameters

bs <- bootstrap_p(dist) # Run bootstrapping

Error message:

Expected total run time for 100 sims, using 1 threads is 24.6 seconds.
Error in checkForRemoteErrors(val) : 
  one node produced an error: vector memory exhausted (limit reached?)

The question then becomes why such low and poorly fitting log-mean and log-sd parameter values are being fitted in the first place.

I noticed that if I fit the continuous version of the lognormal distribution, the error does not occur and the parameter values seem more reasonable (in fact, the p value suggests the data are compatible with the lognormal distribution):

dist_cont <- conlnorm$new(counts)           # Create continuous lnorm object
dist_cont$setXmin(estimate_xmin(dist_cont)) # Get xmin and parameters

bs <- bootstrap_p(dist_cont) # Run bootstrapping

bs
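
As a side note, the two fits can be put side by side; this comparison is my own diagnostic using poweRlaw's standard plot() and lines() methods, not part of my original attempt:

dist$pars      # meanlog and sdlog from the discrete fit
dist_cont$pars # meanlog and sdlog from the continuous fit

plot(dist)                      # plot the data on log-log axes
lines(dist, col = "red")        # add the fitted discrete lognormal tail
lines(dist_cont, col = "blue")  # add the fitted continuous lognormal tail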

Looking at the source code for the package, I noticed the likelihood functions for the discrete and continuous lognormal distributions are different, specifically in the part where the joint probability is calculated.

The continuous version looks how I'd expect:

########################################################
#Log-likelihood 
########################################################
conlnorm_tail_ll = function(x, pars, xmin) {
  if(is.vector(pars)) pars = t(as.matrix(pars))
  n = length(x)
  joint_prob = colSums(apply(pars, 1, 
                             function(i) dlnorm(x, i[1], i[2], log=TRUE)))

  prob_over = apply(pars, 1, function(i) 
    plnorm(xmin, i[1], i[2], log.p=TRUE, lower.tail=FALSE))
  joint_prob - n*prob_over

}

However, in the discrete version, joint probability is calculated differently:

########################################################
#Log-likelihood 
########################################################
dis_lnorm_tail_ll = function(xv, xf, pars, xmin) {
  if(is.vector(pars)) pars = t(as.matrix(pars))
  n = sum(xf)
  p = function(par) {
    m_log = par[1]; sd_log = par[2]
    plnorm(xv-0.5, m_log, sd_log, lower.tail=FALSE) - 
      plnorm(xv+0.5, m_log, sd_log, lower.tail=FALSE)
  }
  if(length(xv) == 1L) {
    joint_prob = sum(xf * log(apply(pars, 1, p)))
  } else {
    joint_prob = colSums(xf * log(apply(pars, 1, p)))
  }
  prob_over = apply(pars, 1, function(i) 
    plnorm(xmin-0.5, i[1], i[2], 
           lower.tail = FALSE, log.p = TRUE))

  return(joint_prob - n*prob_over)
}  

There's a similar difference between the discrete and continuous implementations of the exponential distribution, but not between the discrete and continuous power law distributions. In the continuous version, joint_prob is calculated with a relatively simple call to dlnorm, but the discrete version calls plnorm instead. Further, it calls plnorm twice, first on the observed data values minus 0.5 and then on the observed values plus 0.5, and subtracts the latter from the former.
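
To make the difference concrete, here is what the two approaches compute for a single observed count, using arbitrary parameter values (not my fitted ones):

# Arbitrary illustrative parameter values (not the ones fitted to my data)
meanlog <- 5
sdlog   <- 1
k <- 54  # one of the observed counts

# Continuous version: density at the observed value
dlnorm(k, meanlog, sdlog)

# Discrete version: survival-function difference across (k - 0.5, k + 0.5]
plnorm(k - 0.5, meanlog, sdlog, lower.tail = FALSE) -
  plnorm(k + 0.5, meanlog, sdlog, lower.tail = FALSE)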

So, at last, my questions:

    1. Why do the discrete likelihood functions calculate the joint probability from differences of plnorm calls at ±0.5 around each value, rather than from dlnorm as in the continuous version?

    2. Is it "safe" to use the continuous lognormal fit for this count data instead?

    3. Why does the discrete fit produce such extreme parameter estimates, to the point that bootstrapping exhausts memory?

    4. Is my data simply the problem?

Thanks for bearing with me through this lengthy post. I'm aware this is as much a statistics question as a coding question. Any helpful nudges in the right direction are very much appreciated! Cheers.


Solution

    1. dlnorm() gives the probability density value. Remember densities integrate to one but don't sum to one. So to work out the discrete distribution we take the CDF values either side of an integer and difference them. There'll be a normalising constant as well (see the sketch after this list). For the CTN (continuous) case, the likelihood is just a product of dlnorm() values, which is easier and faster.

    2. "Safe" is a hard word to define. For this data, the CTN and discrete give visually the same fit. But neither fit well.

    3. The estimated parameter values for the discrete distribution give a truncated lognormal in the very extreme tail. Simulating data in that region is challenging.

    4. Yep, your data is the problem. But that's also the issue when the model doesn't work ;)
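
To make point 1 concrete, here is a minimal sketch with illustrative parameter values (not the values fitted to the data above): the ±0.5 CDF differences assign a genuine probability to each integer, and dividing by P(X ≥ xmin) is the normalising constant that makes the truncated tail a proper distribution as well.

# Illustrative parameter values only -- not the values fitted to the data above
meanlog <- 5; sdlog <- 1; xmin <- 300

# The +/- 0.5 CDF differences assign a probability to each integer k,
# and those probabilities sum to (essentially) one:
k <- 1:100000
pmf <- plnorm(k - 0.5, meanlog, sdlog, lower.tail = FALSE) -
  plnorm(k + 0.5, meanlog, sdlog, lower.tail = FALSE)
sum(pmf)

# The tail likelihood renormalises by P(X >= xmin), so the truncated
# pmf above xmin is again a proper distribution:
k_tail <- xmin:100000
pmf_tail <- (plnorm(k_tail - 0.5, meanlog, sdlog, lower.tail = FALSE) -
  plnorm(k_tail + 0.5, meanlog, sdlog, lower.tail = FALSE)) /
  plnorm(xmin - 0.5, meanlog, sdlog, lower.tail = FALSE)
sum(pmf_tail)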