rrandomdistributioncurve-fittingpower-law

Drawing random numbers from a power law distribution in R


I am using the R package "poweRlaw" to estimate and subsequently draw from discrete power law distributions, however the distribution drawn from the fit does not seem to match the data. To illustrate, consider this example from a guide for this package: https://cran.r-project.org/web/packages/poweRlaw/vignettes/b_powerlaw_examples.pdf. Here we first download an example dataset from the package and then fit a discrete power law.

library("poweRlaw")
data("moby", package = "poweRlaw")

m_pl = displ$new(moby)
est = estimate_xmin(m_pl)
m_pl$setXmin(est)

The fit looks like a good one, as we can't discard the hypothesis that this data is drawn from a power distribution (p-value > 0.05):

bs = bootstrap_p(m_pl, threads = 8)
bs$p

However, when we draw from this distribution using the built in function dist_rand(), the resulting distribution is shifted to the right of the original distribution:

set.seed(1)
randNum = dist_rand(m_pl, n = length(moby))

plot(density(moby), xlim = c(0, 1000), ylim = c(0, 1), xlab = "", ylab = "", main = "")
par(new=TRUE)
plot(density(randNum), xlim = c(0, 1000), ylim = c(0, 1), col = "red", xlab = "x", ylab = "Density", main = "")

Resulting Plot, Red = Distribution Drawn from Fit

I am probably misunderstanding what it means to draw from a power distribution, but does this happen because we only fit the tail of the experimental distribution (so we draw after the parameter Xmin)? If something like this is happening, is there any way I can compensate for this fact so that the fitted distribution resembles the experimental distribution?


Solution

  • So there's a few things going on here.

    1. As you hinted at in your question, if you want to compare distributions, you need to truncate moby, so moby = moby[moby >= m_pl$getXmin()]

    2. Using density() is a bit fraught. This is a kernel density smoother, that draws Normal distributions over discrete points. As the powerlaw has a very long tail, this is suspect

    3. Comparing the tails of two powerlaw distributions is tricky (simulate some data and see).

    Anyway, if you run

    set.seed(1)
    x = dist_rand(m_pl, n = length(moby))
    # Cut off the tail for visualisation
    moby = moby[moby >= m_pl$getXmin() & moby < 100]
    plot(density(moby),  log = "xy")
    x = x[ x < 100]
    lines(density(x),  col = 2)
    

    Gives something fairly similar.