I'm trying to check whether my data follows a given distribution, but I keep getting NLLF values that don't make sense.
For example, here I generate 10K data points from a normal distribution and compute the NLLF under the distribution they were generated from. Then I fit a log-normal distribution to the data and compute the NLLF under that fit.
import scipy.stats

mu, sigma = 0, 1
X = scipy.stats.norm(loc=mu, scale=sigma)
x = X.rvs(size=10000)
print('Norm NLLF:', scipy.stats.norm.nnlf((mu, sigma), x))
params = scipy.stats.lognorm.fit(x)
print('LogNorm NLLF:', scipy.stats.lognorm.nnlf(params, x))
The result is:
Norm NLLF: 14369.799291446736
LogNorm NLLF: 14366.683866496474
Which is weird: LogNorm has a lower NLLF loss than Norm, though the two values are very close to each other.
There are three factors causing this.
First, lognorm doesn't always beat norm here. When I reran the experiment, it won only about 50% of the time.
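You can check this yourself with a quick experiment; the trial count, sample size, and seed below are arbitrary choices of mine, not anything from the original comparison:

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)
trials, n = 20, 2000
wins = 0
for _ in range(trials):
    x = scipy.stats.norm(loc=0, scale=1).rvs(size=n, random_state=rng)
    # norm is evaluated with the theoretical parameters, as in the question...
    nllf_norm = scipy.stats.norm.nnlf((0, 1), x)
    # ...while lognorm is fitted to this particular sample.
    nllf_lognorm = scipy.stats.lognorm.nnlf(scipy.stats.lognorm.fit(x), x)
    wins += nllf_lognorm < nllf_norm
print(f'lognorm won {wins}/{trials} trials')
```

With different seeds the win count moves around, but it hovers near half the trials rather than all of them.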
Second, the mean and standard deviation of 10K random numbers differ slightly from the theoretical values, so lognorm has an advantage: it is fitted to the actual sample, while norm is evaluated with the theoretical parameters of the generating distribution.
To make an apples-to-apples comparison, fit both norm and lognorm to the same data:
import scipy.stats

mu, sigma = 0, 1
X = scipy.stats.norm(loc=mu, scale=sigma)
x = X.rvs(size=10000)
params_norm = scipy.stats.norm.fit(x)
print('Norm NLLF:', scipy.stats.norm.nnlf(params_norm, x))
params = scipy.stats.lognorm.fit(x)
print('LogNorm NLLF:', scipy.stats.lognorm.nnlf(params, x))
print('Norm params', params_norm)
Run this a few times and you'll see two things: the fitted norm parameters are close to, but not exactly, (0, 1), and the two NLLF values are now nearly identical.
Third, for some values of s, lognorm is extremely similar to norm.
Here is a plot of lognorm's PDF for various values of s:
[Figure: lognorm PDFs with identical parameter μ but differing parameters σ. Image credit: Wikipedia]
Note: Wikipedia calls this variable σ, but the SciPy docs call it s.
The thing to notice is that as s gets closer and closer to zero, three things happen: the distribution looks more and more like the normal distribution, the peak shifts to the right, and the peak gets narrower. However, all SciPy distributions can be shifted and scaled using the loc and scale parameters, which lets lognorm compensate for the last two effects of a small s.
The result is a probability distribution that nearly exactly matches the normal distribution.
import scipy.stats
import matplotlib.pyplot as plt
import numpy as np

# Draw a normal sample and fit a lognormal distribution to it.
mu, sigma = 0, 1
norm_dist = scipy.stats.norm(loc=mu, scale=sigma)
x = norm_dist.rvs(size=5000)
params = scipy.stats.lognorm.fit(x)
lognorm_dist = scipy.stats.lognorm(*params)

# Compare the two PDFs on a common grid.
grid = np.linspace(-2, 2, 100)
plt.plot(grid, norm_dist.pdf(grid), label='norm')
plt.plot(grid, lognorm_dist.pdf(grid), label='lognorm')
plt.legend()
plt.ylim(bottom=0)
plt.xlabel('x')
plt.ylabel('pdf')
plt.show()
Output: a plot in which the norm and lognorm curves are nearly indistinguishable.
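The limiting behaviour can also be checked numerically. The parameter values below are my own choice: setting scale = 1/s and loc = -1/s makes a lognorm variable equal to (1/s)·exp(s·Z) − 1/s ≈ Z for small s, where Z is standard normal, so the PDF should approach the standard normal PDF:

```python
import numpy as np
import scipy.stats

s = 0.01
# With scale = 1/s and loc = -1/s, the lognorm distribution approaches
# the standard normal distribution as s -> 0.
lognorm_dist = scipy.stats.lognorm(s, loc=-1/s, scale=1/s)
norm_dist = scipy.stats.norm(0, 1)

grid = np.linspace(-3, 3, 601)
max_diff = np.max(np.abs(lognorm_dist.pdf(grid) - norm_dist.pdf(grid)))
print(max_diff)  # on the order of 1e-3
```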
In summary, there are three factors: lognorm only beats norm about half the time; lognorm was fitted to the sample while norm was evaluated with theoretical parameters; and for small values of s, a shifted and scaled lognorm is nearly identical to norm.
In conclusion, in some cases it's going to be really hard to tell whether your data follows a normal distribution, or a lognormal distribution with a normal-like shape.
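One way to break the tie, which is my suggestion rather than anything built into SciPy, is an information criterion that penalizes lognorm's extra shape parameter. Using AIC = 2k + 2·NLLF, where k is the number of fitted parameters (2 for norm, 3 for lognorm):

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(2)
x = scipy.stats.norm(loc=0, scale=1).rvs(size=10000, random_state=rng)

params_norm = scipy.stats.norm.fit(x)        # 2 parameters: loc, scale
params_lognorm = scipy.stats.lognorm.fit(x)  # 3 parameters: s, loc, scale

# AIC = 2k + 2*NLLF; the extra shape parameter costs lognorm 2 points,
# so lognorm must improve the NLLF by more than 1 to win.
aic_norm = 2 * 2 + 2 * scipy.stats.norm.nnlf(params_norm, x)
aic_lognorm = 2 * 3 + 2 * scipy.stats.lognorm.nnlf(params_lognorm, x)
print('AIC norm:', aic_norm)
print('AIC lognorm:', aic_lognorm)
```

When the two NLLF values are this close, the penalty usually tips the comparison toward the simpler normal model.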