Tags: python, scipy, distribution

Scipy NLLF Loss has high values for distribution fitting


I'm trying to check whether my data follows some distribution, but I always get NLLF values that don't make sense.

For example, here I generate 10K data points from a normal distribution and compute the NLLF against the distribution they were generated from. Then I fit a log-normal distribution to the data and compute its NLLF.

import scipy.stats

mu, sigma = 0, 1
X = scipy.stats.norm(loc=mu, scale=sigma)
x = X.rvs(size=10000)

# NLLF of the true generating distribution
print('Norm NLLF:', scipy.stats.norm.nnlf((mu, sigma), x))

# Fit a log-normal to the same sample and compute its NLLF
params = scipy.stats.lognorm.fit(x)
print('LogNorm NLLF:', scipy.stats.lognorm.nnlf(params, x))

The result is:

Norm NLLF: 14369.799291446736
LogNorm NLLF: 14366.683866496474

This is weird: LogNorm gets a lower NLLF than Norm, although the two values are very close to each other.


Solution

  • There are three factors causing this.


    First, lognorm doesn't always beat norm here. In my runs it beat norm roughly 50% of the time.
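
    You can check this claim yourself by repeating the experiment and counting wins. A minimal sketch, assuming the same setup as the question; the trial count and seeding are my own arbitrary choices:

    import scipy.stats

    n_trials = 200
    lognorm_wins = 0
    for seed in range(n_trials):
        x = scipy.stats.norm.rvs(loc=0, scale=1, size=10000, random_state=seed)
        # norm is evaluated at the true parameters, lognorm at its fitted parameters
        nllf_norm = scipy.stats.norm.nnlf((0, 1), x)
        nllf_lognorm = scipy.stats.lognorm.nnlf(scipy.stats.lognorm.fit(x), x)
        lognorm_wins += nllf_lognorm < nllf_norm

    print(f'lognorm had the lower NLLF in {lognorm_wins} of {n_trials} trials')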


    Second, the sample mean and standard deviation of 10K random numbers will differ slightly from the theoretical values, so lognorm has an advantage: it is fit to the actual sample, while norm is evaluated at the theoretical parameters of the generating distribution.

    To make an apples-to-apples comparison, I suggest fitting norm and lognorm at the same time.

    import scipy.stats

    mu, sigma = 0, 1
    X = scipy.stats.norm(loc=mu, scale=sigma)
    x = X.rvs(size=10000)

    # Fit both distributions to the same sample
    params_norm = scipy.stats.norm.fit(x)
    print('Norm NLLF:', scipy.stats.norm.nnlf(params_norm, x))

    params = scipy.stats.lognorm.fit(x)
    print('LogNorm NLLF:', scipy.stats.lognorm.nnlf(params, x))
    print('Norm params:', params_norm)


    You'll see two things from this:

    1. The fitted mu and sigma are not exactly 0 and 1; they vary a little because of sampling randomness.
    2. Fitting norm significantly reduces the number of cases where lognorm wins. I found that it got a lower NLLF only about 20% of the time (see the sketch after this list).
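
    For completeness, here is the same counting sketch with norm fitted as well; this variant should show the drop in lognorm's win rate described above (trial count and seeding are again arbitrary choices of mine):

    import scipy.stats

    n_trials = 200
    wins = sum(
        scipy.stats.lognorm.nnlf(scipy.stats.lognorm.fit(x), x)
        < scipy.stats.norm.nnlf(scipy.stats.norm.fit(x), x)
        for x in (scipy.stats.norm.rvs(size=10000, random_state=i)
                  for i in range(n_trials))
    )
    print(f'With both distributions fitted, lognorm won {wins} of {n_trials} trials')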

    Third, for some values of s, lognorm is extremely similar to norm.

    Here is a plot of the lognorm's PDF for various values of s.

    [Plot: lognormal PDFs with identical parameter μ but differing parameters σ. Image credit: Wikipedia]

    Note: Wikipedia calls this variable σ, but the SciPy docs call it s.

    The thing to notice is that as s approaches zero, three things happen: the distribution looks more and more like the normal distribution, the peak shifts to the right, and the peak gets narrower. However, all SciPy distributions can be shifted and scaled using the loc and scale parameters, which lets lognorm undo the last two effects of a small s.
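
    To see the small-s limit concretely: if Z is standard normal, then X = (exp(s·Z) - 1)/s tends to Z as s goes to 0. In SciPy terms that is lognorm(s, loc=-1/s, scale=1/s); this parameter choice is my illustration of the limit, not something SciPy computes for you.

    import numpy as np
    import scipy.stats

    s = 0.01
    approx = scipy.stats.lognorm(s, loc=-1/s, scale=1/s)
    exact = scipy.stats.norm()

    # The two PDFs should agree to roughly 1e-3 over [-3, 3]
    xs = np.linspace(-3, 3, 601)
    print(np.max(np.abs(approx.pdf(xs) - exact.pdf(xs))))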

    The result is a probability distribution that nearly exactly matches the normal distribution.

    import scipy.stats
    import matplotlib.pyplot as plt
    import numpy as np

    mu, sigma = 0, 1
    norm_dist = scipy.stats.norm(loc=mu, scale=sigma)

    # Fit lognorm to a sample drawn from the normal distribution
    sample = norm_dist.rvs(size=5000)
    params = scipy.stats.lognorm.fit(sample)
    lognorm_dist = scipy.stats.lognorm(*params)

    # Plot both PDFs on the same axes
    x = np.linspace(-2, 2, 100)
    plt.plot(x, norm_dist.pdf(x), label='norm')
    plt.plot(x, lognorm_dist.pdf(x), label='lognorm')
    plt.legend()
    plt.ylim(bottom=0)
    plt.xlabel('x')
    plt.ylabel('pdf')
    plt.show()


    Output:

    [Plot: the fitted lognorm PDF almost exactly overlapping the norm PDF]

    In summary, there are three factors:

    1. Random chance.
    2. The sample mean/standard deviation and distribution mean/standard deviation are not always the same.
    3. Lognormal is very normal-like for some values of s.

    In conclusion, in some cases it's going to be really hard to tell whether your data follows a normal distribution or a lognormal distribution with a normal-like shape.