pythonscipydistributionscipy.stats

Seaborn's histplot doesn't match Matplotlib's hist when using log_scale=True


I'm probably making a stupid mistake, but here's how to reproduce:

  1. Generate random variables based on log-normal distribution
  2. Fit a log-normal distribution to the synthetic data
  3. Compute the probability distribution function using the fitted parameters
  4. Plot histogram of synthetic variables overlaid with the PDF
  5. They don't match!
import seaborn as sns
from scipy.stats import lognorm
import numpy as np

mu = 25
samples = lognorm.rvs(s=1, loc=0, scale=np.log(mu), size=10000)
shape, loc, scale = lognorm.fit(samples)
print(shape, loc, scale)

fig, ax = plt.subplots()
sns.histplot(samples, bins=50, stat="density", log_scale=True, ax=ax)
xs = np.linspace(0.1, 100, 10000)
ys = lognorm.pdf(xs, s=shape, loc=loc, scale=scale)
ax.plot(xs, ys, "r-")

enter image description here


Solution

  • I don't think the problem is with scipy.stats. Plotting with matplotlib, I see good agreement:

    from scipy.stats import lognorm
    import numpy as np
    import matplotlib.pyplot as plt
    
    mu = 25
    sample = lognorm.rvs(s=1, loc=0, scale=np.log(mu), size=100000)
    shape, loc, scale = lognorm.fit(sample)
    
    fig, ax = plt.subplots()
    ax.set_xscale('log')
    bins = np.logspace(-2, 2)
    ax.hist(sample, bins=bins, density=True)
    xs = np.logspace(-2, 2, 1000)
    ys = lognorm.pdf(xs, s=shape, loc=loc, scale=scale)
    ax.plot(xs, ys, "r-")
    

    enter image description here

    You can also plot the histogram of the log of the sample against the corresponding normal distribution with Seaborn.

    import seaborn as sns
    from scipy.stats import lognorm, norm
    import numpy as np
    
    mu = 25
    sample = lognorm.rvs(s=1, loc=0, scale=np.log(mu), size=100000)
    shape, loc, scale = lognorm.fit(sample)
    
    fig, ax = plt.subplots()
    sns.histplot(np.log(sample), bins=50, stat="density", ax=ax)
    xs = np.linspace(-2, 5, 1000)
    ys = norm.pdf(xs, loc=np.log(np.log(mu)), scale=1)
    ax.plot(xs, ys, "r-")
    

    enter image description here

    I suspect there's something about the interaction between 'density' and log_scale that is not correct, possibly in our understanding of seaborn.

    Update: see https://github.com/mwaskom/seaborn/issues/3579 for an explanation of what is going on. Apparently the density normalization is performed on the log-transformed data, even though the data is displayed with the original magnitudes on a log-scaled axis.

    If the shape of the seaborn histogram were preserved but the log-scaled axis were replaced by a linear-scaled axis with log-transformed labels, then the area under the curve would be 1.0.

    In other words, it can be thought of as a histogram of the log-transformed data but displayed with the original data magnitudes on a log-scaled axis.


    This doesn't account for the problem, but noticed that you defined mu=25 and passed scale=np.log(mu) to lognorm. Double check that this is what you mean to do against the documentation of lognorm.

    Suppose a normally distributed random variable X has mean mu and standard deviation sigma. Then Y = exp(X) is lognormally distributed with s = sigma and scale = exp(mu).