[SOLVED] Seaborn's histplot doesn't match Matplotlib's hist when using log

Seaborn's histplot doesn't match Matplotlib's hist when using log_scale=True

I'm probably making a stupid mistake, but here's how to reproduce:

Generate random variables based on log-normal distribution
Fit a log-normal distribution to the synthetic data
Compute the probability distribution function using the fitted parameters
Plot histogram of synthetic variables overlaid with the PDF
They don't match!

import seaborn as sns
from scipy.stats import lognorm
import numpy as np

mu = 25
samples = lognorm.rvs(s=1, loc=0, scale=np.log(mu), size=10000)
shape, loc, scale = lognorm.fit(samples)
print(shape, loc, scale)

fig, ax = plt.subplots()
sns.histplot(samples, bins=50, stat="density", log_scale=True, ax=ax)
xs = np.linspace(0.1, 100, 10000)
ys = lognorm.pdf(xs, s=shape, loc=loc, scale=scale)
ax.plot(xs, ys, "r-")

Solution

I don't think the problem is with scipy.stats. Plotting with matplotlib, I see good agreement:

from scipy.stats import lognorm
import numpy as np
import matplotlib.pyplot as plt

mu = 25
sample = lognorm.rvs(s=1, loc=0, scale=np.log(mu), size=100000)
shape, loc, scale = lognorm.fit(sample)

fig, ax = plt.subplots()
ax.set_xscale('log')
bins = np.logspace(-2, 2)
ax.hist(sample, bins=bins, density=True)
xs = np.logspace(-2, 2, 1000)
ys = lognorm.pdf(xs, s=shape, loc=loc, scale=scale)
ax.plot(xs, ys, "r-")

You can also plot the histogram of the log of the sample against the corresponding normal distribution with Seaborn.

import seaborn as sns
from scipy.stats import lognorm, norm
import numpy as np

mu = 25
sample = lognorm.rvs(s=1, loc=0, scale=np.log(mu), size=100000)
shape, loc, scale = lognorm.fit(sample)

fig, ax = plt.subplots()
sns.histplot(np.log(sample), bins=50, stat="density", ax=ax)
xs = np.linspace(-2, 5, 1000)
ys = norm.pdf(xs, loc=np.log(np.log(mu)), scale=1)
ax.plot(xs, ys, "r-")

I suspect there's something about the interaction between 'density' and log_scale that is not correct, possibly in our understanding of seaborn.

Update: see https://github.com/mwaskom/seaborn/issues/3579 for an explanation of what is going on. Apparently the density normalization is performed on the log-transformed data, even though the data is displayed with the original magnitudes on a log-scaled axis.

If the shape of the seaborn histogram were preserved but the log-scaled axis were replaced by a linear-scaled axis with log-transformed labels, then the area under the curve would be 1.0.

In other words, it can be thought of as a histogram of the log-transformed data but displayed with the original data magnitudes on a log-scaled axis.

This doesn't account for the problem, but noticed that you defined mu=25 and passed scale=np.log(mu) to lognorm. Double check that this is what you mean to do against the documentation of lognorm.

Suppose a normally distributed random variable X has mean mu and standard deviation sigma. Then Y = exp(X) is lognormally distributed with s = sigma and scale = exp(mu).