TL;DR: How to plot the result of np.histogram(..., density=True)
correctly with Numpy?
Using density=True
should help to match the histogram of the sample, and the density function of the underlying random variable, but it doesn't:
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
y = np.random.randn(10000)
h, bins = np.histogram(y, bins=1000, density=True)
plt.bar(bins[:-1], h)
x = np.linspace(-10, 10, 100)
f = scipy.stats.norm.pdf(x)
plt.plot(x, f, color="green")
plt.show()
Why aren't the histogram and probability density functions scaled accordingly?
In this case, an observation shows that a 1.6 scaling would be better:
plt.plot(x, 1.6 * f, color="green")
Also, this works normally:
plt.hist(y, bins=100, density=True)
Why?
.bar()
is too big (.hist()
adjusts width internally).In Axes.hist
, bar widths are computed by np.diff(bins)
(source code). Since it allows a multi-dimensional array, there are a whole lot of validation and reshaping done behind the scenes but if we set all that aside, for a 1D array, Axes.hist
is just a wrapper around np.histogram
and Axes.bar
whose (abridged) code looks like the following:
height, bins = np.histogram(x, bins)
width = np.diff(bins)
boffset = 0.5 * width
Axes.bar(bins[:-1]+boffset, height, width)
On the other hand, Axes.bar
iteratively adds matplotlib.patches.Rectangle
objects to an Axes using a default width of 0.8, (source implementation), so if a bar is particularly tall and subsequent bars are short, the short ones will be hidden behind the tall ones.
The following code (kind of) illustrates the points made above. The histograms are the same; for example the tallest bars are the same. Note that, in the figure below the figsize is 12"x5" and each Axes is roughly 3" wide), so given that the default dpi is 100, it can only show roughly 300 dots horizontally which means it cannot show all 1000 bars correctly. We need an appropriately wide figure to show all bars correctly.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# sample
y = np.random.default_rng(0).standard_normal(10000)
N = 1000
h, bins = np.histogram(y, bins=N, density=True)
x = np.linspace(-5, 5, 100)
f = stats.norm.pdf(x)
# figure
fig, axs = plt.subplots(1, 3, figsize=(12,5))
a0 = axs[0].bar(bins[:-1], h)
a1 = axs[1].bar(bins[:-1]+0.5*np.diff(bins), h, np.diff(bins))
h_hist, bins_hist, a2 = axs[2].hist(y, bins=N, density=True)
for a, t in zip(axs, ['Axes.bar with default width', 'Axes.bar with width\nrelative to bins', 'Axes.hist']):
a.plot(x, f, color="green")
a.set_title(t)
# label tallest bar of each Axes
axs[0].bar_label(a0, [f'{h.max():.2f}' if b.get_height() == h.max() else '' for b in a0], fontsize=7)
axs[1].bar_label(a1, [f'{h.max():.2f}' if b.get_height() == h.max() else '' for b in a1], fontsize=7)
axs[2].bar_label(a2, [f'{h_hist.max():.2f}' if b.get_height() == h_hist.max() else '' for b in a2], fontsize=7)
# the bin edges and heights from np.histogram and Axes.hist are the same
(h == h_hist).all() and (bins == bins_hist).all() # True
For example, if we plot the same figure with figsize=(60,5)
(everything else the same), we get the following figure where the bars are correctly shown (in particular Axes.hist
and Axes.bar
with adjusted widths are the same).