python histogram lets-plot

What is '..density..' doing in a Lets-Plot histogram?


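My understanding of a density histogram is that the value associated with bin i is d_i = f_i / (n * w), where f_i is the frequency (count) in bin i, n is the total number of observations, and w is the bin width.

For this test case, the frequency and probability histograms are as I expect, but the density histogram isn't. Bin 1, for example, has a frequency of 3. Given n = 20 and w = .2 (the binwidth used in the plots), I would expect a density of 3 / (20 * .2) = .75 (which np.histogram calculates), but the plot indicates .84.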
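
As a quick check of that expectation, here is a minimal sketch of the arithmetic in plain Python. The helper name `expected_density` is made up for illustration, and w is taken to be the `binwidth=.2` passed to `geom_histogram`.

```python
def expected_density(freq, n, width):
    """Density of one bin: its count divided by (total count * bin width)."""
    return freq / (n * width)

# Bin 1 of the test case: frequency 3, n = 20 observations, bin width 0.2
bin_1 = expected_density(3, 20, 0.2)
print(round(bin_1, 4))
```

With this definition the densities times the bin width sum to 1 over all bins, which is the property np.histogram(density=True) guarantees.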

Is my understanding of density faulty, am I implementing it incorrectly, or is something broken?

This code demonstrates the problem. Lets-Plot is version 4.3.3

import numpy as np
import pandas as pd
from lets_plot import *
LetsPlot.setup_html()

rng = np.random.default_rng(12345)

test = rng.normal(0, 1, (20, 2))
df = pd.DataFrame(test, columns=['A', 'B'])

plot_1 = (
    ggplot(df, aes(x='A')) +
    geom_histogram(binwidth=.2, fill='pink') +
    ggtitle('Frequency')
)

plot_2 = (
    ggplot(df, aes(x='A')) +
    geom_histogram(aes(y='..density..'),binwidth=.2, fill='pink') +
    ggtitle('Density')
)

plot_3 = (
    ggplot(df, aes(x='A')) +
    geom_histogram(aes(weight=np.ones_like(df.A) / len(df.A)),binwidth=.2, fill='pink') +
    ggtitle('Probability')
)

# to calculate the same distribution using np.histogram, force an appropriate set of bins:
bins = np.arange(-1.56, 2.5, .2)
freq, edges = np.histogram(df.A,  bins=bins)
print('Frequency by bin calculated by np.histogram:')
print(f"{'Bin':^5s}{'Freq':^5s}")
for i,f in enumerate(freq):
    print(f'{i+1:^5d}{f:^5d}')
print('')
    
plot_1.show()
plot_3.show()

dens, edges = np.histogram(df.A, density=True, bins=bins)
print('Density by bin calculated by np.histogram:')
print(f"{'Bin':^5s}{'Density':^9s}")
for i,d in enumerate(dens):
    print(f'{i+1:^5d}{d:^9.2f}')
print('')
plot_2.show()

[Figures: frequency calculated by numpy; Lets-Plot output (frequency, probability); numpy density; Lets-Plot density]


Solution

  • It looks like aes(y='..density..') normalises the counts by the area under the histogram as estimated with the trapezium rule over the bin centres, rather than by the sum of the counts multiplied by the bin width. The trapezium rule slightly underestimates that area (3.6 here instead of 20 * 0.2 = 4), so the resulting densities come out slightly larger. For example, if you do:

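    counts, edges = np.histogram(df.A, density=False, bins=bins)
    # edges[:-1] + 0.1 gives the bin centres (the bin width is 0.2);
    # on NumPy >= 2.0, np.trapz is deprecated in favour of np.trapezoid
    print(counts / np.trapz(counts, edges[:-1] + 0.1))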
    [0.83333333 0.55555556 0.         0.55555556 0.27777778 0.27777778
     0.         0.55555556 0.         0.27777778 0.27777778 0.83333333
     0.27777778 0.         0.27777778 0.         0.27777778 0.
     0.         0.27777778]
    

    you get the same values that you are seeing in the plot.
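
    To see where the two numbers come from, the following is a minimal sketch in plain Python that normalises the same counts both ways. The counts are read back from the output printed above (0.8333 corresponds to 3, 0.5556 to 2, 0.2778 to 1), and the helper names are made up for illustration; this is not Lets-Plot's actual implementation.

```python
# Counts per bin, reconstructed from the output above (an assumption);
# they sum to n = 20.
counts = [3, 2, 0, 2, 1, 1, 0, 2, 0, 1, 1, 3, 1, 0, 1, 0, 1, 0, 0, 1]
width = 0.2  # binwidth used in the plots


def rectangle_area(counts, width):
    """Area of the histogram bars: sum of the counts times the bin width."""
    return sum(counts) * width


def trapezoid_area(counts, width):
    """Trapezium-rule area through the bar heights at equally spaced bin centres."""
    return sum((a + b) / 2 * width for a, b in zip(counts, counts[1:]))


rect = rectangle_area(counts, width)  # every bar counted in full
trap = trapezoid_area(counts, width)  # the first and last bars count only half

print(round(rect, 4), round(trap, 4))
print(round(counts[0] / rect, 4))  # bin 1, rectangle normalisation
print(round(counts[0] / trap, 4))  # bin 1, trapezium normalisation
```

    The trapezium rule only counts half of the first and last bars, so the area drops from 4 to 3.6 and every density is scaled up by 4 / 3.6, which turns the expected .75 for bin 1 into roughly .83.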