[SOLVED] What is '..density..' doing in Lets-Plot histogram

My understanding of a density histogram is that the value associated with bin i, given

a frequency of f_i
a sample size of n
a bin width of w

is the density d_i = (f_i / n) / w

For this test case, the frequency and probability histograms are as I expect, but the density histogram isn't. Bin 1, for example, has a frequency of 3. Given n = 20 and w = .2, I would expect a density of .75 (which np.histogram calculates), but the plot indicates .84.
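
As a quick check of that arithmetic, here is a minimal sketch of the formula above in plain Python. The helper name bin_density is hypothetical, and the bin width of 0.2 is taken from the binwidth=.2 argument in the code below.

```python
import math

def bin_density(freq, n, width):
    """Density of one bin: relative frequency divided by bin width."""
    return (freq / n) / width

# Bin 1 of the test case: frequency 3, sample size 20, bin width 0.2
d1 = bin_density(3, 20, 0.2)
print(round(d1, 2))
```

This gives the .75 that np.histogram reports for bin 1, not the value shown in the plot.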

Is my understanding of density faulty, am I implementing it incorrectly, or is something broken?

This code demonstrates the problem. The Lets-Plot version is 4.3.3.

import numpy as np
import pandas as pd
from lets_plot import *
LetsPlot.setup_html()

rng = np.random.default_rng(12345)

test = rng.normal(0, 1, (20, 2))
df = pd.DataFrame(test, columns=['A', 'B'])

plot_1 = (
    ggplot(df, aes(x='A')) +
    geom_histogram(binwidth=.2, fill='pink') +
    ggtitle('Frequency')
)

plot_2 = (
    ggplot(df, aes(x='A')) +
    geom_histogram(aes(y='..density..'),binwidth=.2, fill='pink') +
    ggtitle('Density')
)

plot_3 = (
    ggplot(df, aes(x='A')) +
    geom_histogram(aes(weight=np.ones_like(df.A) / len(df.A)),binwidth=.2, fill='pink') +
    ggtitle('Probability')
)

# to calculate the same distribution using np.histogram, force an appropriate set of bins:
bins = np.arange(-1.56, 2.5, .2)
freq, edges = np.histogram(df.A, bins=bins)
print('Frequency by bin calculated by np.histogram:')
# double quotes outside so the nested single quotes also parse before Python 3.12
print(f"{'Bin':^5s}{'Freq':^5s}")
for i, f in enumerate(freq):
    print(f'{i+1:^5d}{f:^5d}')
print('')

plot_1.show()
plot_3.show()

dens, edges = np.histogram(df.A, density=True, bins=bins)
print('Density by bin calculated by np.histogram:')
print(f"{'Bin':^5s}{'Density':^9s}")
for i,d in enumerate(dens):
    print(f'{i+1:^5d}{d:^9.2f}')
print('')
plot_2.show()

Solution

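It looks like aes(y='..density..') normalises the counts by an area computed with the trapezium rule rather than by the total area of the bars. The trapezium rule, applied at the bin centres, gives the first and last bins only half their weight, so it slightly underestimates the area you get from summing the counts and multiplying by the bin width (here 3.6 instead of 4.0), and the resulting densities come out slightly too high. For example, if you do: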

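counts, edges = np.histogram(df.A, density=False, bins=bins)
# edges[:-1] + 0.1 are the bin centres (left edge plus half the 0.2 bin width)
# note: np.trapz is named np.trapezoid from NumPy 2.0 onwards
print(counts / np.trapz(counts, edges[:-1] + 0.1))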
[0.83333333 0.55555556 0.         0.55555556 0.27777778 0.27777778
 0.         0.55555556 0.         0.27777778 0.27777778 0.83333333
 0.27777778 0.         0.27777778 0.         0.27777778 0.
 0.         0.27777778]

you get the same values that you are seeing in the plot.
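
To see where the two normalisations part ways, here is a sketch in plain Python. The counts are reconstructed from the output above (each printed value multiplied by the trapezium area of 3.6), and whether Lets-Plot really normalises this way internally is an inference from the matching numbers, not something confirmed from its source.

```python
# Counts per bin, reconstructed from the printed output above (an assumption);
# bin width 0.2 as in the question.
counts = [3, 2, 0, 2, 1, 1, 0, 2, 0, 1, 1, 3, 1, 0, 1, 0, 1, 0, 0, 1]
width = 0.2

# Rectangle area: every bar contributes count * width.
rect_area = sum(counts) * width

# Trapezium rule over the bin centres: neighbouring counts are averaged,
# so the first and last bars only contribute half of their area.
trap_area = sum((a + b) / 2 * width for a, b in zip(counts, counts[1:]))

density_rect = [c / rect_area for c in counts]
density_trap = [c / trap_area for c in counts]

print('areas:', round(rect_area, 2), round(trap_area, 2))
print('bin 1:', round(density_rect[0], 4), round(density_trap[0], 4))
```

Dividing by the rectangle area reproduces the .75 from np.histogram for bin 1, while dividing by the trapezium area gives the roughly .83 seen in the plot.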