pythonmatplotlibscipystatisticsbernoulli-probability

Plot density histogram of Bernoulli sample and a Bernoulli pmf together


Summary of Question:

Why is my density from my sample so different to the pmf and how can I perform this simulation so that the pmf and the sample estimates are similar.

Question:

I have simulated a sample of independent Bernoulli trials using scipy. I am now trying to take a density histogram of the sample that I created and compare it to the pmf (probability mass function). I would expect the density histogram to show two bins each hovering near the pmf but instead, I have 2 bins above the pmf values at 5. Could someone please show me how to create a density histogram that does not do this for the Bernoulli? I tried a similar simulation with a few other distributions and it seemed to work fine. What am I missing here and could you show me how to manipulate my code to make this work?

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

trials = 10**3
p = 0.5


sample_bernoulli = stats.bernoulli.rvs(p, size=trials) # Generate benoulli RV
plt.plot((0,1), stats.bernoulli.pmf((0,1), p), 'bo', ms=8, label='bernoulli pmf')

# Density histogram of generated values
plt.hist(sample_bernoulli, density=True, alpha=0.5, color='steelblue', edgecolor='none')
plt.show()

enter image description here

I must apologize if this is a simple or trivial question but I couldn't find a solution online and found the issue interesting. Any help at all would be appreciated.


Solution

  • The reason is that plt.hist is primarily meant to work with continuous distributions. If you don't provide explicit bin boundaries, plt.hist just creates 10 equally spaced bins between the minimum and maximum value. Most of these bins will be empty. With only two possible data values, there should be just two bins, so 3 boundaries:

    import numpy as np
    import matplotlib.pyplot as plt
    import scipy.stats as stats
    
    trials = 10**3
    p = 0.5
    
    sample_bernoulli = stats.bernoulli.rvs(p, size=trials) # Generate benoulli RV
    plt.plot((0,1), stats.bernoulli.pmf((0,1), p), 'bo', ms=8, label='bernoulli pmf')
    
    # Density histogram of generated values
    plt.hist(sample_bernoulli, density=True, alpha=0.5, color='steelblue', edgecolor='none', bins=np.linspace(-0.5, 1.5, 3))
    plt.show()
    

    example plot

    Here is a visualization of the default bin boundaries and how the samples fit into the bins. Note that with density=True, the histogram is normalized such that the area of all the bars sums to 1. In this case two bars are 0.1 wide and about 5.0 high, while 8 others have height zero. So, the total area is 2*0.1*5 + 8*0.0 = 1.

    import numpy as np
    import matplotlib.pyplot as plt
    import scipy.stats as stats
    
    trials = 10 ** 3
    p = 0.5
    
    sample_bernoulli = stats.bernoulli.rvs(p, size=trials)  # Generate benoulli RV
    
    # Density histogram of generated values with default bins
    values, binbounds, bars = plt.hist(sample_bernoulli, density=True, alpha=0.2, color='steelblue', edgecolor='none')
    # show the bin boundaries
    plt.vlines(binbounds, 0, max(values) * 1.05, color='crimson', ls=':')
    # show the sample values with a random displacement
    plt.scatter(sample_bernoulli * 0.9 + np.random.uniform(0, 0.1, trials),
                np.random.uniform(0, max(values), trials), color='lime')
    # show the index of each bin
    for i in range(len(binbounds) - 1):
        plt.text((binbounds[i] + binbounds[i + 1]) / 2, max(values) / 2, i, ha='center', va='center', fontsize=20, color='crimson')
    plt.show()
    

    showing bin boundaries