Tags: python, matplotlib, histogram, dpi, antialiasing

Why are histograms incorrectly displayed when the distribution is tightly clustered?


I'm trying to display some data in a histogram, but when the data is too tightly clustered, I get either an empty graph or one I believe to be inaccurate.

Consider the following code:

import numpy as np
from matplotlib import pyplot as plt

# Generate data
nums = np.random.rand(1000)+1000

# Make Histogram
plt.hist(nums, bins=1000, alpha=0.6, color='blue')  
plt.xlim([900,1100])
plt.yscale('linear')
plt.grid(True)
plt.show()

This gives the following graph:

Plot 1

However, if I change the xlim values to:

plt.xlim([990,1010])

I get:

Plot 2

If I change it yet again to

plt.xlim([999,1001])

I get:

Plot 3

With each bin covering a smaller range of numbers, I would've expected the peaks of the bins to decrease, rather than increase. Is there something I'm not understanding here, or is this a problem with matplotlib? (Note: this seems very similar to Empty histogram in matplotlib - data in small interval, but I think I've laid out the problem more explicitly, and I've noticed an additional problem even when the resulting plots are not blank: the highest bin in my third plot is taller than in the second, even though its bins cover narrower ranges.)
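
A quick way to confirm that the bin counts themselves never change with xlim is np.histogram (which plt.hist uses under the hood); a minimal sketch, reusing nums from above:

counts, edges = np.histogram(nums, bins=1000)
print(counts.max())  # tallest bin: the same regardless of any later xlim call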


Solution

  • When dealing with random data, it's always a good idea to set a seed so you're guaranteed to be working with the same data on every run.

    import numpy as np
    from matplotlib import pyplot as plt
    
    # Generate data
    np.random.seed(1000)
    nums = np.random.rand(1000)+1000
    

    Even with the exact same data, we run into the problem you describe. To illustrate, I'll draw the same histogram in four subplots with different x-axis limits:

    fig, axs = plt.subplots(2, 2)
    
    bins = 100
    
    n0, bins0, patches0 = axs[0,0].hist(nums, bins=bins, color='blue')
    axs[0,0].set_xlim([900,1100])
    axs[0,0].grid()
    
    n1, bins1, patches1 = axs[0,1].hist(nums, bins=bins, color='blue')
    axs[0,1].set_xlim([990,1010])
    axs[0,1].grid()
    
    n2, bins2, patches2 = axs[1,0].hist(nums, bins=bins, color='blue')
    axs[1,0].set_xlim([999,1002])
    axs[1,0].grid()
    
    n3, bins3, patches3 = axs[1,1].hist(nums, bins=bins, color='blue')
    axs[1,1].set_xlim([999.9,1001.1])
    axs[1,1].grid()
    
    plt.show()
    

    Subplots - low resolution

    And yeah, the problem appears even with 100 bins instead of 1000. But if you check the n (bin counts) and bins (bin edges) arrays returned by each call, they are all identical (as they should be, since every subplot histograms exactly the same data).

    You can check that with:

    print((n0 == n1).all() and (n0 == n2).all() and (n0 == n3).all())
    # True
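
    The bin edges match as well, which you can verify the same way:

    print((bins0 == bins1).all() and (bins0 == bins2).all() and (bins0 == bins3).all())
    # True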
    

    If the data are the same but the plots are not, this is almost certainly a rendering-resolution problem. You can try it on your local PC: re-rendering the same figure at a larger size or higher DPI makes the missing bars appear, which settles the case.
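
    To put a number on it, here is a rough sketch (the exact value varies with backend and figure size) estimating how many screen pixels one bar gets in the original plot:

    fig, ax = plt.subplots()            # default figure: 6.4 x 4.8 inches at 100 dpi
    ax.hist(nums, bins=1000)
    ax.set_xlim([900, 1100])
    fig.canvas.draw()                   # force a render so the window extent is valid

    axes_px = ax.get_window_extent().width        # axes width in pixels
    bin_width = (nums.max() - nums.min()) / 1000  # one bin, in data units
    print(bin_width / (1100 - 900) * axes_px)     # roughly 0.002 pixels per bar

    At a fraction of a pixel per bar, most bars simply cannot be drawn, and which ones survive depends on how they happen to line up with the pixel grid.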

    With low-resolution figures there are more bins to plot than pixels available, so bars narrower than a pixel get lost or merged when the figure is rasterized. That's why the plot changes every time you zoom in or out: the pixel grid shifts, and different bars survive.
    If you give your plot a larger size (or a higher DPI), it will be able to show a larger number of bins, making the plot more trustworthy.
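
    As a concrete sketch of that fix (the figsize, dpi, and filename below are just example values):

    fig, ax = plt.subplots(figsize=(16, 9), dpi=150)  # many more pixels per bin
    ax.hist(nums, bins=100, color='blue')
    ax.set_xlim([999, 1002])
    ax.grid(True)
    fig.savefig('hist.png', dpi=300)  # the saved file gets its own, even higher resolution
    plt.show()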