python numpy histogram

Why does the last bin in a NumPy histogram have an unusually high count?


This Python 3.12.7 script with numpy 2.2.4:

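import numpy as np

# 500x500 array of uniformly distributed integers in 0..255
a = np.random.randint(0, 256, (500, 500)).astype(np.uint8)

# First histogram: the second positional argument is range(0, 255, 25)
counts, bins = np.histogram(a, range(0, 255, 25))
# Print each count next to the left and right edge of its bin
print(np.column_stack((counts, bins[:-1], bins[1:])))

# Second histogram: the second positional argument is range(0, 257, 16)
counts, bins = np.histogram(a, range(0, 257, 16))
print(np.column_stack((counts, bins[:-1], bins[1:])))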

produces this kind of output:

[[24721     0    25]
 [24287    25    50]
 [24413    50    75]
 [24441    75   100]
 [24664   100   125]
 [24390   125   150]
 [24488   150   175]
 [24355   175   200]
 [24167   200   225]
 [25282   225   250]]
[[15800     0    16]
 [15691    16    32]
 [15640    32    48]
 [15514    48    64]
 [15732    64    80]
 [15506    80    96]
 [15823    96   112]
 [15724   112   128]
 [15629   128   144]
 [15681   144   160]
 [15661   160   176]
 [15558   176   192]
 [15526   192   208]
 [15469   208   224]
 [15772   224   240]
 [15274   240   256]]

where the first histogram always has the highest count in bin [225, 250). The second histogram indicates a uniform distribution, as expected. I tried a dozen times and the anomaly was always there. Can someone explain this behavior?


Solution

  • I think the docs explain this pretty well, but the explanation is spread across two places. First, range(0, 255, 25) is passed as the bins parameter, not the range parameter, so it defines the bin edges 0, 25, ..., 250. Second, the Notes section states:

    All but the last (righthand-most) bin is half-open. In other words, if bins is:

    [1, 2, 3, 4]
    

    then the first bin is [1,2) (including 1, but excluding 2) and the second [2,3). The last bin, however, is [3,4], which includes 4.

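    Pretty sure the extra counts in your case are the elements that equal 250. This makes sense: the last bin is the closed interval [225, 250], so it covers 26 integer values instead of 25, and for uniformly distributed data its expected count is 26/25 of the others, about 4% higher (roughly 250000/256 ≈ 977 extra elements). The second histogram shows no such excess because its last edge is 256, a value a uint8 array can never contain.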
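One way to check this explanation is to count the last bin by hand. The sketch below is only an illustration: the seeded generator and the variable names (edges, half_open_last, equal_to_250) are my own choices, not part of the question's script. It compares the last histogram count with a manual half-open count plus the number of elements equal to 250.

```python
import numpy as np

# Seeded generator so the run is reproducible (the seed is an arbitrary choice)
rng = np.random.default_rng(0)
a = rng.integers(0, 256, (500, 500), dtype=np.uint8)

# Same bin edges as range(0, 255, 25): 0, 25, ..., 250
edges = np.arange(0, 255, 25)
counts, _ = np.histogram(a, bins=edges)

# Count the last bin manually as the half-open interval [225, 250)
half_open_last = np.count_nonzero((a >= 225) & (a < 250))
# Count the elements sitting exactly on the rightmost edge
equal_to_250 = np.count_nonzero(a == 250)

print("np.histogram last bin:", counts[-1])
print("manual [225, 250):    ", half_open_last)
print("elements equal to 250:", equal_to_250)

# Values 251..255 lie outside the edges and are not counted at all
print("elements counted:", counts.sum(), "of", a.size)
```

If the explanation holds, the first printed number equals the sum of the other two. If the goal is equal-width bins over the whole uint8 range, the edges have to extend past 255, as they do in the second histogram of the question.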