Tags: numpy, statistics, histogram

Mean and median of distribution given numpy histogram


Suppose you have a numpy histogram computed from some data (which you don't have access to), so you only know bins and counts. Is there an efficient way of computing the mean and median of the distribution described by the histogram?


Solution

  • No, not exactly. Once the data are aggregated into a histogram, the individual values inside each bin are lost, so the mean and median of the original population cannot be recovered exactly.

    As a demonstration, here are two different arrays (with different means and medians) that produce the same counts and bins:

    import numpy as np

    a1 = np.array([10, 20, 100, 300, 310])
    np.mean(a1), np.median(a1)
    # (148.0, 100.0)
    
    a2 = np.array([10, 10, 130, 300, 310])
    np.mean(a2), np.median(a2)
    # (152.0, 130.0)
    
    np.histogram(a1, bins=2)
    # (array([3, 2]), array([ 10., 160., 310.]))
    
    np.histogram(a2, bins=2)
    # (array([3, 2]), array([ 10., 160., 310.]))
    

    Approximation

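    You can, however, bound the mean. It is smallest when every value sits at the left edge of its bin and largest when every value sits at the right edge: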

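    # counts and bin edges of the histogram (here taken from a1 above)
    cnt, bins = np.histogram(a1, bins=2)

    low = np.average(bins[:-1], weights=cnt)   # all values at left bin edges
    high = np.average(bins[1:], weights=cnt)   # all values at right bin edges
    print(f'The average is in the {low}-{high} range.')
    # The average is in the 70.0-220.0 range.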
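
    If a single number is needed rather than a range, a common approximation is to place every value at the centre of its bin. The sketch below is not part of the original answer; it reuses the a1 array from the demonstration, and the names centers, mean_estimate and max_error are illustrative. The estimate is only as good as the assumption that values are spread evenly within each bin.

```python
import numpy as np

# Same toy data as in the demonstration; in practice only cnt and
# bins would be available.
a1 = np.array([10, 20, 100, 300, 310])
cnt, bins = np.histogram(a1, bins=2)

# Assumption: every value in a bin sits at the bin centre.
centers = (bins[:-1] + bins[1:]) / 2
mean_estimate = np.average(centers, weights=cnt)

# The estimate is the midpoint of the low-high range, so its error is
# at most half of the count-weighted average bin width.
max_error = np.average(np.diff(bins), weights=cnt) / 2
print(f'Estimated mean: {mean_estimate} +/- {max_error}')
```

    For this toy histogram the estimate is 145 with an error bound of 75, which matches the 70-220 range and contains the true mean of 148.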
    

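    The median can be bracketed in a similar way, by locating the bin in which the cumulative count first reaches half of the total: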

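    half = cnt.sum() / 2                     # half of the total number of values
    cnt_cumsum = np.add.accumulate(cnt)      # running total of the counts
    idx = np.searchsorted(cnt_cumsum, half)  # first bin whose running total reaches half
    low = bins[idx]
    high = bins[idx+1]
    print(f'The median is in the {low}-{high} range.')
    # The median is in the 10.0-160.0 range.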
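
    To turn that range into a point estimate, one option is to interpolate linearly inside the median bin, as is often done for grouped data. This is a sketch under the assumption that values are uniformly spread within the bin, not something the histogram can confirm; the names below, fraction and median_estimate are illustrative.

```python
import numpy as np

# Same toy data as in the demonstration; in practice only cnt and
# bins would be available.
a1 = np.array([10, 20, 100, 300, 310])
cnt, bins = np.histogram(a1, bins=2)

half = cnt.sum() / 2
cnt_cumsum = np.cumsum(cnt)
idx = np.searchsorted(cnt_cumsum, half)

# Number of values in the bins before the median bin (0 for the first bin).
below = cnt_cumsum[idx - 1] if idx > 0 else 0

# Assumption: values are spread uniformly inside the median bin, so the
# median sits at the matching fraction of the bin width.
fraction = (half - below) / cnt[idx]
median_estimate = bins[idx] + fraction * (bins[idx + 1] - bins[idx])
print(f'Estimated median: {median_estimate}')
```

    For the toy histogram this gives roughly 135, while the true median of a1 is 100. With only two wide bins the interpolation is crude; it improves as the bins get narrower.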
    

    Example with 1000 random values and 20 bins:

    True data mean: 0.496, median: 0.481
    The average is in the 0.471-0.521 range.
    The median is in the 0.45-0.5 range.
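
    An experiment of this kind can be reproduced along the following lines. The original data and seed are not given, so the uniform distribution on [0, 1) and the seed used here are assumptions, and the printed numbers will differ from the ones above.

```python
import numpy as np

# Assumption: uniform values in [0, 1) with an arbitrary seed; the
# data behind the figures quoted above are not known.
rng = np.random.default_rng(0)
data = rng.random(1000)
cnt, bins = np.histogram(data, bins=20)

# Bounds on the mean: all values at the left or right bin edges.
mean_low = np.average(bins[:-1], weights=cnt)
mean_high = np.average(bins[1:], weights=cnt)

# Bin in which the cumulative count first reaches half of the total.
half = cnt.sum() / 2
idx = np.searchsorted(np.cumsum(cnt), half)
median_low, median_high = bins[idx], bins[idx + 1]

print(f'True data mean: {data.mean():.3f}, median: {np.median(data):.3f}')
print(f'The average is in the {mean_low:.3f}-{mean_high:.3f} range.')
print(f'The median is in the {median_low:.3f}-{median_high:.3f} range.')
```

    With 20 bins over a range of about 1, each range is about 0.05 wide, which is consistent with the widths in the output above.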