Tags: algorithm, optimization, data-structures, percentile

Fast algorithm for repeated calculation of percentile?


In an algorithm I have to calculate the 75th percentile of a data set whenever I add a value. Right now I am doing this:

  1. Get value x
  2. Insert x at the back of an already sorted array
  3. Swap x down until the array is sorted again
  4. Read the element at position array[array.size * 3/4]

Step 3 is O(n) and the rest is O(1), but this is still quite slow, especially as the array grows larger. Is there any way to optimize this?
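
For concreteness, here is a minimal Python sketch of that approach (the class name is purely illustrative): `bisect.insort` finds the insertion point in O(log n), but shifting the tail of the list still makes each insert O(n) overall:

```python
import bisect

class SortedArrayPercentile:
    """The current approach: keep a sorted list and index into it."""

    def __init__(self):
        self.data = []

    def add(self, x):
        # Binary search is O(log n), but element shifting makes this O(n).
        bisect.insort(self.data, x)

    def percentile75(self):
        # Equivalent to array[array.size * 3/4] above.
        return self.data[len(self.data) * 3 // 4]
```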


Solution

  • You can do it with two heaps. I'm not sure whether there is a less 'contrived' solution, but this one gives O(log n) time per insertion, and heaps are included in the standard libraries of most programming languages.

    The first heap (heap A) contains the smallest 75% of the elements; the other heap (heap B) contains the rest (the largest 25%). Heap A has its biggest element on top (a max-heap); heap B has its smallest on top (a min-heap).

    1. Adding an element

    Check whether the new element x is <= max(A). If it is, add it to heap A; otherwise add it to heap B.
    Now, if we added x to heap A and it became too big (it holds more than 75% of the elements), we need to remove the biggest element from A (O(log n)) and add it to heap B (also O(log n)).
    The same applies if heap B becomes too big.

    2. Finding the "0.75 median"

    Just take the largest element of A (or the smallest of B). This requires O(log n) or O(1) time, depending on the heap implementation.

    edit
    As Dolphin noted, we need to specify precisely how big each heap should be for every n (if we want a precise answer). For example, if size(A) = floor(n * 0.75) and size(B) holds the rest, then for every n > 0, array[array.size * 3/4] = min(B), as the sketch below illustrates.
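
    A minimal Python sketch of the whole scheme, assuming the floor(n * 0.75) sizing rule above (class and method names are my own). Python's `heapq` only provides min-heaps, so heap A stores negated values to act as a max-heap:

    ```python
    import heapq

    class RunningPercentile:
        """75th percentile of a growing data set via two heaps, O(log n) per add."""

        def __init__(self):
            self.low = []   # heap A: max-heap via negated values, smallest 75%
            self.high = []  # heap B: min-heap, largest 25%

        def add(self, x):
            # Route x to the proper heap by comparing against max(A).
            if self.low and x <= -self.low[0]:
                heapq.heappush(self.low, -x)
            else:
                heapq.heappush(self.high, x)
            # Restore the invariant size(A) == floor(n * 0.75).
            n = len(self.low) + len(self.high)
            target = n * 3 // 4
            while len(self.low) > target:
                heapq.heappush(self.high, -heapq.heappop(self.low))
            while len(self.low) < target:
                heapq.heappush(self.low, -heapq.heappop(self.high))

        def percentile75(self):
            # min(B) equals array[array.size * 3/4]; peeking the top is O(1).
            return self.high[0]

    rp = RunningPercentile()
    for v in [5, 1, 9, 3, 7, 2, 8]:
        rp.add(v)
    print(rp.percentile75())  # sorted: [1, 2, 3, 5, 7, 8, 9]; index 5 -> 8
    ```

    Each add does at most one push, one pop-and-push for rebalancing, so the per-insert cost stays O(log n) regardless of how large the data set grows.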