algorithm heap median

Find running median from a stream of integers


Possible Duplicate:
Rolling median algorithm in C

Given that integers are read from a data stream, find the median of the elements read so far in an efficient way.

Solution I have read: We can use a max heap on left side to represent elements that are less than the effective median, and a min heap on right side to represent elements that are greater than the effective median.

After processing an incoming element, the numbers of elements in the two heaps differ by at most 1. When both heaps contain the same number of elements, we take the average of the two heaps' roots as the effective median. When the heaps are not balanced, we select the effective median from the root of the heap containing more elements.

But how would we construct the max heap and the min heap, i.e. how would we know the effective median here? I think that we would insert 1 element into the max-heap and then the next element into the min-heap, and so on for all the elements. Correct me if I am wrong here.


Solution

  • There are a number of different solutions for finding the running median of streamed data; I will briefly talk about them at the very end of the answer.

    The question is about the details of a specific solution (the max heap/min heap solution); how the heap-based solution works is explained below:

    For the first two elements, add the smaller one to the maxHeap on the left and the bigger one to the minHeap on the right. Then process the stream data one by one:

    Step 1: Add next item to one of the heaps
    
       if next item is smaller than maxHeap root add it to maxHeap,
       else add it to minHeap
    
    Step 2: Balance the heaps (after this step heaps will be either balanced or
       one of them will contain 1 more item)
    
       if number of elements in one of the heaps is greater than the other by
       more than 1, remove the root element from the one containing more elements and
       add to the other one
    

    Then at any given time you can calculate median like this:

       If the heaps contain an equal number of elements:
         median = (root of maxHeap + root of minHeap)/2
       Else
         median = root of the heap with more elements
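
The two steps and the median rule above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the class name `RunningMedian` and its method names are made up for this example, and because Python's `heapq` module only provides a min-heap, the max-heap is simulated here by storing negated values. The special handling of the first two elements falls out of the balancing step, so it is not coded separately.

```python
import heapq


class RunningMedian:
    """Running median with a max-heap (lower half) and a min-heap (upper half).

    Hypothetical helper class written for illustration only.
    """

    def __init__(self):
        self._lower = []  # max-heap, stored as negated values
        self._upper = []  # min-heap

    def add(self, x):
        # Step 1: add the item to one of the heaps.
        if self._lower and x < -self._lower[0]:
            heapq.heappush(self._lower, -x)
        else:
            heapq.heappush(self._upper, x)

        # Step 2: balance, so the sizes differ by at most 1.
        if len(self._lower) > len(self._upper) + 1:
            heapq.heappush(self._upper, -heapq.heappop(self._lower))
        elif len(self._upper) > len(self._lower) + 1:
            heapq.heappush(self._lower, -heapq.heappop(self._upper))

    def median(self):
        if not self._lower and not self._upper:
            raise ValueError("no elements have been added yet")
        if len(self._lower) == len(self._upper):
            return (-self._lower[0] + self._upper[0]) / 2
        if len(self._lower) > len(self._upper):
            return -self._lower[0]
        return self._upper[0]


if __name__ == "__main__":
    rm = RunningMedian()
    for value in [5, 15, 1, 3]:
        rm.add(value)
        print(value, rm.median())
```

Note that items are not distributed alternately between the heaps: each item goes to the heap on its side of the current median, and only the balancing step moves a root across. Each `add` costs O(log n) and each `median` query costs O(1).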
    

    Now I will talk about the problem in general, as promised at the beginning of the answer. Finding the running median from a stream of data is a tough problem, and finding an exact solution efficiently under memory constraints is probably impossible for the general case. On the other hand, if the data has some characteristics we can exploit, we can develop efficient specialized solutions. For example, if we know that the data is an integral type with a bounded range, then we can keep a count per value, as in counting sort. The memory and the time per element then depend only on the size of the range, not on the number of elements read, so they are constant for a fixed range. The heap-based solution is more general because it can be used for other data types (doubles) as well. And finally, if the exact median is not required and an approximation is enough, you can try to estimate a probability density function for the data and estimate the median from that.
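
As a rough illustration of the counting idea, here is a minimal Python sketch. It assumes the integers are known in advance to lie in the range 0 to `max_value` inclusive; that bound, the class name `CountingMedian`, and the simple linear scan over the counts are all assumptions made for this example. Memory and query time are proportional to the size of the range and do not grow with the number of elements read.

```python
class CountingMedian:
    """Running median for integers in a known range [0, max_value].

    Hypothetical helper class written for illustration only.
    """

    def __init__(self, max_value):
        self._counts = [0] * (max_value + 1)  # one counter per possible value
        self._n = 0                           # number of elements seen so far

    def add(self, x):
        if not 0 <= x < len(self._counts):
            raise ValueError("value outside the assumed range")
        self._counts[x] += 1
        self._n += 1

    def _kth(self, k):
        # Return the k-th smallest value seen so far (0-indexed).
        seen = 0
        for value, count in enumerate(self._counts):
            seen += count
            if seen > k:
                return value
        raise IndexError("k is out of range")

    def median(self):
        if self._n == 0:
            raise ValueError("no elements have been added yet")
        mid = self._n // 2
        if self._n % 2 == 1:
            return self._kth(mid)
        return (self._kth(mid - 1) + self._kth(mid)) / 2


if __name__ == "__main__":
    cm = CountingMedian(max_value=20)
    for value in [5, 15, 1, 3]:
        cm.add(value)
        print(value, cm.median())
```

For a large range, the linear scan in `_kth` becomes the bottleneck; the heap-based approach avoids that cost and does not need a bound on the values.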