arraysalgorithmbinary-searchdivide-and-conquer

Proof of correctness for algorithm to find the median of the union of two sorted arrays


I'm trying to solve the following problem:

Suppose you are given two sorted arrays A[1 .. m] and B[1 .. n] and an integer k. Describe a fast algorithm to find the kth smallest element in the union A ∪ B. For example, given the input A[1 .. 8] = [0,1,6,9,12,13,18,20], B[1 .. 5] = [2,5,7,17,19], and k = 6, your algorithm should return the integer 7.

I've found many solutions to this online, but none of them made much sense to me. I found an answer that seems very interesting, but I'm having trouble understanding why the algorithm works.

First, the author provides this algorithm to find the median of the union of two arrays:

Median(A[1 .. n],B[1...n]) :
  if n < 100:
    use brute force
  else if A[n/2] > B[n/2]:
    return Median(A[1 . . n/2], B[n/2 + 1 . . n])
  else
    return Median(A[n/2 + 1 . . n], B[1 . . n/2])

The proof for this is:

Suppose A[n/2] > B[n/2]. Then A[n/2+1] is larger than all n elements in A[1 .. n/2]∪B[1 .. n/2], and therefore larger than the median of A ∪ B, so we can discard the upper half of A. Similarly, B[n/2 − 1] is smaller than all n + 1 elements of A[n/2 . . n] ∪ B[n/2 + 1 . . n], and therefore smaller than the median of A ∪ B, so we can discard the lower half of B. Because we discard the same number of elements from each array, the median of the remaining subarrays is the median of the original A ∪ B. Time complexity is O(log(n))

The author then proceeds to offer an algorithm for the problem I am trying to solve that uses the median algorithm above:

Select(A[1 .. m],B[1 .. n],k):
  if k < (m + n)/2
    return Median(A[1 .. k], B[1 .. k])
  else
    return Median(A[k−n .. m], B[k-m .. n])

Here, Median is the algorithm with one minor tweak. If Median wants an entry in either A or B that is outside the bounds of the original arrays, it uses the value −∞ if the index is too low, or ∞ if the index is too high.

The author does not provide a proof of correctness for this algorithm, and simply states the time complexity as: O(log min {k, m + n − k}) = O(log(m + n))

I have a few questions about this.

First, why are these bounds used: Median(A[k−n .. m], B[k-m .. n]). Specifically, why [k-n .. m] and [k-m .. n]? I would have appreciated a proof of correctness for the algorithm

Second, how is O(log(min{k, m + n − k})) equal to O(log(m + n))?

Here's a link to the answer I found


Solution

  • Considering this algorithm:

    Select(A[1 .. m],B[1 .. n],k):
      if k < (m + n)/2
        return Median(A[1 .. k], B[1 .. k])
      else
        return Median(A[k−n .. m], B[k-m .. n])
    

    Observe that to apply the given Median algorithm, we need component arrays of equal length. Now consider A[1 .. k] and B[1 .. k], as defined in the algorithm. These both have length k. Between the two of them, they must contain the kth largest overall and all smaller ones. Any elements added via an expansion of the original arrays are greater than the kth. Therefore, of the 2k elements, exactly k - 1 are smaller than the kth overall, hence the kth is the median of these -- supposing that we define the median of 2x items as the xth, which is a bit unusual.

    Note well that the above does not depend on the magnitude of k (as long as it does not exceed m + n, which would be invalid). But if k is close to m + n then the inputs to the Median algorithm are large, too, so we consider an alternative approach.

    If the kth element overall is in A then its index there can be no less than k - n (occurring when all n elements of B are less), and likewise, if it is in B then its index there can be no less than k - m. Therefore, consider alternative arrays A[k−n .. m] and B[k-m .. n]. These both have length m + n + 1 - k, and any elements added by the expansions relative to the original arrays (at indices less than 1) are smaller than the kth. The total number of elements is 2(m + n + 1 - k), and exactly n + m - k of them are greater than the kth overall. That leaves n + m - k + 1 less -- oops. With the median definition required for the other case, we need it the other way around: n + m - k less and n + m - k + 1 greater. This approach is a bust.

    First, why are these bounds used: Median(A[k−n .. m], B[k-m .. n]). Specifically, why [k-n .. m] and [k-m .. n]?

    I think you will see the idea now, but it doesn't quite work out after all.

    Second, how is O(log(min{k, m + n − k})) equal to O(log(m + n))?

    Taking k, m, and n all to be size parameters, min{k, m + n − k} is maximized when k = (m + n) / 2, in which case the result is also (m + n) / 2. In this particular view, then, O(log(min{k, m + n − k})) = O(log((m + n) / 2)) = O(log(m + n)).