I'm trying to solve the following problem:
Suppose you are given two sorted arrays A[1 .. m] and B[1 .. n] and an integer k. Describe a fast algorithm to find the kth smallest element in the union A ∪ B. For example, given the input A[1 .. 8] = [0,1,6,9,12,13,18,20], B[1 .. 5] = [2,5,7,17,19], and k = 6, your algorithm should return the integer 7.
I've found many solutions to this online, but none of them made much sense to me. I found an answer that seems very interesting, but I'm having trouble understanding why the algorithm works.
First, the author provides this algorithm to find the median of the union of two arrays:
Median(A[1 .. n],B[1...n]) :
if n < 100:
use brute force
else if A[n/2] > B[n/2]:
return Median(A[1 . . n/2], B[n/2 + 1 . . n])
else
return Median(A[n/2 + 1 . . n], B[1 . . n/2])
The proof for this is:
Suppose A[n/2] > B[n/2]. Then A[n/2+1] is larger than all n elements in A[1 .. n/2]∪B[1 .. n/2], and therefore larger than the median of A ∪ B, so we can discard the upper half of A. Similarly, B[n/2 − 1] is smaller than all n + 1 elements of A[n/2 . . n] ∪ B[n/2 + 1 . . n], and therefore smaller than the median of A ∪ B, so we can discard the lower half of B. Because we discard the same number of elements from each array, the median of the remaining subarrays is the median of the original A ∪ B. Time complexity is O(log(n))
The author then proceeds to offer an algorithm for the problem I am trying to solve that uses the median algorithm above:
Select(A[1 .. m],B[1 .. n],k):
if k < (m + n)/2
return Median(A[1 .. k], B[1 .. k])
else
return Median(A[k−n .. m], B[k-m .. n])
Here, Median is the algorithm with one minor tweak. If Median wants an entry in either A or B that is outside the bounds of the original arrays, it uses the value −∞ if the index is too low, or ∞ if the index is too high.
The author does not provide a proof of correctness for this algorithm, and simply states the time complexity as: O(log min {k, m + n − k}) = O(log(m + n))
I have a few questions about this.
First, why are these bounds used: Median(A[k−n .. m], B[k-m .. n]). Specifically, why [k-n .. m] and [k-m .. n]? I would have appreciated a proof of correctness for the algorithm
Second, how is O(log(min{k, m + n − k})) equal to O(log(m + n))?
Here's a link to the answer I found
Considering this algorithm:
Select(A[1 .. m],B[1 .. n],k): if k < (m + n)/2 return Median(A[1 .. k], B[1 .. k]) else return Median(A[k−n .. m], B[k-m .. n])
Observe that to apply the given Median
algorithm, we need component arrays of equal length. Now consider A[1 .. k]
and B[1 .. k]
, as defined in the algorithm. These both have length k
. Between the two of them, they must contain the k
th largest overall and all smaller ones. Any elements added via an expansion of the original arrays are greater than the k
th. Therefore, of the 2k
elements, exactly k - 1
are smaller than the k
th overall, hence the k
th is the median of these -- supposing that we define the median of 2x
items as the x
th, which is a bit unusual.
Note well that the above does not depend on the magnitude of k
(as long as it does not exceed m + n
, which would be invalid). But if k
is close to m + n
then the inputs to the Median algorithm are large, too, so we consider an alternative approach.
If the k
th element overall is in A
then its index there can be no less than k - n
(occurring when all n
elements of B
are less), and likewise, if it is in B
then its index there can be no less than k - m
.
Therefore, consider alternative arrays A[k−n .. m]
and B[k-m .. n]
. These both have length m + n + 1 - k
, and any elements added by the expansions relative to the original arrays (at indices less than 1) are smaller than the k
th. The total number of elements is 2(m + n + 1 - k)
, and exactly n + m - k
of them are greater than the k
th overall. That leaves n + m - k + 1
less -- oops. With the median definition required for the other case, we need it the other way around: n + m - k
less and n + m - k + 1
greater. This approach is a bust.
First, why are these bounds used: Median(A[k−n .. m], B[k-m .. n]). Specifically, why [k-n .. m] and [k-m .. n]?
I think you will see the idea now, but it doesn't quite work out after all.
Second, how is O(log(min{k, m + n − k})) equal to O(log(m + n))?
Taking k
, m
, and n
all to be size parameters, min{k, m + n − k}
is maximized when k = (m + n) / 2
, in which case the result is also (m + n) / 2
. In this particular view, then, O(log(min{k, m + n − k})) = O(log((m + n) / 2)) = O(log(m + n))
.