algorithmmedian

Find a median of two sorted arrays using divide and conquer and recursion


I'm trying to find a median of two sorted arrays, of different sizes, using recursion on a divide and conquer manner. The entry are two sorted arrays and the output is the median of these two sorted arrays.

I am trying to not merge these arrays and then find the median. What I am trying to do is just directly return the median.

Example:

I know there are many possibilities to solve the problem but I want to do the code using recursion and using a divide and conquer manner.

Moreover, while my code works for some cases, it does not work for the case when one of the arrays have size 1. I would love to have suggestions on how to fix the code for this base case. I can see the problem comes from the first two "ifs" on the code. But if I remove them, I won't have a return for the leaves of the recursion tree, so it won't work for other cases.

import math

def medianOfTwo(L1, L2, l1, h1, l2, h2):

  if (h1 == l1): return L1[l1]
  if (h2 == l2): return L2[l2]

  medL1 = math.floor((l1+h1)/2)
  medL2 = math.floor((l2+h2)/2)


  if(L1[medL1] == L2[medL2]):
    return L1[medL1]
  

  if(L1[medL1] < L2[medL2]):
    return medianOfTwo(L1, L2, medL1, h1, l2, medL2)
  else:
    return medianOfTwo(L1, L2, l1, medL1, medL2, h2)


#ar1 = [1, 2, 3, 4, 12] 
#ar2 = [0, 7, 8, 11] 

ar1 = [1, 2, 3, 4] 
#ar2 = [7, 8, 11, 12]
ar2 = [0]

print(medianOfTwo(ar1, ar2, 0, len(ar1)-1, 0, len(ar2)-1))

Solution

  • Here's a recursive divide and conquer method that at each level removes the smallest and largest values from the combination of the two arrays. Once there are two or less values remaining, the median result is the smaller of the remaining values (if you want the average, replace min(arr1 + arr2) with sum(arr1 + arr2)/len(arr1 + arr2)).

    def median(arr1, arr2):
        if len(arr1 + arr2) <= 2:
            return min(arr1 + arr2)
        if len(arr1) == 0:
            return median(arr1, arr2[1:-1])
        elif len(arr2) == 0:
            return median(arr1[1:-1], arr2)
        if arr1[0] < arr2[0]:
            if arr1[-1] > arr2[-1]:
                return median(arr1[1:-1], arr2)
            else:
                return median(arr1[1:len(arr1)+1], arr2[:-1])
        else:
            if arr2[-1] > arr1[-1]:
                return median(arr1, arr2[1:-1])
            else:
                return median(arr1[:-1], arr2[1:len(arr2)+1])
    
    median([1, 2, 3, 4, 12], [0, 7, 8, 11])
    median([1, 2, 3, 4], [0])
    median([10], [0, 5, 9, 12])
    

    Output:

    4
    2
    9
    

    It has been pointed out that this is quite inefficient (O(n^2)) due to the need to make array copies. Here is another version which maintains indexes into each array instead, so should be O(n):

    def median(arr1, arr2):
        def helper(arr1, arr2, l1, h1, l2, h2):
            if max(h1-l1, 0) + max(h2-l2, 0) <= 1:
                return sum(arr1[l1:h1+1] + arr2[l2:h2+1]) / len(arr1[l1:h1+1] + arr2[l2:h2+1])
            if arr1[l1:l1+1] < arr2[l2:l2+1] or len(arr2[l2:l2+1]) == 0:
                if arr1[h1:h1+1] > arr2[h2:h2+1] or len(arr2[h2:h2+1]) == 0:
                    return helper(arr1, arr2, l1+1, h1-1, l2, h2)
                else:
                    return helper(arr1, arr2, l1+1, h1, l2, h2-1)
            else:
                if arr2[h2:h2+1] > arr1[h1:h1+1] or len(arr1[h1:h1+1]) == 0:
                    return helper(arr1, arr2, l1, h1, l2+1, h2-1)
                else:
                    return helper(arr1, arr2, l1, h1-1, l2+1, h2)
        
        return helper(arr1, arr2, 0, len(arr1)-1, 0, len(arr2)-1)
    

    I ran performance testing using perfplot and needless to say @btilly's bisection algorithm is significantly faster. This algorithm also exceeds the maximum recursion depth once the total length of the arrays exceeds ~2,000 entries. Test code:

    def setup(n):
        random.seed('median')
        split = random.randrange(n)
        a = random.sample(range(n), k=split)
        b = set(range(n)).difference(a)
        return sorted(a), sorted(b)
    
    b = perfplot.bench(
        setup=setup,
        kernels = [ median_of_two_sorted_arrays, median ],
        labels = ['bisect', 'shrink'],
        n_range = [2**k for k in range(3, 11)]
    )
    

    Here's the chart: enter image description here