Tags: python, algorithm, sorting, mergesort

Why is Merge Sort Performance LINEAR?


I'm working on comparing sorting algorithms in Python. I've implemented the algorithms and a testing function that evaluates each algorithm on various input sizes, up to a very large maximum in the hundreds of thousands. The execution times are saved for plotting later. As expected, Selection Sort exhibits quadratic behavior, and its graph reflects that. However, I'm encountering an issue with Merge Sort: its graph appears linear instead of showing the expected n log n behavior!

This is my implementation of Merge Sort:

import numpy as np  # merge() below uses NumPy arrays and np.inf

def merge_sort(A):
    merge_sort_aux(A, 0, len(A) - 1)

# Merge Sort Helper Function
# (Recursive)
def merge_sort_aux(A, p, r):
    if p < r:
        q = (p + r) // 2  # Integer division
        merge_sort_aux(A, p, q)
        merge_sort_aux(A, q + 1, r)
        merge(A, p, q, r)

def merge(A, p, q, r):
    # Calculate the sizes of the two sublists
    n1 = q - p + 1
    n2 = r - q

    # Create two NumPy arrays initially initialized to 0 
    # They are of type float to support sentinels
    L = np.zeros(n1 + 1, dtype=float)
    R = np.zeros(n2 + 1, dtype=float)

    # Copy data into NumPy arrays
    L[:n1] = [A[p + i] for i in range(n1)]
    R[:n2] = [A[q + 1 + j] for j in range(n2)]

    # Sentinels
    L[n1] = np.inf
    R[n2] = np.inf

    i = j = 0

    for k in range(p, r + 1):
        if L[i] <= R[j]:
            A[k] = L[i]
            i += 1  # Increment the index of the left sublist to preserve the loop invariant
        else:
            A[k] = R[j]
            j += 1  # Increment the index of the right sublist

This is the function I use to run the tests:

from timeit import default_timer as timer  # assuming the timer is timeit's default_timer

def test_algo(algo_function, maxRange=TEST_MAX_DIMENSION, minRange=2, step=TEST_STEP):
    """
    Test the execution time of the given algorithm function for various input sizes.

    Parameters:
    - algo_function: The algorithm function to be tested.
    - maxRange: The maximum input size to be tested.
    - minRange: The minimum input size to start the testing.
    - step: The step size between different input sizes.

    Returns:
    - results: A dictionary to store the execution times for each input size.
    """

    results = {}  # Dictionary to store execution times for each input size

    for i in range(minRange, maxRange, step):
        A = random_list(i)  # Assuming you have a function 'random_list' generating a list of size i
        start = timer()
        algo_function(A)
        end = timer()
        results[i] = end - start

    return results

[Plot: Merge Sort performance]

Any suggestions or explanations would be greatly appreciated!

EDIT

I have tested with lists of up to 3 million values:

[Plot: results for lists of up to 3 million values]


Solution

  • The problem is not that merge sort has linear growth; it's that O(n log n) and O(n) are very hard to distinguish visually over bounded ranges. Unless the results are scaled to account for the dominant trend and plotted jointly, as below, the human eye doesn't notice small amounts of curvature.

    [Plot: x log(x) and 7.15 * x - 370 plotted jointly]
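    As a rough illustration (the range and fitting choices here are my own, not from the original plot), a joint plot like the one above can be reproduced with a short sketch: fit a straight line to x log(x) over a bounded range and draw both curves together.

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(2, 1000, 500)
    y = x * np.log(x)

    # Least-squares straight line a*x + b fitted to the n log n curve
    a, b = np.polyfit(x, y, 1)

    plt.plot(x, y, label="x log(x)")
    plt.plot(x, a * x + b, "--", label=f"fitted line: {a:.2f}*x + ({b:.0f})")
    plt.legend()
    plt.show()

    Over this range the fitted line hugs the n log n curve closely, which is exactly why the raw timing plot looks straight.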

    Even using statistical tools such as linear regression, a straight line fitted to O(n log n) data will yield a very high value for R², a measure of how much of the variation along the vertical axis is explained by the value along the horizontal axis.
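    To see this numerically, here is a minimal sketch (my own, with an arbitrary range) that fits a line to n log n data and computes R² directly:

    import numpy as np

    n = np.linspace(2, 1000, 500)
    t = n * np.log(n)

    a, b = np.polyfit(n, t, 1)
    residuals = t - (a * n + b)
    r_squared = 1 - residuals.var() / t.var()
    print(r_squared)  # very close to 1 over this range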

    We generally don't know how to scale the linear or n log n values so that they can be plotted side by side, but once you accept the idea of scaling there's a simple alternative. You can scale the results of your experiments by dividing them by candidate complexities, such as n, n log n, n**2, etc., and plot each scaled result against n (or against log n if the range is large). If the scaled values keep growing, the candidate underestimates the true complexity; if they decay towards zero, it overestimates it; and if they settle around a nonzero constant, you have found the right rate of growth, as in the sketch below.
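    Here is a minimal sketch of this diagnostic (the helper name plot_scaled and the plotting details are my own), applied to the results dictionary that test_algo returns:

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_scaled(results, scale, label):
        # Divide each measured time by the candidate complexity at that n
        n = np.array(sorted(results))
        t = np.array([results[k] for k in n])
        plt.plot(n, t / scale(n), label=label)

    results = test_algo(merge_sort)
    plot_scaled(results, lambda n: n, "t / n")                    # grows: underestimate
    plot_scaled(results, lambda n: n * np.log(n), "t / n log n")  # flattens: right guess
    plot_scaled(results, lambda n: n**2, "t / n**2")              # decays to 0: overestimate
    plt.legend()
    plt.show()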

    To see that this works even when lower order terms are involved in the actual complexity, consider the following example where t(n) = c1 * n**2 + c2 * n + c3 with unknown constants c1, c2, and c3. If we incorrectly scale by n, we get

    t(n) / n = (c1 * n**2 / n) + (c2 * n / n) + (c3 / n)
             = c1 * n + c2 + (c3 / n)
    

    which increases as n does. If we overdo it and scale by n**3, we get:

    t(n) / n**3 = (c1 * n**2 / n**3) + (c2 * n / n**3) + (c3 / n**3)
                = (c1 / n) + (c2 / n**2) + (c3 / n**3)
    

    which will asymptotically converge to zero. When we scale by the Goldilocks complexity n**2:

    t(n) / n**2 = (c1 * n**2 / n**2) + (c2 * n / n**2) + (c3 / n**2)
                = c1 + (c2 / n) + (c3 / n**2)
    

    the lower order terms rapidly converge to zero, and the scaled times will converge towards c1.
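    A quick numeric check of the algebra above, with made-up constants (c1 = 3.0, c2 = 50.0, c3 = 1000.0 are arbitrary choices):

    c1, c2, c3 = 3.0, 50.0, 1000.0
    t = lambda n: c1 * n**2 + c2 * n + c3

    for n in (10, 100, 1000, 10000):
        print(n, t(n) / n, t(n) / n**2, t(n) / n**3)
    # t/n keeps growing, t/n**3 shrinks towards zero,
    # and t/n**2 settles near c1 = 3.0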

    I used this approach in a paper published in 2018 (DOI: 10.1002/qre.2424) to show that the rate of growth for optimal solutions (or best identified feasible solutions when optimal couldn't be determined) was greater than O(k**2) but less than O(k**3):

    [Plot: performance measure / k**3]