I'm comparing sorting algorithms in Python. I've implemented the algorithms and a testing function that times each one on a range of input sizes, up to a very large maximum in the hundreds of thousands of elements. The execution times are saved for plotting later. As expected, Selection Sort exhibits quadratic behavior, and its graph reflects that. However, I'm running into an issue with Merge Sort: its graph appears linear instead of showing the expected n log n behavior!
This is my implementation of Merge Sort:
import numpy as np

def merge_sort(A):
    merge_sort_aux(A, 0, len(A) - 1)

# Merge Sort helper function (recursive)
def merge_sort_aux(A, p, r):
    if p < r:
        q = (p + r) // 2  # Integer division
        merge_sort_aux(A, p, q)
        merge_sort_aux(A, q + 1, r)
        merge(A, p, q, r)

def merge(A, p, q, r):
    # Sizes of the two sublists
    n1 = q - p + 1
    n2 = r - q
    # Two NumPy arrays, initialized to zero.
    # They are of type float so they can hold the np.inf sentinels.
    L = np.zeros(n1 + 1, dtype=float)
    R = np.zeros(n2 + 1, dtype=float)
    # Copy data into the NumPy arrays
    L[:n1] = [A[p + i] for i in range(n1)]
    R[:n2] = [A[q + 1 + j] for j in range(n2)]
    # Sentinels
    L[n1] = np.inf
    R[n2] = np.inf
    i = j = 0
    for k in range(p, r + 1):
        if L[i] <= R[j]:
            A[k] = L[i]
            i += 1  # Advance the left-sublist index to preserve the loop invariant
        else:
            A[k] = R[j]
            j += 1  # Advance the right-sublist index
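A quick check against Python's built-in sorted() on random inputs rules out a correctness bug:

import random

# Compare merge_sort's output with Python's built-in sorted() on random inputs
for _ in range(100):
    A = [random.randint(0, 1000) for _ in range(random.randint(0, 50))]
    expected = sorted(A)
    merge_sort(A)
    assert A == expected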
This is the function I use to run the tests:
def test_algo(algo_function, maxRange=TEST_MAX_DIMENSION, minRange=2, step=TEST_STEP):
    """
    Test the execution time of the given algorithm function for various input sizes.

    Parameters:
    - algo_function: The algorithm function to be tested.
    - maxRange: The maximum input size to be tested.
    - minRange: The minimum input size to start the testing.
    - step: The step size between different input sizes.

    Returns:
    - results: A dictionary mapping each input size to its execution time.
    """
    results = {}  # Dictionary of execution times, keyed by input size
    for i in range(minRange, maxRange, step):
        A = random_list(i)  # random_list(i) generates a random list of size i
        start = timer()
        algo_function(A)
        end = timer()
        results[i] = end - start
    return results
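For reference, a rough sketch of how the results get collected and plotted (assuming timer is timeit.default_timer; random_list and the TEST_* constants are defined elsewhere in my code):

import matplotlib.pyplot as plt
from timeit import default_timer as timer

# Time merge sort across input sizes and plot time against n
merge_times = test_algo(merge_sort)
sizes = sorted(merge_times)
plt.plot(sizes, [merge_times[n] for n in sizes], label="Merge Sort")
plt.xlabel("Input size n")
plt.ylabel("Execution time (s)")
plt.legend()
plt.show()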
Any suggestions or explanations would be greatly appreciated!
EDIT
The problem is not that merge sort has linear growth; it's that O(n log n) and O(n) are very hard to distinguish visually over bounded ranges. Unless the results are scaled to remove the dominant trend and plotted jointly, as below, the human eye doesn't notice small amounts of curvature.
Even with statistical tools such as linear regression, a straight line fitted to O(n log n) data will yield a very high value for R², the measure of how much of the variation along the vertical axis is explained by the value along the horizontal axis.
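A small demonstration with synthetic, noise-free n log n "timings" shows just how high R² gets over a bounded range:

import numpy as np

# Synthetic "timings" with exact n log n growth over a bounded range
n = np.arange(2, 200_000, 1000, dtype=float)
t = n * np.log(n)

# Least-squares straight-line fit: t ~ a*n + b
a, b = np.polyfit(n, t, 1)
fitted = a * n + b

# Coefficient of determination R^2
ss_res = np.sum((t - fitted) ** 2)
ss_tot = np.sum((t - t.mean()) ** 2)
print(f"R^2 = {1 - ss_res / ss_tot:.6f}")  # extremely close to 1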
We generally don't know how to scale the linear or n log n values to be able to plot them side by side, but once you accept the idea of scaling there's a simple alternative. You can scale the results of your experiments by dividing them by candidate complexities, such as n, n log n, n**2, etc., and plot the scaled result against n (or against log n if the range is large). If the resulting plot:

- grows with n => the conjectured complexity is too low;
- decays towards zero => the conjectured complexity is too high;
- levels off at a positive constant => the conjecture matches the actual growth rate.

To see that this works even when lower order terms are involved in the actual complexity, consider the following example where t(n) = c1 * n**2 + c2 * n + c3 with unknown constants c1, c2, and c3. If we incorrectly scale by n, we get

t(n) / n = (c1 * n**2 / n) + (c2 * n / n) + (c3 / n)
         = c1 * n + c2 + (c3 / n)

which increases as n does. If we overdo it by scaling by n**3, we get:

t(n) / n**3 = (c1 * n**2 / n**3) + (c2 * n / n**3) + (c3 / n**3)
            = (c1 / n) + (c2 / n**2) + (c3 / n**3)

which asymptotically converges to zero. When we scale by the Goldilocks complexity n**2:

t(n) / n**2 = (c1 * n**2 / n**2) + (c2 * n / n**2) + (c3 / n**2)
            = c1 + (c2 / n) + (c3 / n**2)

the lower order terms rapidly converge to zero, and the scaled times converge towards c1.
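Here is a minimal sketch of that diagnostic applied to the timing dictionary returned by test_algo above, with n, n log n, and n**2 as the candidate complexities:

import numpy as np
import matplotlib.pyplot as plt

results = test_algo(merge_sort)  # {input size: execution time}
sizes = np.array(sorted(results), dtype=float)
times = np.array([results[n] for n in sorted(results)])

# Divide the measured times by each candidate complexity
scaled = {
    "t(n) / n": times / sizes,
    "t(n) / (n log n)": times / (sizes * np.log(sizes)),
    "t(n) / n**2": times / sizes**2,
}

# The panel that flattens to a positive constant identifies the growth rate:
# t(n) / n should keep growing, t(n) / n**2 should decay towards zero,
# and t(n) / (n log n) should level off for merge sort.
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, (label, values) in zip(axes, scaled.items()):
    ax.plot(sizes, values)
    ax.set_title(label)
    ax.set_xlabel("n")
plt.tight_layout()
plt.show()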
I used this approach in a paper published in 2018 (DOI: 10.1002/qre.2424) to show that the rate of growth for optimal solutions (or best identified feasible solutions when optimal couldn't be determined) was greater than O(k**2) but less than O(k**3).