I'm trying to create a more efficient way to sort lists and dictionaries in python and came across Efficient data structure keeping objects sorted on multiple keys. There the suggested solution was to use the heapq module.
However, in my tests a heap seems to be twice as slow as the native Python sorting algorithm. Below is the code I used to do a simple test. Results were for example:
Heap: 0.005993366241455078
Standard: 0.0020036697387695312
Is there a way to actually use a heap and increase the performance, as the post linked above claims? What would that code look like?
Here's the code to test it:
import random
import time
from heapq import *
standardlist = []
heaplist = []
for i in range(10000):
num = random.randint(0,10000)
standardlist.append(num)
heappush(heaplist, num)
# Standard sorting method:
start_time = time.time()
sorted_list = sorted(standardlist)
finish_time_1 = time.time() - start_time
# Heap sorting method:
start_time = time.time()
heap_sorted_list = [heappop(heaplist) for i in range(len(heaplist))]
finish_time_2 = time.time() - start_time
print("Standard Finish Time:", finish_time_1)
print("Heap Finish Time:", finish_time_2)
A heap data structure may be the right solution when you have a collection that changes over time (through insertions and deletions), and at each moment you want to have fast access to the minimum entry that is currently in the collection, and possibly extract it. There were such requirements in the Q&A you linked to.
If however the goal is just to sort a dataset once, then using a heap is not the most efficient.
Some remarks on your code:
The way you have populated the heap has O(𝑛log𝑛) time complexity. It is more efficient to first populate the list, and then call heapify
on it: this has a time complexity of O(𝑛). Admittedly, this is not relevant for the timings you performed, but it also results in less code:
standardlist = [random.randint(0,100000) for _ in range(1000000)]
heaplist = standardlist[:]
heapify(heaplist)
sorted
is a native Python function, which can rely on compiled C code for the sort
part. An explicit loop written in Python cannot beat that.
Although heapsort has an optimal time complexity, a well-implemented quicksort is usually 2–3 times faster than heapsort. See also Quicksort vs heapsort. Python actually uses a highly optimised sort algorithm.
Your code only performs one test, and it completes in a few milliseconds. This does not give very representative results. It would be better to repeat the test multiple times and get an average.