pythonalgorithmsortingheapexternal-sorting

Using heap for big disk sorts


On the official Python docs here, it is mentioned that:

Heaps are also very useful in big disk sorts. You most probably all know that a big sort implies producing “runs” (which are pre-sorted sequences, whose size is usually related to the amount of CPU memory), followed by a merging passes for these runs, which merging is often very cleverly organised.
It is very important that the initial sort produces the longest runs possible. Tournaments are a good way to achieve that. If, using all the memory available to hold a tournament, you replace and percolate items that happen to fit the current run, you’ll produce runs which are twice the size of the memory for random input, and much better for input fuzzily ordered.

Moreover, if you output the 0’th item on disk and get an input which may not fit in the current tournament (because the value “wins” over the last output value), it cannot fit in the heap, so the size of the heap decreases. The freed memory could be cleverly reused immediately for progressively building a second heap, which grows at exactly the same rate the first heap is melting.
When the first heap completely vanishes, you switch heaps and start a new run. Clever and quite effective!

I am aware of an algorithm called External sorting in which we:

  1. Break down the input into smaller chunks.
  2. Sort all the chunks individually and write them back to disk one-by-one.
  3. Create a heap and do a k-way merge among all the sorted chunks.

I completely understood external sorting as described on Wikipedia, but am not able to understand the author when they say:

If, using all the memory available to hold a tournament, you replace and percolate items that happen to fit the current run, you’ll produce runs which are twice the size of the memory for random input, and much better for input fuzzily ordered.

and:

Moreover, if you output the 0’th item on disk and get an input which may not fit in the current tournament (because the value “wins” over the last output value), it cannot fit in the heap, so the size of the heap decreases.

What is this heap melting?


Solution

  • Heap melting is not a thing. It's just the word the author uses for the heap getting smaller as to pull out the smallest items.

    The idea he's talking about is a clever replacement for "divide the input into chunks and sort the chunks" part of the external sort. It produces larger sorted chunks.

    The idea is that you first read the biggest chunk you can into memory and arrange it into a heap, then you start writing out the smallest elements from the heap as you read new elements in.

    When you read in an element that is smaller than an element you have already written out, it can't go in the current chunk (it would ruin the sort), so you remember it for the next chunk. Elements that are not smaller than the last one you wrote out can be inserted into the heap. They will make it out into the current chunk, making the current chunk larger.

    Eventually your heap will be empty. At that point you're done with the current chunk -- heapify all the elements you remembered and start writing out the next chunk.