Tags: algorithm, sorting, heapsort, bucket-sort, external-sorting

Merging 1000s of files into one sorted file with memory constraints - heap vs bucket sort


The numbers are unique within each file and across all files. Here is one popular approach (external sorting) to merging n files into one sorted file when the machine cannot hold all the data in memory:
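In outline: sort each file on its own (or split it into memory-sized sorted runs), write the runs to temp files, then k-way merge the runs with a min-heap so that only one number per run is in memory at a time. Below is a minimal Python sketch of that idea, assuming each input file holds one integer per line and is small enough to sort in memory by itself; the paths and helper names are only illustrative.

```python
import heapq
import os
import tempfile

def sort_runs(input_paths):
    """Sort each input file on its own (assumed to fit in memory) and
    write the result to a temporary 'run' file, one run per input."""
    run_paths = []
    for path in input_paths:
        with open(path) as f:
            numbers = sorted(int(line) for line in f if line.strip())
        fd, run_path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w") as out:
            out.writelines(f"{n}\n" for n in numbers)
        run_paths.append(run_path)
    return run_paths

def merge_runs(run_paths, output_path):
    """k-way merge of the sorted runs; heapq.merge keeps a min-heap over
    the current head of each run, so only one number per run is in memory."""
    files = [open(p) for p in run_paths]
    try:
        streams = [(int(line) for line in f) for f in files]
        with open(output_path, "w") as out:
            out.writelines(f"{n}\n" for n in heapq.merge(*streams))
    finally:
        for f in files:
            f.close()

# e.g. merge_runs(sort_runs(input_paths), "sorted_output.txt")
```

With this split, peak memory is bounded by the largest single file during the sorting pass and by one number per run during the merge, but every number is written to a temp run and read back again.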

While I do understand how this algorithm helps with memory usage, there is a lot of I/O overhead involved.

Another possible approach uses bucket sort, especially if the numbers are uniformly distributed:
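In outline: make one pass over all the files, routing each number into a range bucket (a temp file) based on its value, then sort each bucket in memory and concatenate the buckets in order. A minimal Python sketch, assuming non-negative integers roughly uniform over [0, max_value] and enough buckets that each one fits in memory; the helper names are only illustrative.

```python
import os
import tempfile

def partition_into_buckets(input_paths, num_buckets, max_value):
    """One pass over every file: append each number to the temp file of
    the bucket that covers its value range."""
    buckets = [tempfile.NamedTemporaryFile("w", delete=False, suffix=f".b{i}")
               for i in range(num_buckets)]
    width = max_value // num_buckets + 1   # bucket i covers [i*width, (i+1)*width)
    for path in input_paths:
        with open(path) as f:
            for line in f:
                if line.strip():
                    n = int(line)
                    buckets[n // width].write(f"{n}\n")
    paths = [b.name for b in buckets]
    for b in buckets:
        b.close()
    return paths

def sort_buckets(bucket_paths, output_path):
    """Sort each bucket in memory (assumed small enough) and concatenate
    the buckets in order, which yields the fully sorted output."""
    with open(output_path, "w") as out:
        for bpath in bucket_paths:
            with open(bpath) as f:
                numbers = sorted(int(line) for line in f if line.strip())
            out.writelines(f"{n}\n" for n in numbers)
            os.remove(bpath)
```

The trade-off mentioned above shows up here: every number is written once to a bucket file and read back once, so the temp storage is roughly the size of the whole input.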

Now, I believe the latter approach can be slightly faster, or at least equivalent in terms of time complexity, but it uses more storage space for temp files. Are these two approaches comparable in time complexity, or is the first one always better even with the multiple I/O calls? Am I missing some time-consuming step in the second approach that makes it less optimal?


Solution

  • My understanding is that it is the other way around. If the numbers have a predictable distribution over a predictable range, bucket sort is faster (in the limit, O(n) versus O(n log n) for comparison-based sorts). Better yet, buckets can be sorted independently of each other, which makes them perfect for a distributed sort; see the sketch below.
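To illustrate that independence, here is a hedged sketch: because the buckets from the second approach cover disjoint, ordered value ranges, each bucket can be sorted by a separate worker and the results concatenated in bucket order. The bucket_paths input and both helpers are illustrative, not part of the original post.

```python
from concurrent.futures import ProcessPoolExecutor

def sort_one_bucket(bucket_path):
    """Sort a single bucket file in place; safe to run in parallel because
    buckets cover disjoint, ordered value ranges."""
    with open(bucket_path) as f:
        numbers = sorted(int(line) for line in f if line.strip())
    with open(bucket_path, "w") as f:
        f.writelines(f"{n}\n" for n in numbers)
    return bucket_path

def sort_buckets_in_parallel(bucket_paths, output_path):
    """Sort all buckets concurrently, then concatenate them in bucket order.
    (Call this under an `if __name__ == "__main__":` guard on platforms
    that spawn worker processes.)"""
    with ProcessPoolExecutor() as pool:
        done = list(pool.map(sort_one_bucket, bucket_paths))  # map preserves input order
    with open(output_path, "w") as out:
        for p in done:
            with open(p) as f:
                out.write(f.read())
```

The same partition-then-sort structure also works across machines: each worker only needs its own bucket, and the final file is just the buckets stitched together in range order.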