Tags: algorithm, sorting, external-sorting

How to add data to a bunch of sorted files


I apologize if this has been asked before, but I couldn't find any posts with the wording I chose. I'm preparing for interviews, and I've been reading up on external sorting. For instance, if you want to sort several hard disks' worth of 32-bit integers, you could do a counting sort, keeping a 64-bit counter for each possible 32-bit value: after one pass over the data, every possible 32-bit integer value has a counter recording how many times it occurred. You could also use an external merge sort for the same job, taking O(n log n) time instead of the counting sort's O(n). However, I've been thinking about a case that is probably very common but that I can't find the best approach for: adding new data to a bunch of sorted files, possibly spread across many hard disks.
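The counting-sort idea above can be sketched in Python. This is a toy: the domain size here is tiny, whereas for real 32-bit data it would be 2**32 and the counter array would live on disk rather than in memory.

```python
from array import array

def count_values(values, domain_size):
    # One unsigned 64-bit counter ('Q') per possible value.
    counts = array('Q', [0]) * domain_size
    for v in values:
        counts[v] += 1
    return counts

def emit_sorted(counts):
    # Replay each value as many times as it was counted.
    for value, count in enumerate(counts):
        for _ in range(count):
            yield value
```

For example, `list(emit_sorted(count_values([3, 1, 3, 0], 4)))` gives `[0, 1, 3, 3]`.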

If the data were in memory, one could use a heap (priority queue) to do each insert in O(log n) time. But we can't build a heap out of hard disk space. With a sorted list, you would use an O(log n) binary search to find the new datum's place, then shift the rest of the data backward or forward, or perhaps not shift anything at all, depending on the container (array, linked list, etc.). In the hard disk world, though, reads and writes are much more expensive than in RAM, so inserting data somewhere and then shifting (rewriting) everything after it seems prohibitively expensive. Are there any techniques for this that you could recommend? I'd be happy to do the reading myself; I just couldn't find the right way to word my question. Thank you!
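For comparison, the in-memory case I'm describing looks like this in Python: the search is O(log n), but the shift that follows is O(n), which is cheap in RAM and ruinous on disk.

```python
import bisect

data = [2, 5, 9, 14]      # sorted in-memory list
bisect.insort(data, 7)    # O(log n) binary search + O(n) element shift
# data is now [2, 5, 7, 9, 14]
```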


Solution

  • I'd say: read the file of your sorted data, read the file of new data you want merged in, build up the counters, and simply overwrite the sorted data file with the newly calculated one. Sequential reading is far cheaper on modern disk systems than random reading, and since you need to account for every int anyway, a single sequential read of the entire volume will take less time than ~32 random single-sector reads per number (a binary search over the whole volume) for each entry in the file to be added.

    Also, at an extra-large scale like "several hard disks," I'd say sorting 32-bit ints is best done with the result kept in counter form. You can expect at least one hit in almost every bucket of the 32-bit space, so storing 2^32 counters of 64 bits each can take less space than the expanded output, say 2^33 32-bit zeroes followed by 2^32 ones, and so on.
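A minimal sketch of that single sequential pass, assuming a hypothetical on-disk layout of `domain_size` consecutive little-endian uint64 counters (the function and file names here are illustrative, not a standard API):

```python
import struct
from collections import Counter

COUNTER = struct.Struct('<Q')  # one little-endian uint64 per bucket

def merge_into_counters(old_path, new_path, new_values, domain_size):
    # Count the incoming values once, in memory.
    deltas = Counter(new_values)
    # Stream the old counter file front to back (purely sequential I/O)
    # and write the updated counters out to a new file.
    with open(old_path, 'rb') as src, open(new_path, 'wb') as dst:
        for value in range(domain_size):
            (count,) = COUNTER.unpack(src.read(COUNTER.size))
            dst.write(COUNTER.pack(count + deltas.get(value, 0)))
```

At full 32-bit scale the counter file is 2^32 × 8 bytes = 32 GiB, read and rewritten once per merge, with no random seeks at all.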