There don't seem to be any preexisting questions on this, at least from a title search. I am trying to find the optimal number of passes for an external merge sort. So, if we have 1000 chunks of data, one pass would be a 1000-way merge. Two passes could be 5 groups of 200 chunks, then a final merge of 1 group of 5 chunks. And so on. I've done some math, which must have a flaw, because it looks like two passes never beat one pass. It could very well be a misunderstanding of how data is read, though.
First, a numerical example:
Data: 100 GB
Ram: 1 GB
Since we have 1 GB of memory, we can load and sort 1 GB at a time using quicksort or mergesort. Now we have 100 sorted chunks to merge. We can do a 100-way merge. This is done by making buckets of size `RAM/(chunks+1) = 1024MB/101 ≈ 10.14MB`. There are 100 buckets of 10.14 MB, one for each of the 100 chunks, and one output bucket also of size 10.14 MB. As we merge, if any input bucket empties, we do a disk seek to refill that bucket. Likewise, when the output bucket gets full, we write it to disk and empty it. I claim that the number of "times the disk needs to read" is `(data/ram)*(chunks+1)`. I get this from the fact that we have `ram/(chunks+1)`-sized input buckets, and we must read in the entire data for a given pass, so we read `data/bucket_size` times. In other words, every time an input bucket empties we must refill it. We do this over 100 chunks here, so `numChunks*(chunk_size/bucket_size) = data_size/bucket_size`, or `100*(1024MB/10.14MB)`. Since `bucket_size = ram/(chunks+1)`, this is `100*(1024/10.14) = (data/ram)*(chunks+1) = (100*1024MB/1024MB)*101 = 10100` reads.
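As a sanity check, the single-pass read count above can be computed directly. This is my own quick sketch (the function name and parameters are mine, not from any library), using the example's figures:

```python
def one_pass_reads(data_mb, ram_mb):
    """Number of input-bucket refills for a single k-way merge pass."""
    chunks = data_mb // ram_mb            # one sorted chunk per RAM-load
    bucket_mb = ram_mb / (chunks + 1)     # k input buckets + 1 output bucket
    return data_mb / bucket_mb            # total data read / bucket size

# 100 GB of data, 1 GB of RAM -> (data/ram) * (chunks + 1) = 100 * 101
print(one_pass_reads(100 * 1024, 1024))   # ~10100
```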
For a two-pass system, we do A groups of B chunks each, then a final merge of 1 group of A chunks. Using the previous logic, we have `numReads = A*((data/ram)*(B+1)) + 1*((data/ram)*(A+1))`. We also have `A*B = data/ram`. For instance, 10 groups of 10 chunks, where each chunk is 1 GB. Here A = 10 and B = 10, so `10*10 = 100 = 100GB/1GB`, which is `data/ram`. This is because `data/ram` was the original number of chunks; for two passes, we want to break those `data/ram` chunks into A groups of B chunks.
I'll try to break down the formula here. Let D = data, A = #groups, B = #chunks/group, R = RAM.

`A*(D/R)*(B+1) + 1*(D/R)*(A+1)` is A times the number of reads of an external merge on B chunks, plus the final merge on A chunks.

Substituting `A = D/(R*B)` gives `(D^2/(B*R^2))*(B+1) + (D/R)*(D/(R*B)+1)`, which simplifies to `(D^2/R^2)*(1 + 2/B) + D/R`. That is the number of reads for a two-pass external merge. For one pass, we have `(data/ram)*(chunks+1)` where `chunks = data/ram`, so for one pass we have `D^2/R^2 + D/R`. We see that the two-pass count only approaches the one-pass count as the group size B goes to infinity, and never drops below `D^2/R^2 + D/R`.
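Plugging numbers into the two formulas makes the gap concrete. A minimal sketch (function names and the A = 10, B = 10 split are my own choices for illustration):

```python
def one_pass(D, R):
    """Reads for a single (D/R)-way merge: (D/R) * (D/R + 1)."""
    return (D / R) * (D / R + 1)

def two_pass(D, R, B):
    """Reads for A groups of B chunks, then a final A-way merge."""
    A = D / (R * B)
    return A * (D / R) * (B + 1) + (D / R) * (A + 1)

D, R = 100.0, 1.0            # 100 GB of data, 1 GB of RAM
print(one_pass(D, R))        # 100 * 101 = 10100.0
print(two_pass(D, R, 10))    # A = 10: 10*100*11 + 100*11 = 12100.0
```

Counting raw reads only, two passes always lose, which is exactly the puzzle the question poses.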
. So there must be something about the reads I'm missing, or my math is flawed. Thanks to anyone who takes the time to help me!
You ignore the fact that the total time it takes to read a block of data from disk is the sum of a roughly constant access time (seek plus rotational latency) and a transfer time proportional to the amount of data read.
As the number of chunks increases, the size of the input buffers (you call them buckets) decreases. The smaller the input buffers get, the more pronounced the effect of the constant access time on the total time it takes to fill a buffer. At a certain point, the time to fill a buffer is almost completely dominated by the access time, and the total time for a merge pass begins to scale with the number of buffer reads rather than the amount of data read.
That's where additional merge passes can speed up the process: they allow fewer, larger input buffers, which mitigates the effect of access time.
Edit: Here's a quick back-of-the-envelope calculation to give an idea about where the break-even point is.
The total transfer time can be calculated easily. All the data has to be read and written once per pass:
total_transfer_time = num_passes * 2 * data / transfer_rate
The total access time for buffer reads is:
total_access_time = num_passes * num_buffer_reads * access_time
Since there's only a single output buffer, it can be made larger than the input buffers without wasting too much memory, so I'll ignore the access time for writes. The number of buffer reads is `data / buffer_size`, the buffer size is about `ram / num_chunks` for the one-pass approach, and the number of chunks is `data / ram`. So we have:
total_access_time1 = num_chunks^2 * access_time
For the two-pass solution, it makes sense to use `sqrt(num_chunks)` buffers per merge to minimize access time. So the buffer size is `ram / sqrt(num_chunks)` and we have:
total_access_time2 = 2 * (data / (ram / sqrt(num_chunks))) * access_time
= 2 * num_chunks^1.5 * access_time
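The transfer and access formulas above can be folded into one small cost function. This is my own sketch (names are mine); it assumes each pass splits the fan-in evenly, so an n-pass merge uses `num_chunks**(1/n)` buffers per pass, and it uses decimal units (1 GB = 1000 MB) to match the arithmetic below:

```python
def merge_time(data_gb, ram_gb, num_passes,
               transfer_rate_mb_s=100.0, access_time_ms=10.0):
    """Total time (s) for a multi-pass external merge:
    transfer time (read + write all data once per pass)
    plus access time for all input-buffer refills."""
    data_mb = data_gb * 1000.0
    ram_mb = ram_gb * 1000.0
    num_chunks = data_mb / ram_mb
    fan_in = num_chunks ** (1.0 / num_passes)   # buffers per merge pass
    buffer_mb = ram_mb / fan_in
    transfer = num_passes * 2 * data_mb / transfer_rate_mb_s
    buffer_reads = num_passes * data_mb / buffer_mb
    access = buffer_reads * access_time_ms / 1000.0
    return transfer + access
```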
So if we use `transfer_rate = 100 MB/s`, `access_time = 10 ms`, `data = 100 GB`, and `ram = 1 GB`, the total time is:
total_time1 = (2 * 100 GB / 100 MB/s) + 100^2 * 10 ms
= 2000 s + 100 s = 2100 s
total_time2 = (2 * 2 * 100 GB / 100 MB/s) + 2 * 100^1.5 * 10 ms
= 4000 s + 20 s = 4020 s
The effect of access time is still very small. So let's change data to 1000 GB:
total_time1 = (2 * 1000 GB / 100 MB/s) + 1000^2 * 10 ms
= 20000 s + 10000 s = 30000 s
total_time2 = (2 * 2 * 1000 GB / 100 MB/s) + 2 * 1000^1.5 * 10 ms
= 40000 s + 632 s = 40632 s
Now a substantial fraction of the one-pass time (10000 s out of 30000 s) is spent on disk seeks. Let's try with 5000 GB:
total_time1 = (2 * 5000 GB / 100 MB/s) + 5000^2 * 10 ms
= 100000 s + 250000 s = 350000 s
total_time2 = (2 * 2 * 5000 GB / 100 MB/s) + 2 * 5000^1.5 * 10 ms
= 200000 s + 7071 s = 207071 s
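The three scenarios can be reproduced with a short standalone script (my own sketch; constants and names are mine, decimal units as in the arithmetic above):

```python
TRANSFER = 100.0   # MB/s
ACCESS = 0.010     # seconds per buffer read
RAM_MB = 1000.0    # 1 GB of RAM

def total_time(data_gb, passes):
    """Transfer + buffer-refill access time for a multi-pass merge."""
    data_mb = data_gb * 1000.0
    chunks = data_mb / RAM_MB
    fan_in = chunks ** (1.0 / passes)          # buffers per merge pass
    reads = passes * data_mb / (RAM_MB / fan_in)
    return passes * 2 * data_mb / TRANSFER + reads * ACCESS

for gb in (100, 1000, 5000):
    t1, t2 = total_time(gb, 1), total_time(gb, 2)
    print(f"{gb:5d} GB: one pass {t1:8.0f} s, two passes {t2:8.0f} s")
```

At 100 GB and 1000 GB the single pass wins; by 5000 GB the quadratic growth of seeks in the one-pass version has overtaken the extra transfer cost of a second pass.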
Now the two-pass version is faster.