I am studying databases from the book Fundamentals of Database Systems, from authors Elmasri and Navathe, 5th edition, and they explain briefly external sort using merge sort in almost at the beginning of chapter 15. They divide the algorithm in two phases:
1) Sorting: They use the next notation:
In this phase we put in memory as many blocks as we can of the data file, we sort them using any internal sorting algorithm and we write them as a temporary sorted subfile. We repeat this with the rest of the blocks of the file, so we will get more sorted subfiles. Those subfiles are which they call "portions", and the number of them is:
nr = ⌈ b / nb ⌉.
The symbols ⌈ ⌉ denote the ceiling function. The I/O cost of this phase is 2b, because we need to read each block one time (b accesses). Then, to save all the portions we also need to make b accesses.
2) Merging: They say something similar to this (I rewrote it using my interpretation to make it clearer):
The resulting portions (ordered subfiles) are mixed in one or more passes. For each pass an output block is reserved in memory to be placing the result of the mixtures, and the rest are used as input blocks, which can be up to nb - 1, and in which one block is placed at a time of each of the ordered portions, with the purpose of mixing them. More than one pass is needed when there are fewer input blocks than portions. In addition, since each portion can have more than one block, each pass is subdivided into iterations, in each of which a block of each portion is placed.
The number dm must be equal to the minimum between (nb - 1) and nr. If we place the base of the logarithm between ( ), and its argument between 〖〗, the number of passes is:
⌈ log(dm)〖nr〗⌉.
The part I am cofused with is that they say that the cost of this phase is
2b * ⌈ log(dm)〖nr〗⌉,
so they are basically implying that in each pass we only need to read each block once and write it once, but I am not sure if that is correct. I suspect that more accesses may be necessary.
Therefore, the total cost of the algorithm is 2b + 2b * ⌈log(dm)〖nr〗⌉
= 2b (1 + ⌈log(dm)〖nr〗⌉)
Actually, they don't say that in that way, but: "In general, the logarithm is taken in base dm and the expression indicating the number of blocks accessed is as follows:"
(2*b) + (2* (b* (log(dm)〖nr〗))),
which is basically the same.
For example, suppose we have a file of 10 blocks, with 3 records per block. The available space in memory (buffer pool) has the size of 4 blocks. Let's separe the blocks of the file with ||
29,11,27 || 22,1,20 || 7,30,26 || 9,8,21 || 13,24,15 || 23,4,28 || 17,12,10|| 5,3,6 || 16,19,2 || 25,14,18
The number of portions 'nr' that result in the sorting phase is ⌈10/4⌉ = 3.
p1 = 1,7,8 || 9,11,20 || 21,22,26 || 27,29,30
p2 = 3,4,5 || 6,10,12 || 13,15,17 || 23,24,28
p3 = 2,14,16 || 18,19,25
In the meging phase, dm = minimum{nb-1, nr} = minimum{4-1,3} = 3. Then, the number of passes is log(3)〖3〗= 1. According to the formula, we should make 20 I/O's in this phase.
Iteration 1: We put these blocks in memory:
1,7,8 || 3,4,5 || 2,14,16
and they transform into this (once block at a time, which is saved in disk):
1,2,3 || 4,5,7 || 8,14,16
6 access to disk.
Iteration 2:
9,11,20 || 6,10,12 || 18,19,25
and they transform into this:
6,9,10 || 11,12,18 || 19,20,25
6 access to disk (there are already 12 accumulated).
What am I doing wrong, and how do I continue?
I assuming initial pass produces sorted runs of size {3,3,3,3,3,3,3,3,3,3} (10 blocks, 30 records). I'm not sure about dm, but the number of merge passes is ⌈log3⌉(10) = 3. 1st merge pass results in sorted runs of size {9,9,9,3} (10 blocks). 2nd merge pass results in sorted runs of size {27,3} (10 blocks). 3rd merge pass results in a single sorted run {30} (10 blocks).
The initial pass and the 3 merge passes each involve 20 I/O's, for a total of 80 I/O's.