arraysalgorithmsortingexternal-sorting

External Sort between two files


I'm trying to get my head around an external sort for a requirement I have - and I can't.

The requirement is to externally sort a file of an arbitrary size but using just the original file and one other (call them fileA and fileB) - two files including the original. I can read/write to either of these - so can swap between the two...

I cannot figure out how to implement this - as most sorting algorithms require you to be able to have an overview of the entire array in memory to sort it, surely?

Say I have a random integer array:

[1, 5, 8, 7, 3, 4, 1, 9, 0, 1, 8, 7, 7, 3, 2, 9, 1, 2];

And at any given time, I can only read four pages (e.g. four integers) into memory.

On each pass, this gives me five separate arrays to sort:

[1, 5, 8, 7]
[3, 4, 1, 9] 
[0, 1, 8, 7] 
[7, 3, 2, 9]
[1, 2]

If I apply an in-memory sort on these, I then get:

[1, 5, 7, 8]
[1, 3, 4, 9] 
[0, 1, 7, 8] 
[2, 3, 7, 9]
[1, 2]

But if I can only fit four pages into memory at any one time, I don't see how I can further sort these without some horrible complex algorithm which loops over the entire array again and again to ensure its all sorted.

I'm thoroughly confused - because without reading the entire array into memory, we have no idea what elements are before the four pages, or after - so we cannot truly sort them?

Can somebody help me please and explain the crucial step in solving this?


Solution

  • Since, the basic idea of External Sort is to merge the lists that are bigger than the available memory, therefore you control the lists (which may be actually implemented as arrays or linked lists or anything else) by means of a handle over them. In order to read an element from a list, you will use some method like listHandle.getNextElement(). To write to disk in the list, use mergedDoubleSizedList.writeNextElement().

    After you have:

    [1, 5, 7, 8] // controlled using handle1
    [1, 3, 4, 9] // controlled using handle2
    [0, 1, 7, 8] // controlled using handle3
    [2, 3, 7, 9] // controlled using handle4
    [1, 2] // controlled using handle5
    

    and that you read only 4 ints, you get the handle over the first two arrays (handle1 and handle2), read them element by element simultaneously, and write them back as one consolidated array which is sorted (mergedListHandle1). Like this:

    [1, 1, 3, 4, 5, 7, 8, 9] // written by creating new handle - mergedListHandle1
    [0, 1, 2, 3, 7, 7, 8, 9] // written by creating - mergedListHandle2
    [1, 2] // written back by creating mergedListHandle3
    

    Now again you get the handle over two of the arrays (mergedListHandle1 and mergedListHandle2) resulting from the previous step and keep merging them on and on until you are left with only two handles, resulting in one final sorted array. Please provide your code if you want the code based solution for the same.

    At a time, you will have only 4 elements in the memory if that's what your memory allows. So, to merge the lists represented by handle1 and handle2, you will do the following:

    1. Read first element from handle1 and handle2 (1 and 1)
    2. Write the smaller of these two to mergedListHandle1 (i.e. write 1 of handle1)
      1. You may not flush the numbers in mergedListHandle1 at the moment.
    3. Read next element from handle1 (5)
    4. Write the smaller of the current numbers from handle1 and handle2 to mergedListHandle1
    5. Flush the contents of mergedListHandle1 when it is full
    6. Fetch next smaller handles from disks (handle3 and handle4) and repeat the same cycle with them while you write to the new bigger list handle called mergedListHandle2.