Tags: sorting, column-sorting, external-sorting, gnu-sort

gnu-sort - what does the manual mean when it says the merge option does "not sort"


I am trying to sort a file that is too big to fit into memory. The man page for GNU sort says of the -m option: "merge already sorted files; do not sort". I am struggling to understand the implications of this so I can make sure that sort accomplishes what I want. This post (Sorting in pandas for large datasets) suggests accomplishing such a task with a combination of GNU split and GNU sort: first break the file into smaller pieces that fit into memory, sort each of them individually, and then recombine them. My experiments so far indicate that this procedure does work.

Nevertheless, I am troubled by the manual's statement that the merge option does not sort. For my purposes, the large file must end up entirely sorted, not merely a concatenation of locally sorted pieces. Although I have tested the procedure on small examples and it seems to work, the manual makes me hesitant to apply it to my actual situation: I am concerned that unexpected behavior may arise in cases where it is no longer feasible to verify whether GNU sort did what I intended.

To give an MWE, consider this tab-separated file that I would like to sort:

3   4
2   5
3   1
1   3

I tried the following operations:

SortDir="/Users/aireties/Desktop/Sort_Experiments"
## sort document as a whole (in practice, this would be infeasible due to document size)
sort --field-separator=$'\t' -k 1,1 -k 2,2 "$SortDir/To_Be_Sorted.txt" -o "$SortDir/Sorted_as_Whole.txt"  ## sort first by the first column values, then by the second

1   3
2   5
3   1
3   4

This is the "correct" solution when sorting the entire file at once (something which is infeasible in my actual use case).

If I try to break the file into pieces and then use the -m option right away, I get an incorrect result:

## Break file into pieces
MaxLines=2
mkdir "$SortDir/Pieces/"
split -l $MaxLines "$SortDir/To_Be_Sorted.txt" "$SortDir/Pieces/"
## Try merge sort on pieces without first sorting them
sort -m --field-separator=$'\t' -k 1,1 -k 2,2 "$SortDir/Pieces/"* -o "$SortDir/Sorted_in_Pieces1.txt"

3   1
1   3
3   4
2   5

What appears to have happened is that GNU sort assumed each piece was already sorted and merged them on that assumption: at every step it compares only the current first line of each piece and emits the smaller one. Here that drains the second piece first (3 1 wins against 3 4 on the second key, then 1 3 wins against 3 4 on the first key), after which the remainder of the first piece is copied out unchanged. The net effect is that the second piece was placed ahead of the first, with no other sorting done.
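
In hindsight, sort itself can report that the pieces violate the precondition of -m. The following is a sketch using the standard -c (--check) option, which exits non-zero and prints the first out-of-order line when a file is not sorted under the given keys:

## Check whether each piece is sorted by the intended keys;
## with -c, sort verifies order instead of sorting.
for file in "$SortDir/Pieces/"*
do
  sort -c --field-separator=$'\t' -k 1,1 -k 2,2 "$file" || echo "$file is not sorted"
done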

Alternatively, if I follow the procedure advocated here (Sorting in pandas for large datasets), first sorting the pieces and then merging, I do seem to get the correct result:

for file in "$SortDir/Pieces/"*  ## sort each piece file in place
do
  sort --field-separator=$'\t' -k 1,1 -k 2,2 "$file" -o "$file"
done

sort -m --field-separator=$'\t' -k 1,1 -k 2,2 "$SortDir/Pieces/"* -o "$SortDir/Sorted_in_Pieces2.txt"    

1   3
2   5
3   1
3   4


cmp --silent "$SortDir/Sorted_in_Pieces1.txt" "$SortDir/Sorted_as_Whole.txt" || echo "files are different"
# files are different
cmp --silent "$SortDir/Sorted_in_Pieces2.txt" "$SortDir/Sorted_as_Whole.txt" || echo "files are different"
# (no output: the files are identical)
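
At full scale, comparing against a reference produced by sorting the whole file at once is precisely what I cannot do. One check that does remain feasible (a sketch using the same standard -c option as above) is to have sort verify in a single pass that the merged output is globally ordered:

## Verify global order directly; no reference file needed.
## Exit status 0 means the file is sorted under these keys.
sort -c --field-separator=$'\t' -k 1,1 -k 2,2 "$SortDir/Sorted_in_Pieces2.txt" \
  && echo "globally sorted"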

For me, the sticking point is that if the piece files are large, a substantial amount of computation still has to happen to merge them into a single, properly sorted file. I therefore find it hard to wrap my mind around how such a non-trivial amount of work can be described as the result of an operation that claims it does "not sort."

Can anyone enlighten me as to why the manual is phrased this way? Why and how can we be confident that GNU sort will reliably do what it claims when using the merge option? Is the manual hinting at situations in which this procedure would fail to achieve the desired result?


Solution

  • GNU sort (at least the version whose source code I looked at) sorts chunks of the file in memory and creates a set of temporary files (one temporary file per chunk). It also uses multi-threading during the in-memory sort phase (the --parallel option caps the number of threads). After all of the temporary files have been created, it repeatedly performs 16-way merges of the temporary files (the width can be overridden with --batch-size) until a single sorted file is produced.

    The point here is that you do not have to split the file into separate files first: GNU sort handles a large file automatically, creating sorted temporary files as needed and merging them into a single sorted file (see the sketch after this list).

    The -m option is for the special case where multiple already sorted files are to be merged. Merging is a single linear pass that only ever compares the current head line of each input, which is why the manual describes it as not sorting: it relies on each input already being in order and never reorders lines within an input.
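
    For example, the whole job can be done in one invocation. This is a sketch; -T (temporary directory), -S (memory buffer size), --parallel, and --batch-size are standard GNU sort options, but all are optional and their defaults vary by version:

    ## Let sort do the external sorting itself: it writes sorted
    ## temporary chunks under the -T directory, then merges them.
    mkdir -p "$SortDir/tmp"
    sort --field-separator=$'\t' -k 1,1 -k 2,2 \
         -T "$SortDir/tmp" \
         -S 1G \
         --parallel=4 \
         --batch-size=16 \
         "$SortDir/To_Be_Sorted.txt" -o "$SortDir/Sorted_as_Whole.txt"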