bash sorting gnu-sort

GNU sort inconsistent behaviour for empty columns


I'm using the sort utility (GNU coreutils, version 8.4) to sort a file by its first five columns, all of which are numeric and separated by tabs. To do so I'm using the following call:

sort --field-separator=$'\t' -nk1 -nk2 -nk3 -nk4 -nk5 myFileUnsorted.bcp >  myFileSorted.bcp

This works fine for the most part, but I'm getting some (seemingly) inconsistent behaviour when there are empty values. In my specific case the entries in the third (and fourth) column are empty, and what I would expect the sorted result to look like is this:

...
1   2           0   ...
1   2           84  ...
1   2           168 ...
...

In my output file, however, I'm getting the following order:

1   2           0   ...
1   2   1       0   ...
1   2   1       84  ...
...
1   2   64      168 ...
1   2           84  ...
1   2           168 ...

Regardless of whether the entries containing empty values in the third(/fourth) column should be placed at the beginning or the end, I would expect them to be placed together.

Looking at the three lines in question in a hex editor (vim version 7.4 with :%!xxd) I get the following:

31 09 32 09 09 09 30 09        ...
31 09 32 09 09 09 38 34 09     ...
31 09 32 09 09 09 31 36 38 09  ...

This leads me to believe that there are no special, invisible characters in the empty columns that could be responsible for them being sorted apart from one another.

Does anyone know why sort orders the lines the way it does? Is it possible to have them be arranged in the way they are in my first example/expected output? Thanks in advance!

I'm using bash (GNU bash, version 4.1.2(1)-release (x86_64-redhat-linux-gnu)) and have also tried ksh (version AJM 93u+); both yielded the same result, if that makes any difference.


Solution

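  • By using -nk3, you told sort where the key starts (the third column) but not where it ends, so it used the whole remaining line as the key. With -n, leading blanks (tabs included) are skipped before the number is read, so for a row whose third and fourth columns are empty the number that actually gets compared is the fifth column's value (0, 84 or 168). That is why those rows end up interleaved with the rows whose third column runs from 1 to 64.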

    To only use the specific column, use

    -nk3,3
    

    In fact, I'd use the same notation for all the columns where I don't want to include the rest of the line.

    sort --field-separator=$'\t' -nk1,1 -nk2,2 -nk3,3 -nk4,4 -nk5,5 \
        myFileUnsorted.bcp > myFileSorted.bcp
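
To see the difference on a small input, here is a minimal sketch. The five sample rows, the variable names (tab, sample, open_ended, bounded) and the LC_ALL=C setting are my own assumptions for illustration, not part of the original file; the rows just mimic the layout in the question, with the third and fourth columns empty on some lines.

```shell
# Literal tab character, built portably so this also runs in plain sh.
tab=$(printf '\t')

# Hypothetical sample rows: five tab-separated columns, deliberately unsorted.
# Columns 3 and 4 are empty on three of the rows.
sample=$(printf '1\t2\t\t\t168\n1\t2\t64\t\t168\n1\t2\t1\t\t0\n1\t2\t\t\t0\n1\t2\t\t\t84')

# Open-ended keys, as in the question: each key runs to the end of the line.
open_ended=$(printf '%s\n' "$sample" |
    LC_ALL=C sort -t "$tab" -nk1 -nk2 -nk3 -nk4 -nk5)

# Bounded keys: each key starts and ends in the same column.
bounded=$(printf '%s\n' "$sample" |
    LC_ALL=C sort -t "$tab" -nk1,1 -nk2,2 -nk3,3 -nk4,4 -nk5,5)

echo "open-ended keys:"
printf '%s\n' "$open_ended"
echo "bounded keys:"
printf '%s\n' "$bounded"
```

With the bounded keys, an empty column counts as 0, so the three rows with an empty third column are grouped together first and ordered among themselves by the fifth column (0, 84, 168), followed by the rows with 1 and 64 in the third column. With the open-ended keys on the sort version from the question, those rows get separated as described above.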