Using sort with numeric columns separated by letters & underscores


I'm trying to sort phrases such as the following:

a12_b7
a12_b11
a5_b3
a5_b30
a12_b10

by the numbers that follow the letters: numerically by the first number, then numerically by the second. For the example above, I expect the result to be:

a5_b3
a5_b30
a12_b7
a12_b10
a12_b11

Reading man sort, I thought I had this figured out, but it does not work the way I thought it would:

$ cat | sort --debug --key=1.2 --key=2.2 --field-separator=_
sort: text ordering performed using ‘en_IL.UTF-8’ sorting rules
a12_b10
 ______
     __
_______
a12_b11
 ______
     __
_______
a12_b7
 _____
     _
______
a5_b3
 ____
    _
_____
a5_b30
 _____
    __
______

What have I gotten wrong? And what would be the appropriate sort command-line in this case?


Solution

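  • It looks like you want to sort numerically, whereas sort's default order is alphabetical. Without -n each key is compared as text, which is why a12 sorts before a5 in your output. Your keys also have no end position, so -k1.2 runs from the second character of field 1 to the end of the line (the --debug underlines show this); with -n that is harmless here, because a numeric comparison reads only the leading number of each key. You could do: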

    $ sort -nt'_' -k1.2 -k2.2 file
    a5_b3
    a5_b30
    a12_b7
    a12_b10
    a12_b11
    

    but if the input were any more complicated than that (e.g. not always a single letter before each sort key), then I'd use the Decorate-Sort-Undecorate idiom, e.g.:

    $ cat file
    phc12_bob7
    efg12_bk11
    cfad5_xxxx3
    df5_chekb30
    a12_tg10
    

    $ sed -E 's/([^0-9]+)(.*)(_[^0-9]+)(.*)/\1\t\2\t\3\t\4/' file
    phc     12      _bob    7
    efg     12      _bk     11
    cfad    5       _xxxx   3
    df      5       _chekb  30
    a       12      _tg     10
    

    $ sed -E 's/([^0-9]+)(.*)(_[^0-9]+)(.*)/\1\t\2\t\3\t\4/' file | sort -k2,2n -k4,4n
    cfad    5       _xxxx   3
    df      5       _chekb  30
    phc     12      _bob    7
    a       12      _tg     10
    efg     12      _bk     11
    

    $ sed -E 's/([^0-9]+)(.*)(_[^0-9]+)(.*)/\1\t\2\t\3\t\4/' file | sort -k2,2n -k4,4n | tr -d '\t'
    cfad5_xxxx3
    df5_chekb30
    phc12_bob7
    a12_tg10
    efg12_bk11
    

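    The sed modifies (Decorates) the input by putting each number in its own tab-separated field so that sort can Sort it numerically; the tr then removes the tabs that sed added (Undecorates). Note that tr -d '\t' would also delete any tabs that were already present in the input.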
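
The Decorate-Sort-Undecorate pipeline above can be wrapped in a small shell function for reuse. This is only a sketch under some assumptions: the name sort_by_numbers is made up for illustration, every input line is assumed to have the shape <non-digits><number>_<non-digits><number>, and the input is assumed to contain no tabs of its own. It builds a literal tab with printf instead of relying on sed understanding \t in the replacement (GNU sed does, but other sed implementations may not), and it passes that tab to sort as the field separator.

```shell
#!/bin/sh
# Hypothetical helper (not from the original answer): reads lines shaped like
# <non-digits><number>_<non-digits><number> on stdin and prints them sorted
# numerically by the first number, then by the second.
sort_by_numbers() {
    tab=$(printf '\t')   # a literal tab character

    # Decorate:   sed puts a tab around each number, giving 4 fields.
    # Sort:       fields 2 and 4 hold the numbers; compare each numerically.
    # Undecorate: tr deletes the tabs again (assumes none were in the input).
    sed -E "s/^([^0-9]+)([0-9]+)(_[^0-9]+)([0-9]+)\$/\\1${tab}\\2${tab}\\3${tab}\\4/" |
        sort -t "$tab" -k2,2n -k4,4n |
        tr -d '\t'
}

# Usage with the sample data from the answer above.
printf '%s\n' phc12_bob7 efg12_bk11 cfad5_xxxx3 df5_chekb30 a12_tg10 |
    sort_by_numbers
```

With the sample data this should print the same five lines as the final pipeline above, starting with cfad5_xxxx3 and ending with efg12_bk11. Lines that do not match the assumed shape pass through sed unchanged and are then sorted with empty numeric keys, so they would need separate handling.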