sortingawksplit

Successive sorting of repeating data but different number of lines


I have a data with following format:

2
1
A    11.364500   48.395199   40.160500         5
B    35.067699   30.944700   24.155300         4
4
2
A    26.274099   38.533298   32.624298         4
B    36.459000   29.662701   17.806299         5
A    15.850300   28.130899   24.753901         4
A    32.098000   33.665699   20.372499         4
5
3
A    17.501801   44.279202    8.005670         5
B    35.853001   43.095901   17.402901         4
B     1.326100   17.127600   39.600300         4
A     9.837760   41.103199   13.062300         5
B    31.686800   44.997501   16.619499         4
3
4
B    31.274700    8.726580   25.267599         4
A    19.032400   41.384701   19.456301         5
B    19.441900   24.286400    6.961680         4
1
5
B    48.973999   15.508400    5.099710         4
6
6
A    42.586601   21.343800   23.280701         5
B    30.145800   13.256200   30.713800         4
B    29.186001   44.353699    9.057160         4
A    39.311199   27.371201   35.473999         5
B    31.437799   30.415001   37.454399         4
B    48.501999    1.277730   21.900600         4
...

The data consists of multiple repeating data blocks. Each data block has the same format as follows:

First line = Number of lines of data block
Second line = ID of data block
From third line = Actual data lines (with number of lines designated in the first line)

For example, if we see the first data block:

2 (=> This data block has 2 number of lines) 
1 (=> This data block's ID is 1 (very first data block)) 
A    11.364500   48.395199   40.160500         5
B    35.067699   30.944700   24.155300         4
(Third and fourth lines of the data block are actual data. 
 The integer of the first line, the number of lines for this data block, is 2, so the actual data consist of 2 lines. 
 All other data blocks follow the same format. ) 
  1. The first two lines are headers of each repeating data block. The first line designates the number of lines of the data block, and the second line is the ID of the data block.
  2. Then the real data follows from the third line of each data block. As I explained, the number of lines of "real data" is designated in the first line as an integer number.
  3. Accordingly, the total number of lines for each data block will be number_of_lines+2. In this example, the total number of lines for data block 1 is 4, and data block 2 costs 6 lines...

This format repeats 100,000 times. In other words, I have 100,000 data blocks just like this example in the single text file. I can provide the total number of the data blocks.

What I wish to do is sort "Actual data lines" of each data block by the fourth column and print out the data in the same format as the original data file. What I wish to achieve from the examples above will look like:

2
1
A    11.364500   48.395199   40.160500         5
B    35.067699   30.944700   24.155300         4
4
2
A    26.274099   38.533298   32.624298         4
A    15.850300   28.130899   24.753901         4
A    32.098000   33.665699   20.372499         4
B    36.459000   29.662701   17.806299         5
5
3
B     1.326100   17.127600   39.600300         4
B    35.853001   43.095901   17.402901         4
B    31.686800   44.997501   16.619499         4
A     9.837760   41.103199   13.062300         5
A    17.501801   44.279202    8.005670         5
3
4
B    31.274700    8.726580   25.267599         4
A    19.032400   41.384701   19.456301         5
B    19.441900   24.286400    6.961680         4
1
5
B    48.973999   15.508400    5.099710         4
6
6
B    31.437799   30.415001   37.454399         4
A    39.311199   27.371201   35.473999         5
B    30.145800   13.256200   30.713800         4
A    42.586601   21.343800   23.280701         5
B    48.501999    1.277730   21.900600         4
B    29.186001   44.353699    9.057160         4
...

I know the typical command to sort using the 4th column the data in Linux, which is:

sort -gk4 data.txt > data_4sort.txt

However, in this case, I need to perform sorting per each data block. Plus, each data block does not have a uniform number of lines, but the number of data lines for each data block varies.

I really have no idea how can I even approach this. I'm considering using a shell script with for loop and/or awk, sed, split, etc, with sort command. But I'm not even sure how can I use the for loop with this problem, or if the for loop and shell script really required to perform this block-wise sorting.


Solution

  • This can be done using a combination of awk and sort. You use awk to drive the whole process as a very simple state machine, and it follows these rules:

    The actions are basically, for each line.

    At the end, sort/print/delete the sort file if needed (since there won't be a final transition from long line to short line after the final block, this action takes care of that).

    That processing is implemented as per below, a shell script that calls awk to do the work:

    awk '
        function sort() {
            close(tmp_file)
            system("sort -rnk4 "tmp_file"; rm -rf "tmp_file)
        }
        BEGIN { tmp_file = "/tmp/tmp_file" }
        state == 0 && NF == 1 { print ; next }
        state == 0 && NF != 1 { print >tmp_file ; state = 1 ; next }
        state == 1 && NF != 1 { print >>tmp_file ; next }
        state == 1 && NF == 1 { sort() ; print ; state = 0 ; next }
        END { if (state == 1) { sort() } }
    ' input_file
    

    You can also do it without a temporary file with an advanced enough awk variant, one that allows for pipes:

    awk 'NF > 1 { print | "sort -rnk4"; next }
                { close("sort -rnk4"); print }
    ' input_file
    

    The advantage of this is a much simpler and shorter awk program that does most of the stuff in the previous script implicitly.

    Running either of those scripts over your input file generates, as requested, each separate grouping sorted on the fourth column (the final float):

    2
    1
    A    11.364500   48.395199   40.160500         5
    B    35.067699   30.944700   24.155300         4
    4
    2
    A    26.274099   38.533298   32.624298         4
    A    15.850300   28.130899   24.753901         4
    A    32.098000   33.665699   20.372499         4
    B    36.459000   29.662701   17.806299         5
    5
    3
    B     1.326100   17.127600   39.600300         4
    B    35.853001   43.095901   17.402901         4
    B    31.686800   44.997501   16.619499         4
    A     9.837760   41.103199   13.062300         5
    A    17.501801   44.279202    8.005670         5
    3
    4
    B    31.274700    8.726580   25.267599         4
    A    19.032400   41.384701   19.456301         5
    B    19.441900   24.286400    6.961680         4
    1
    5
    B    48.973999   15.508400    5.099710         4
    6
    6
    B    31.437799   30.415001   37.454399         4
    A    39.311199   27.371201   35.473999         5
    B    30.145800   13.256200   30.713800         4
    A    42.586601   21.343800   23.280701         5
    B    48.501999    1.277730   21.900600         4
    B    29.186001   44.353699    9.057160         4