I have a data with following format:
2
1
A 11.364500 48.395199 40.160500 5
B 35.067699 30.944700 24.155300 4
4
2
A 26.274099 38.533298 32.624298 4
B 36.459000 29.662701 17.806299 5
A 15.850300 28.130899 24.753901 4
A 32.098000 33.665699 20.372499 4
5
3
A 17.501801 44.279202 8.005670 5
B 35.853001 43.095901 17.402901 4
B 1.326100 17.127600 39.600300 4
A 9.837760 41.103199 13.062300 5
B 31.686800 44.997501 16.619499 4
3
4
B 31.274700 8.726580 25.267599 4
A 19.032400 41.384701 19.456301 5
B 19.441900 24.286400 6.961680 4
1
5
B 48.973999 15.508400 5.099710 4
6
6
A 42.586601 21.343800 23.280701 5
B 30.145800 13.256200 30.713800 4
B 29.186001 44.353699 9.057160 4
A 39.311199 27.371201 35.473999 5
B 31.437799 30.415001 37.454399 4
B 48.501999 1.277730 21.900600 4
...
The data consists of multiple repeating data blocks. Each data block has the same format as follows:
First line = Number of lines of data block
Second line = ID of data block
From third line = Actual data lines (with number of lines designated in the first line)
For example, if we see the first data block:
2 (=> This data block has 2 number of lines)
1 (=> This data block's ID is 1 (very first data block))
A 11.364500 48.395199 40.160500 5
B 35.067699 30.944700 24.155300 4
(Third and fourth lines of the data block are actual data.
The integer of the first line, the number of lines for this data block, is 2, so the actual data consist of 2 lines.
All other data blocks follow the same format. )
This format repeats 100,000 times. In other words, I have 100,000 data blocks just like this example in the single text file. I can provide the total number of the data blocks.
What I wish to do is sort "Actual data lines" of each data block by the fourth column and print out the data in the same format as the original data file. What I wish to achieve from the examples above will look like:
2
1
A 11.364500 48.395199 40.160500 5
B 35.067699 30.944700 24.155300 4
4
2
A 26.274099 38.533298 32.624298 4
A 15.850300 28.130899 24.753901 4
A 32.098000 33.665699 20.372499 4
B 36.459000 29.662701 17.806299 5
5
3
B 1.326100 17.127600 39.600300 4
B 35.853001 43.095901 17.402901 4
B 31.686800 44.997501 16.619499 4
A 9.837760 41.103199 13.062300 5
A 17.501801 44.279202 8.005670 5
3
4
B 31.274700 8.726580 25.267599 4
A 19.032400 41.384701 19.456301 5
B 19.441900 24.286400 6.961680 4
1
5
B 48.973999 15.508400 5.099710 4
6
6
B 31.437799 30.415001 37.454399 4
A 39.311199 27.371201 35.473999 5
B 30.145800 13.256200 30.713800 4
A 42.586601 21.343800 23.280701 5
B 48.501999 1.277730 21.900600 4
B 29.186001 44.353699 9.057160 4
...
I know the typical command to sort using the 4th column the data in Linux, which is:
sort -gk4 data.txt > data_4sort.txt
However, in this case, I need to perform sorting per each data block. Plus, each data block does not have a uniform number of lines, but the number of data lines for each data block varies.
I really have no idea how can I even approach this. I'm considering using a shell script with for loop and/or awk, sed, split, etc, with sort command. But I'm not even sure how can I use the for loop with this problem, or if the for loop and shell script really required to perform this block-wise sorting.
This can be done using a combination of awk
and sort
. You use awk
to drive the whole process as a very simple state machine, and it follows these rules:
The actions are basically, for each line.
At the end, sort/print/delete the sort file if needed (since there won't be a final transition from long line to short line after the final block, this action takes care of that).
That processing is implemented as per below, a shell script that calls awk
to do the work:
awk '
function sort() {
close(tmp_file)
system("sort -rnk4 "tmp_file"; rm -rf "tmp_file)
}
BEGIN { tmp_file = "/tmp/tmp_file" }
state == 0 && NF == 1 { print ; next }
state == 0 && NF != 1 { print >tmp_file ; state = 1 ; next }
state == 1 && NF != 1 { print >>tmp_file ; next }
state == 1 && NF == 1 { sort() ; print ; state = 0 ; next }
END { if (state == 1) { sort() } }
' input_file
You can also do it without a temporary file with an advanced enough awk
variant, one that allows for pipes:
awk 'NF > 1 { print | "sort -rnk4"; next }
{ close("sort -rnk4"); print }
' input_file
The advantage of this is a much simpler and shorter awk
program that does most of the stuff in the previous script implicitly.
Running either of those scripts over your input file generates, as requested, each separate grouping sorted on the fourth column (the final float):
2
1
A 11.364500 48.395199 40.160500 5
B 35.067699 30.944700 24.155300 4
4
2
A 26.274099 38.533298 32.624298 4
A 15.850300 28.130899 24.753901 4
A 32.098000 33.665699 20.372499 4
B 36.459000 29.662701 17.806299 5
5
3
B 1.326100 17.127600 39.600300 4
B 35.853001 43.095901 17.402901 4
B 31.686800 44.997501 16.619499 4
A 9.837760 41.103199 13.062300 5
A 17.501801 44.279202 8.005670 5
3
4
B 31.274700 8.726580 25.267599 4
A 19.032400 41.384701 19.456301 5
B 19.441900 24.286400 6.961680 4
1
5
B 48.973999 15.508400 5.099710 4
6
6
B 31.437799 30.415001 37.454399 4
A 39.311199 27.371201 35.473999 5
B 30.145800 13.256200 30.713800 4
A 42.586601 21.343800 23.280701 5
B 48.501999 1.277730 21.900600 4
B 29.186001 44.353699 9.057160 4