
Splitting text files on two consecutive lines containing only one integer number


I have a single long text file that contains a list of 3D coordinates. The file begins with a header like this:

10112
2455
121.417670 172.321300 1.704072
0.997697 0.067831 -0.000222
-0.067831 0.997697 0.000207
0.000236 -0.000191 1.000000
0.997697 0.067831 -0.000222 0
-0.067831 0.997697 0.000207 0
0.000236 -0.000191 1.000000 0
121.417670 172.321300 1.704072 1

The list of coordinates starts after the header. Each line consists of 3 to 7 numbers. For example:

0.001686 0.812066 -1.686245 0.074434
0.001695 0.816359 -1.692300 0.087190
0.001699 0.818673 -1.694508 0.097398
...

The total length of the list is equal to the product of the first two numbers of the header (10112*2455). These are PTX files, which contain 3D points from laser scanning in text format.

The problem is that the file is a concatenation of several such blocks (header plus coordinates), and I want to split it at each header. The ideal solution would split the file at the two consecutive lines that each hold a single integer. I was looking for a generic solution using, for example, csplit, but csplit matches one line at a time, so it cannot detect two consecutive lines.

As a last resort, I will write a piece of software myself, but I would prefer a solution based on CLI tools (Awk?), if one is available.

Any ideas?

Thank you

Edit: examples

Let's say I have a file with the following content:

2
3
121.417670 172.321300 1.704072
0.997697 0.067831 -0.000222
-0.067831 0.997697 0.000207
0.000236 -0.000191 1.000000
0.997697 0.067831 -0.000222 0
-0.067831 0.997697 0.000207 0
0.000236 -0.000191 1.000000 0
121.417670 172.321300 1.704072 1
6.001686 0.812066 -1.686245 0.074434
3.001695 0.816359 -1.692300 0.087190
6.001699 0.818673 -1.694508 0.097398
2.001686 0.812066 -1.686245 0.074434
1.001695 0.816359 -1.692300 0.087190
0.001699 0.818673 -1.694508 0.097398
3                                         <--- cut before this line
1
421.417670 172.321300 1.704072
0.997697 0.067831 -0.000222
-0.067831 0.997697 0.000207
0.000236 -0.000191 1.000000
0.997697 0.067831 -0.000222 0
-0.067831 0.997697 0.000207 0
0.000236 -0.000191 1.000000 0
421.417670 172.321300 1.704072 1
1.001686 0.812066 -1.686245 0.074434
2.001695 0.816359 -1.692300 0.087190
3.001699 0.818673 -1.694508 0.097398

In this case I should end up with two files, cut just before the first of the two lines that consist of a single integer.

As an alternative, since the two single-number lines say how many points the section contains, the first output file consists of the first 2*3+10=16 lines (10 header lines and 6 data lines), and the second file consists of the following 3*1+10=13 lines (again 10 header lines, this time with 3 data lines).


Solution

  • So you want to split a file into several smaller ones, repeating the header at the top of each.

    This can do it. Pass the number of data lines per output file as -v lines=XX and the number of header lines to repeat as -v head=YY:

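    awk -v lines=5 -v head=2 '
          # keep the first "head" lines so they can be repeated in every file
          NR<=head {header[NR]=$0; next}
          # every "lines" lines after the header: close the old file, start a new one with the header on top
          !((NR-head-1)%lines) {close(file); file="output_"++count; for (i=1;i<=head;i++) print header[i] > file}
          {print > file}
         ' file
    

    One-liner:

    awk -v lines=5 -v head=2 'NR<=head{header[NR]=$0; next} !((NR-head-1)%lines) {close(file); file="output_"++count; for (i=1;i<=head;i++) print header[i] > file} {print > file}' file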
    

    For the 10-line header sample from your question, head=2 and lines=5 produce two files:

    $ cat output_1
    10112
    2455
    121.417670 172.321300 1.704072
    0.997697 0.067831 -0.000222
    -0.067831 0.997697 0.000207
    0.000236 -0.000191 1.000000
    0.997697 0.067831 -0.000222 0
    $ cat output_2
    10112
    2455
    -0.067831 0.997697 0.000207 0
    0.000236 -0.000191 1.000000 0
    121.417670 172.321300 1.704072 1
    

    If what you want is to split the file at every header you find, this should do:

    awk '(!flag && NF==1) {header[1]=$1; flag=1; next} (flag && NF==1) {header[2]=$1; flag=0; file="output_"++count; printf "%d\n%d\n", header[1], header[2] > file; next} {print > file}' file
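
    The one-liner above accepts any one-field line as a header line. A slightly stricter variant, as a sketch only: it starts a new file only when two digits-only lines really follow each other, and it closes each output file before opening the next. The function name split_ptx and the scan_N.ptx file names are made up for illustration, and Unix line endings are assumed.

```shell
# Sketch: split a concatenated PTX file before every pair of consecutive
# integer-only lines. split_ptx and the scan_N.ptx names are hypothetical.
split_ptx() {
    awk '
        # A line holding nothing but digits is a candidate header line.
        /^[0-9]+$/ {
            if (pending != "") {
                # Second integer line in a row: a new block starts here.
                if (file != "") close(file)
                file = "scan_" (++count) ".ptx"
                print pending > file
                print > file
                pending = ""
            } else {
                pending = $0    # hold it until the next line is seen
            }
            next
        }
        {
            # A lone integer line followed by other data was ordinary data.
            if (pending != "") {
                if (file != "") print pending > file
                pending = ""
            }
            if (file != "") print > file
        }
        END { if (pending != "" && file != "") print pending > file }
    ' "$1"
}
```

    Calling split_ptx on the sample file from the question should leave the same two blocks as the one-liner, only under the scan_1.ptx and scan_2.ptx names.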
    

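    Explanation

    The first one-field line seen while flag is unset is stored in header[1] and sets flag. The next one-field line is stored in header[2], clears flag, picks a new output file name and writes both header lines to it. Every other line is appended to the current output file. The script assumes that one-field lines occur only in these header pairs.

    Given your sample file, the one-liner returns output_1 and output_2: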

    $ cat output_1
    2
    3
    121.417670 172.321300 1.704072
    0.997697 0.067831 -0.000222
    -0.067831 0.997697 0.000207
    0.000236 -0.000191 1.000000
    0.997697 0.067831 -0.000222 0
    -0.067831 0.997697 0.000207 0
    0.000236 -0.000191 1.000000 0
    121.417670 172.321300 1.704072 1
    6.001686 0.812066 -1.686245 0.074434
    3.001695 0.816359 -1.692300 0.087190
    6.001699 0.818673 -1.694508 0.097398
    2.001686 0.812066 -1.686245 0.074434
    1.001695 0.816359 -1.692300 0.087190
    0.001699 0.818673 -1.694508 0.097398
    $ cat output_2
    3
    1
    421.417670 172.321300 1.704072
    0.997697 0.067831 -0.000222
    -0.067831 0.997697 0.000207
    0.000236 -0.000191 1.000000
    0.997697 0.067831 -0.000222 0
    -0.067831 0.997697 0.000207 0
    0.000236 -0.000191 1.000000 0
    421.417670 172.321300 1.704072 1
    1.001686 0.812066 -1.686245 0.074434
    2.001695 0.816359 -1.692300 0.087190
    3.001699 0.818673 -1.694508 0.097398
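
    The alternative mentioned in the question (using the two counts to work out how long each block is) could look roughly like the following. This is a sketch under the question's own assumptions: every block has exactly 10 header lines including the two counts, and every point sits on its own line. The function name split_ptx_by_count and the block_N.ptx names are made up for illustration.

```shell
# Sketch: split by the counts stored in each header (first * second points,
# inside a fixed 10-line header). split_ptx_by_count is a hypothetical name.
split_ptx_by_count() {
    awk -v hdr=10 '
        left == 0 {                 # first line of a block: first count
            if (file != "") close(file)
            file = "block_" (++count) ".ptx"
            first = $1
            print > file
            left = -1               # marker: the second count comes next
            next
        }
        left == -1 {                # second line of a block: second count
            left = first * $1 + hdr - 2   # lines still owed to this block
            print > file
            next
        }
        { print > file; left-- }    # remaining header lines and the points
    ' "$1"
}
```

    With the sample file this gives block_1.ptx with 2*3+10=16 lines and block_2.ptx with 3*1+10=13 lines. It never inspects the data lines, so it keeps working even if a data line happens to contain a single integer.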