I have a single long text file that contains a list of 3D coordinates. The file begins with a header like this:
10112
2455
121.417670 172.321300 1.704072
0.997697 0.067831 -0.000222
-0.067831 0.997697 0.000207
0.000236 -0.000191 1.000000
0.997697 0.067831 -0.000222 0
-0.067831 0.997697 0.000207 0
0.000236 -0.000191 1.000000 0
121.417670 172.321300 1.704072 1
The list of coordinates starts after the header. Each line consists of 3 to 7 numbers. For example:
0.001686 0.812066 -1.686245 0.074434
0.001695 0.816359 -1.692300 0.087190
0.001699 0.818673 -1.694508 0.097398
...
The total length of the list is equal to the product of the first two numbers of the header (10112*2455). These are PTX files, which contain 3D points from laser scanning in text format.
The problem is that the file is a concatenation of several such sections, each made of a header followed by its coordinates, and I want to split the file at each header. The ideal solution would split the file at the two consecutive single-integer lines. I was looking for a generic solution using, for example, csplit, but csplit matches one line at a time, so it cannot detect the two consecutive lines.
As a last resort I will write a piece of software myself, but I would prefer a solution based on CLI tools (Awk?), if available.
Any ideas?
Thank you.
Let's say I have a file with the following content:
2
3
121.417670 172.321300 1.704072
0.997697 0.067831 -0.000222
-0.067831 0.997697 0.000207
0.000236 -0.000191 1.000000
0.997697 0.067831 -0.000222 0
-0.067831 0.997697 0.000207 0
0.000236 -0.000191 1.000000 0
121.417670 172.321300 1.704072 1
6.001686 0.812066 -1.686245 0.074434
3.001695 0.816359 -1.692300 0.087190
6.001699 0.818673 -1.694508 0.097398
2.001686 0.812066 -1.686245 0.074434
1.001695 0.816359 -1.692300 0.087190
0.001699 0.818673 -1.694508 0.097398
3 <--- cut before this line
1
421.417670 172.321300 1.704072
0.997697 0.067831 -0.000222
-0.067831 0.997697 0.000207
0.000236 -0.000191 1.000000
0.997697 0.067831 -0.000222 0
-0.067831 0.997697 0.000207 0
0.000236 -0.000191 1.000000 0
421.417670 172.321300 1.704072 1
1.001686 0.812066 -1.686245 0.074434
2.001695 0.816359 -1.692300 0.087190
3.001699 0.818673 -1.694508 0.097398
In this case I should end up with two files, cut just before the first of the two lines that consist of a single integer.
As an alternative, since the two single-number lines say how many points the section contains, the first output file consists of the first 2*3+10=16 lines (10 lines of header and 6 of data), and the second file consists of the subsequent 3*1+10=13 lines (again 10 lines of header, and this time 3 of data).
So you want to split a file into several files, printing the header in each of them.
This can do it. You just have to set the number of lines to store in each output file with -v lines=XX and the number of header lines to repeat with -v head=YY:
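awk -v lines=5 -v head=2 '
# store the first "head" lines so they can be repeated in every output file
NR<=head {header[NR]=$0; next}
# every "lines" lines after the header: start a new file and write the header
!((NR-head-1)%lines) {file="output_"++count; for (i=1;i<=head;i++) print header[i] > file}
# every remaining line goes to the current output file
{print > file}
' file
One-liner:
awk -v lines=5 -v head=2 'NR<=head{header[NR]=$0; next} !((NR-head-1)%lines) {file="output_"++count; for (i=1;i<=head;i++) print header[i] > file} {print > file}' file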
For your specific sample input, with head=2 and lines=5, it produces two files:
$ cat output_1
10112
2455
121.417670 172.321300 1.704072
0.997697 0.067831 -0.000222
-0.067831 0.997697 0.000207
0.000236 -0.000191 1.000000
0.997697 0.067831 -0.000222 0
$ cat output_2
10112
2455
-0.067831 0.997697 0.000207 0
0.000236 -0.000191 1.000000 0
121.417670 172.321300 1.704072 1
To split the file at the two consecutive single-number lines instead:
awk '(!flag && NF==1) {header[1]=$1; flag=1; next} (flag && NF==1) {header[2]=$1; flag=0; file="output_"++count; printf "%d\n%d\n", header[1], header[2] > file; next} {print > file}' file
Explanation:
(!flag && NF==1) {header[1]=$1; flag=1; next}
If no flag is set and the line has a single field, assume it is the first line of the header and store it.
(flag && NF==1) {header[2]=$1; flag=0; file="output_"++count; printf "%d\n%d\n", header[1], header[2] > file; next}
If the flag is set, we have already captured the first line of the header and this is the second one. So unset the flag, generate the file name as output_ + number, and write the two stored header lines to it.
{print > file}
In all other cases, print the current line into the current file.
Given your sample file, it produces output_1 and output_2:
$ cat output_1
2
3
121.417670 172.321300 1.704072
0.997697 0.067831 -0.000222
-0.067831 0.997697 0.000207
0.000236 -0.000191 1.000000
0.997697 0.067831 -0.000222 0
-0.067831 0.997697 0.000207 0
0.000236 -0.000191 1.000000 0
121.417670 172.321300 1.704072 1
6.001686 0.812066 -1.686245 0.074434
3.001695 0.816359 -1.692300 0.087190
6.001699 0.818673 -1.694508 0.097398
2.001686 0.812066 -1.686245 0.074434
1.001695 0.816359 -1.692300 0.087190
0.001699 0.818673 -1.694508 0.097398
$ cat output_2
3
1
421.417670 172.321300 1.704072
0.997697 0.067831 -0.000222
-0.067831 0.997697 0.000207
0.000236 -0.000191 1.000000
0.997697 0.067831 -0.000222 0
-0.067831 0.997697 0.000207 0
0.000236 -0.000191 1.000000 0
421.417670 172.321300 1.704072 1
1.001686 0.812066 -1.686245 0.074434
2.001695 0.816359 -1.692300 0.087190
3.001699 0.818673 -1.694508 0.097398
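The count-based alternative from the question (each section is rows*cols+10 lines long) can also be sketched in awk. This is only a minimal sketch under assumptions: the header is always 10 lines including the two count lines, and the names sample.ptx, scan_N.ptx and the hdr variable are made up for illustration. The first part just builds a tiny two-section test file shaped like the sample above.

```shell
# Hypothetical sample: two sections with 2*3 and 3*1 points, 10 header lines each.
{
  printf '2\n3\n'
  for i in 1 2 3 4 5 6 7 8; do echo "0.997697 0.067831 -0.000222"; done
  for i in 1 2 3 4 5 6; do echo "$i.001686 0.812066 -1.686245 0.074434"; done
  printf '3\n1\n'
  for i in 1 2 3 4 5 6 7 8; do echo "0.997697 0.067831 -0.000222"; done
  for i in 1 2 3; do echo "$i.001686 0.812066 -1.686245 0.074434"; done
} > sample.ptx

# hdr = header lines per section, including the two count lines (assumed to be 10).
awk -v hdr=10 '
remaining == 0 && !pending {      # first count line: start a new output file
    if (file != "") close(file)   # release the previous file
    file = "scan_" (++count) ".ptx"
    n1 = $1; pending = 1
    print > file; next
}
pending {                         # second count line: compute how many lines are left
    remaining = n1 * $1 + hdr - 2
    pending = 0
    print > file; next
}
{ print > file; remaining-- }     # rest of the header, then the points
' sample.ptx
```

With this sample it should write scan_1.ptx with 16 lines and scan_2.ptx with 13 lines. Unlike the NF==1 test, this variant never inspects the data lines, so it would still work if a point line happened to contain a single number, but it relies on the counts in the header being correct.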