I have a very big Sequence Alignment Map (SAM) file as depicted below
@X YYYYYY ZZZZZ\
@X ssssss ddddd\
@X CCCCCC LLLLL
> FFFFFF 117 ch1 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
> FFFFFF 117 ch6 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
> FFFFFF 117 ch2 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
> FFFFFF 117 ch5 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
> FFFFFF 117 ch1 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
I want to split the file based on column 3 so I can do awk '{print > $3}' file.txt
which is working fine. Now I want to keep the lines
@X YYYYYY ZZZZZ\
@X ssssss ddddd\
@X CCCCCC LLLLL
as header on top of all the splitted files, how can I do that?
I tried this:
awk '$1 ~ /^@/ {print > $3}' file.txt
You have to keep track of whether the file is one you have seen before, and if not, write the header before you write to it for the first time.
awk '$1 ~ /^@/ { header = header $0 ORS; next }
!seen[$3]++ { printf "%s", header >$3 }
{ print > $3 }' file.txt
The internal variable ORS
usually contains a newline but it's customary to use the variable so that you only need to change the string in one place if you want to use a different output record separator.
This can run out of file handles if you have more than a couple of dozen distinct values in $3
but if your script otherwise works, I guess that's not a problem in your case.
(The brute-force fix is to close and reopen the file after each write, which makes the script much slower. A better fix if you have the memory is to collect all the results into RAM and only write when you have read all the data. A more sophisticated approach would keep a buffer of, say, 20 open file handles, and close the least recently used when you need to write to a file which isn't among them.)