I have a FASTQ file (shown below) and I want to split it by lane (lane=$2). My code splits the file correctly, but I also want the value of the $SM variable included in the output file names. Can someone please let me know what I am missing in my command?
SM="sample1"
awk 'BEGIN {FS = ":"} {lane=$2 ; print > "${SM}."lane".fastq" ; for (i = 1; i <= 3; i++) {getline ; print > "${SM}."lane".fastq"}}' < File.fastq
File.fastq
@HS2000-1015_160:7:1108:13370:100570/2
CTTGACTGCCAGAGACGCTCCTTTGCAATGCCTTCCGGTAACCAAATTTTTGGGCACAACACACAGCTGGCCTTCATTTCTTCAGGGGCTGGTAAACAGA
+
@@@ADFFFHHHFD=EF@:GHIIFHH<ECHGF@DDBB:6@D60?F=888)8='--(=5@EAE5?'(..((.;?@>>A>3;@####################
@HS2000-1015_160:5:2306:10070:71746/2
GAACCTCAAGGACTATTGGGAGAGCGGCGAGTGGGCCATCATCAAAGCCCCAGGCTACAAACACGACATCAAGTACAACTGCTGCGAGGAGATCTACCCC
+
@CCFFFFDHGHFHIJJJJJJGGGIIJJIGHI@FHGIIGHHEFGHHFFFFFBCDDDDDCDDDDDDDD;@BDCCDACDD@>ACCDDDDBDB<BA?C@CC@BD
@HS2000-1015_160:6:2116:4077:79041/2
GGTCCCCGCCTACGCCCACTGGGTTGGTGCACCTGGTGGTGGTGGCCGCCAAGAAGCTGGTGAACCGCCTCCAAGTGGCTCCCAAGACGCAGCTGGATGA
+
CCCFFFFFHHFHHJHJJJJJJJJGGHJHIGAGIIJIFHJ;@F;CHHFHFDDDDDCDDCDD9CCCDDBDDBBDDCDACDD8@BD3>?BCDBDDDACCDC@>
@HS2000-1015_160:5:2113:11446:94436/2
CGTCAGGGCCAACCCCGCCCCACCCTGACCCTACCTGGCACCCCTCACCTGTGGCCTGCCAGCACAGCCTCGCCCCTGCTGGCCAATGTGTCCCCCGTCA
+
?@@DA@DDFHH?DHI)<@@FHDBGGCHCBDH;DFA<)6.=7D;@CBCHD)).7@=>;?==AABC95<(5(5309@D########################
@HS2000-1015_160:6:2209:18284:44195/2
TAAAATGTCACAAAGCTGGAAACTCTTCCCTATCACAAACCAAAACTTAAAAGGACGTTACCTGGCTGGGTCTAAACTCCACATAACTCGCTTGCAGTTG
+
CCCFFFFEHHHGHJIIIJJIJJHIIJEHJJHIJJJIIJJIJIJJIJIIHJJIJGGHGHGIIHHIIIIHFH@DFFFDEEEECDDDCDDDDBDDBBDCDACC
@HS2000-1015_160:7:1215:18781:100685/2
ATAAAACAGTAAACAAAATAAAGTCAGTTTTTTTTTTTTTTTTTAAAGAACAAAATGAAACTTGAGGGAAAACTTCATGGAGTTACAGTTTATCCTGATA
+
CCCFFFFFHFHHFJJJJIIGIGI<CFHHIIJJJJJIJJHFDDD=ACC(38+9CB?:(>C(+:@>(4?05<?C?###########################
@HS2000-1015_160:6:1215:6292:43622/2
GGGTCCTGAGACCTGAGGGACCATTGGCCCTCTTCTGGCTTGCTTATCCTTTGTACCTGATGGCCAATGAATGTCAGAGATGGTCCTGTCTCCATCCAGT
+
BCCDFFFFHGHHHJJJIIJJJJJIJJJGIJIJJIHIIJJIEFHEIJJJJIGIGIIIIIJHFHIJJJJIHGHEC?BCEFFFEECCCEACCCCDDDDDDCCC
@HS2000-1015_160:7:2311:1291:4696/2
GATCTGGTGCTCGTATTCCATCCACCTCCCAAGCTATACATAATAACGGCCAAAGGACCTGGATGAAAGTGTCTGAAGCAGTTGTGTGTGTCTCACCTTC
+
?=?ABDDBCFDFHGGHBFCHHGD@GFDGCBDFGFFECCHHD@DDFHJEIIHGG3CE9C(7@E(.7=?;;@C?@ECA>@C3A(;A-5595<9:AC3@AC:A
@HS2000-1015_160:7:1205:18979:53766/2
TCTTGTTTTGACCAATAGTAAAGCACATTTCTCTAATTTGGATTTCTACAATATCCATATCTTGGTTTATGAAAGGTAGGGAAGAGACTTCAGGTACTGC
+
CCCFFDFFHHHHHIJIJJJIHIJHJJIJJJJIJIGIIIJJJJJHJIJJIJDHIJIIIIIJJJJIJGIJJJIIIGEEGCD@AHHFFEDFFCDDDDCCDD@C
@HS2000-1015_160:7:1205:5641:24287/2
ATAAGAAGGGAAGAATGATTAGGTGTCAAATGTTCTTTTTATTTTCTTTCAGTTCAATGCAAAAACTTTCCAGTGATTATGTAAATGCAGAATCATGTGG
+
CCCFFFFFHHHGHJIJJFJJGIGEHEHIJJJJGJGJJIJJJJJJJJJJJIJIIIJJJIJJIEHGIHGJJJJIGGGHIIIIEEEHCHHC>DFBEEA@CCCC
@HS2000-1015_160:7:1310:19879:73973/2
TTCTTGAGTTCTGATACCTGTTTCCACAATCGTTTCTGTTTCTGTTGTCTCCAGCCCATCCATGCTGTCCTCATCTTCCACTGCAGTTTTCACCCTACTT
+
@<@FFFDFHHH>FGGIJAEFHABHHIAGHAE=F@EF?FB@F:F<GGBGEHGGG9F=BGAGIIIHH;=.=CHG@CEHE3)7?=>)7@C>)(6(.6;A?ACC
@HS2000-1015_160:7:1215:4243:29984/2
ATCTACACCCAAAACAGAACTTTCACAAAAAAACTGTTGATACGAAGCTCATGAAAATCATGATGAATACTCCAACAATTAATGAATAAAACTATACAAT
+
;@@A;D;ADDFHFIIF3EG@A>ACEHE>EH=:DH@<9DB@F?B7C87'@)=)7@>@7==)7...).;?@C)6;((;(5;(>A:(:3;@3>:@>:@(4@::
@HS2000-1015_160:7:1314:6987:62989/2
ATAGCTGTCTGTTCAGAGTCTGATGTTTTCAGTAACACTCTTGATACATTAAGTGAGATAGAATGGAATCCAGCAACAAAGCTACTAAATCAGGTAACTT
+
C@CFFFFDHHHHHJIJJJBHHIIIIHJIJHGJIJJIEHGHJJIJJJJJJJJIGBGHHIJGHGIIHJJIJIIJIGIGHIGGGCHHHHBEFCCEFE>CCEEE
@HS2000-1015_160:6:1208:20370:97766/2
TTTACTTTTTCCCAAACAATAATGATGATAATGTGGCCATACTGGTGCATGAGGGCTCTTATTAAGGATAGGGGCCATGTCAGGCTCTATTGACTCCTAT
+
CCCFFFFFDHDFHJJJIJJJIIJGHJJJIIIIGHIJJIJJJIJIHIJJIIHGHIFHIFHJGIJJIJJJJJJJJHHHFFFFFEEEEEDDCDEDDDDDDCDD
@HS2000-1015_160:6:1108:20693:2521/2
CCCATTTTCTGATGAGGAAACAGGATCAGGGACATTGAGACCTACCAAAGTTACATAATACCAGTAGTAGAAATGGGACTTCAACACAGGCCTCTTGACT
+
7@@DDDDDHHHBDIGIB@F?A+AF@3+2AFE@1:BFE??HH6?BG9BD99??F49BC=88=:;F8=77/@EH=EHF9)=A>C>7?;(6@???C?>@####
@HS2000-1015_160:6:1206:11472:64908/2
AGTTTGTTGGACATTTGAGACCCCAGGAAATCCCCTTTCTCGTAACGTTCTCCGCTTGGATCTGATCTCAACAGGGTGTCGTAGTCATTCTTCAGCACAA
+
B@BDFFFFHHHHHIJGIIJIJJIJJJJGEGHHIJJJJJJIJIFFHIIHCHHIJJJGIIJH:CHHFFFFFFFEEEDD=@BDDDAB@DCDDDDDDD>CCB<?
@HS2000-1015_160:7:1114:4995:49287/2
CCTCCGCTCAGCACTGGCATTGGCATCGGTTTCTATGGCAACAGTGAGACCAGTGATGGGGTGTCCCAGCTCAGCTCTGCGCTGCTGCACGCCAACCACA
+
BCCFDFFFHHHHHJJJJJGHEIIJHIGIIFGHGIIIGHEHIIJJDHIJJJJJJEGIGGIDE:?BCEEAE@CCDCDDCDDDDDDDBCCDDD85?9BB@BDD
@HS2000-1015_160:7:1206:16723:26612/2
TTAGATATGCTGTATGTGAAGAAGAGGAGGTTAAAGAACACTGTTTTATGTAAATGTCTCATTCCTTATCCTACAGAAATTGCATTTTTAATTAAATCTT
+
BC@FFFFFHHHHHICIGGHEIGJJIJIEGHGHIJJGGIIIIJIFGIJJIIJIIIJJIIJJJJJIHHGJJGIIIIGIIIHIIFHGHFADFFFDFDE(;@CE
@HS2000-1015_160:5:2101:1745:52266/2
CCCCAGAATTCTCTTGTTTTTTCCTTGGTGATCCAGGAAAACGAAGCCCCCTCCTGTATTGACAGCTGGGAATTGTGGAGTCCACCGTCCTCCACCTGAG
+
C@CFFFFFHHHHHJIJJIJJJJJIIICHCEGIIIEHGIIHIJIGGGIJCHGIHHHGEFHHHGHEEFFDEDAC?CDDCDCD>95>:,,99@DCC?<AB9AC
Result file names I am getting:
${SM}.5.fastq
${SM}.6.fastq
${SM}.7.fastq
Result file names I want:
sample1.5.fastq
sample1.6.fastq
sample1.7.fastq
EDIT: As per the OP's comment, here is an improved solution that also changes the output file name.
SM="sample1"
awk -v sm="$SM" '
BEGIN{FS = ":"}
/^@HS/{
split($1,arr,"_")
sub(/^@[a-zA-Z]+/,"",arr[1])
lane=$2
close(outputFile)
outputFile=sm"."arr[1]"."lane".fastq"
}
{
print >> (outputFile)
}' File.fastq
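To make the file naming in the command above concrete, here is a minimal sketch that runs the same split/sub steps on a single header line taken from the sample data and prints the resulting name instead of writing to it. The shell variables header and name are only illustrative.

```shell
header='@HS2000-1015_160:7:1108:13370:100570/2'

name=$(printf '%s\n' "$header" | awk -v sm="sample1" '
BEGIN { FS = ":" }
{
    split($1, arr, "_")               # arr[1] = "@HS2000-1015", arr[2] = "160"
    sub(/^@[a-zA-Z]+/, "", arr[1])    # drop "@" and the leading letters
    print sm "." arr[1] "." $2 ".fastq"
}')

echo "$name"   # sample1.2000-1015.7.fastq
```

So with this version each output name carries the numeric part of the instrument ID as well as the lane.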
Fixing OP's attempt: Could you please try the following. You can pass a shell variable into awk with -v awk_var_name="$shell_var" (I shared a link about this in the comments section too). I have also fixed a few other things in your code.
SM="sample1"
awk -v sm="$SM" '
BEGIN{FS = ":"}
{
close(outputFile)
lane=$2
outputFile=sm "."lane".fastq"
# Append with >>: the file is closed and reopened for every record, and > would truncate it each time.
print >> (outputFile)
for (i = 1; i <= 3; i++){getline ; print >> (outputFile)}
}' File.fastq
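As a side note on why the original command produced files literally named ${SM}.5.fastq: the awk program is wrapped in single quotes, so the shell never expands ${SM}, and awk just sees those characters as part of an ordinary string. A small sketch of the difference (the variable names literal and passed are only for illustration):

```shell
SM="sample1"

# Inside single quotes the shell performs no expansion, so awk receives
# the characters ${SM} unchanged and treats them as plain string content.
literal=$(awk 'BEGIN { print "${SM}.7.fastq" }')

# -v copies the shell value into an awk variable (called sm here).
passed=$(awk -v sm="$SM" 'BEGIN { print sm ".7.fastq" }')

echo "$literal"   # ${SM}.7.fastq
echo "$passed"    # sample1.7.fastq
```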
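Fixes in the OP's attempt:
close: added to close the output file, so that we don't get a "too many open files" error.
getline: generally not recommended, so the version below avoids it and detects the start of each record from its header line instead (matched with /^@HS/; since every record is four lines long, a line-number check such as FNR%4==1 would also work).
An ideal way could be: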
SM="sample1"
awk -v sm="$SM" '
BEGIN{FS = ":"}
/^@HS/{
lane=$2
close(outputFile)
outputFile=sm "."lane".fastq"
}
{
print >> (outputFile)
}' File.fastq
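One caveat with matching /^@HS/: FASTQ quality lines may also begin with @, so a quality string that happens to start with @HS would be mistaken for a header. Below is a hedged sketch of the line-number variant, assuming every record is exactly four lines with no wrapped sequences. The file demo.fastq and its shortened records are made up for illustration, and next_file is just a helper name.

```shell
cd "$(mktemp -d)"   # scratch directory, so >> never appends to older output

# Three made-up records; the second quality line deliberately starts with "@HS".
cat > demo.fastq <<'EOF'
@HS2000-1015_160:7:1108:13370:100570/2
ACGT
+
IIII
@HS2000-1015_160:5:2306:10070:71746/2
TTGA
+
@HSI
@HS2000-1015_160:7:1215:18781:100685/2
GGCC
+
FFFF
EOF

SM="sample1"
awk -v sm="$SM" '
BEGIN { FS = ":" }
FNR % 4 == 1 {                          # first line of each four-line record
    next_file = sm "." $2 ".fastq"
    if (next_file != outputFile) {      # only switch files when the lane changes
        if (outputFile != "") close(outputFile)
        outputFile = next_file
    }
}
{ print >> (outputFile) }
' demo.fastq

wc -l sample1.5.fastq sample1.7.fastq
```

Here sample1.5.fastq should end up with 4 lines and sample1.7.fastq with 8, and the @HSI quality line stays with its record. Because the output is appended with >>, rerunning in a directory that already holds the result files would add to them, so remove old output first.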