I am cleaning up sequencing data.
I have a file with reads (names) I'd like to find with grep
and eventually remove.
The patternfile file is 219,721 lines long, with no duplicate entries. The sequence .fastq
file is 557,514,608 lines long with no names that are duplicated.
I used:
grep -f patternfile.txt sequencefile.fastq > outputfile.txt
I am expecting the output file to be identical to the patternfile (except with 1:N:0:ACTGAT
included at the end) but instead the outputfile has 135 complete extra lines (names). These extra names are not duplicates, and are not found in the patternfile. I can open the output file and identify the extra lines. An example is shown below for patternfile lines 340-342:
@NB501827:133:HMV5HAFX2:1:11101:13856:12920
@NB501827:133:HMV5HAFX2:1:11101:16016:12934
@NB501827:133:HMV5HAFX2:1:11101:19446:12943
The output file is identical up to line 341 as shown below:
@NB501827:133:HMV5HAFX2:1:11101:13856:12920 1:N:0:ACTGAT
@NB501827:133:HMV5HAFX2:1:11101:26336:12921 1:N:0:ACTGAT
@NB501827:133:HMV5HAFX2:1:11101:16016:12934 1:N:0:ACTGAT
@NB501827:133:HMV5HAFX2:1:11101:19446:12943 1:N:0:ACTGAT
Note that the erroneous extra line at line 341 @NB501827:133:HMV5HAFX2:1:11101:26336:12921 1:N:0:ACTGAT
is present, and is just one example of the other 134 extra lines.
Why am I doing this?
This is a paired end read sequencing experiment and I found 219,721 instances where the reads in the "sequencefileR2" file were a string of 75 "G" and obvious errors due to the sequencing. I was able to use grep to pull out these sequence names, and now want to remove the corresponding reads in both files (sequencefileR1 and sequencefileR2). The plan was to use the inverse flag of grep
(e.g. grep -v
) to produce sequence files without these specific sequences. I checked the grep output before producing the final files and found this issue.
What have I tried?
I have tried making sure there are no Windows (DOS) end-of-lines present.
I have tried including the 1:N:0:ACTGAT
in the patternfile
I have tried this command on three different filesystems (CentOS7, Gitbash, Cygwin) with identical results (always get exactly the same output).
I have tried egrep
I have used the patternfile individual lines 340, 341, and 342 (and also the erroneous output line) shown above, and only get one output line from the sequence file (e.g.)
grep @NB501827:133:HMV5HAFX2:1:11101:13856:12920 sequencefileR2.fastq
@NB501827:133:HMV5HAFX2:1:11101:13856:12920 1:N:0:ACTGAT
I have tried removing the @
symbol from every line of the patternfile but got identical results.
I have tried putting the grep in a loop (which didn't work, they were amateur attempts)
for pattern in 'R1-R2-names.txt'; do grep "$pattern"
L_lactis_S1_LALL_R2_001.fastq >> loopr1names; done
for pattern in 'cat R1-R2-names.txt'; do grep "$pattern"
L_lactis_S1_LALL_R2_001.fastq >> loopr1names; done
I'm open to sed
and awk
solutions but would like to understand why this simple bash solution is not working. Thanks.
Use
grep -w -F -f patternfile.txt sequencefile.fastq > outputfile.txt
-w
means to match the pattern only when it's surrounded by word boundaries. -F
means to match fixed-text patterns, not regular expressions (this is probably not significant here, as your patterns don't seem to contain any characters that have special meaning, but it's good practice).
I suspect your pattern file contains a prefix of @NB501827:133:HMV5HAFX2:1:11101:26336:12921
, so it's matching this line. The -w
option will will prevent matching these prefixes.