my first question on stack overflow and I hope you can help me.
Suppose a BAM file, from which I only want to extract the reads of a certain length (42 - 65 nt; column 10), but with the information of the remaining columns. Exemplary snippet:
VH00693:3:AAANGKTM5:1:1507:7438:26974_AGTTATAGAC 256 ENST00000438504.2 352 0 32M * 0 0 CCTGCAGGAATATGGCTCCATCTTCATGGGCG CCCCCCCCCCC;CCCCCCCCCCCCCCCCCCCC NH:i:50 HI:i:4
VH00693:3:AAANGKTM5:1:1507:7438:26974_AGTTATAGAC 256 ENST00000438504.1 352 0 32M * 0 0 CCTGCAGGAATATGGCTCCATCTTCATGGGCGCCTGCAGGAATATGGCTCCATCTTCATGGGCG CCCCCCCCCCC;CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC;CCCCCCCCCCCCCCCCCCCC NH:i:50 HI:i:4
My try was to access the BAM file (Initial.bam) with samtools view, and to search for substrings fitting to the desired read size, which are parsed into a new BAM file (Extract.bam).
samtools view -h Initial.bam | \awk 'substr($0,1,1)=="@" || ($10>=42 && $10<=65)'| \samtools view -b > Extract.bam
However, the Extract.bam only contains the extracted header section (starting with '@') of the Initial.bam. So, header extractions works, as well as parsing into a new bam file. The initial files do contain reads of the desired range, but at that point I do not know how to adapt my code snippet. Do you have any suggestions?
Found an adaptation that delivered the desired output.
samtools view -h Initial.bam | awk '{ if ( substr($0,1,1) == "@" || (length($10) >= 42 && length($10) <= 65)) {print $0} }' | samtools view -b > Extract.bam