bashextractbam

Bash: Extract reads from BAM files based on read length


my first question on stack overflow and I hope you can help me.

Suppose a BAM file, from which I only want to extract the reads of a certain length (42 - 65 nt; column 10), but with the information of the remaining columns. Exemplary snippet:

VH00693:3:AAANGKTM5:1:1507:7438:26974_AGTTATAGAC 256 ENST00000438504.2 352 0 32M * 0 0 CCTGCAGGAATATGGCTCCATCTTCATGGGCG CCCCCCCCCCC;CCCCCCCCCCCCCCCCCCCC NH:i:50 HI:i:4

VH00693:3:AAANGKTM5:1:1507:7438:26974_AGTTATAGAC 256 ENST00000438504.1 352 0 32M * 0 0 CCTGCAGGAATATGGCTCCATCTTCATGGGCGCCTGCAGGAATATGGCTCCATCTTCATGGGCG CCCCCCCCCCC;CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC;CCCCCCCCCCCCCCCCCCCC NH:i:50 HI:i:4

My try was to access the BAM file (Initial.bam) with samtools view, and to search for substrings fitting to the desired read size, which are parsed into a new BAM file (Extract.bam).

samtools view -h Initial.bam | \awk 'substr($0,1,1)=="@" || ($10>=42 && $10<=65)'| \samtools view -b > Extract.bam

However, the Extract.bam only contains the extracted header section (starting with '@') of the Initial.bam. So, header extractions works, as well as parsing into a new bam file. The initial files do contain reads of the desired range, but at that point I do not know how to adapt my code snippet. Do you have any suggestions?


Solution

  • Found an adaptation that delivered the desired output.

    samtools view -h Initial.bam | awk '{ if ( substr($0,1,1) == "@" || (length($10) >= 42 && length($10) <= 65)) {print $0} }' | samtools view -b > Extract.bam