rsubsetbioinformaticsbioconductorbam

Subset SAM/BAM file in R


I have a BAM file with lots of reads. I can load it into R with scanBam from Rsamtools.

However, I only need a subset of reads. I have a character vector with the qnames I am interested in.

scanBam returns a list with 1 element which is a list with 13 elements which contain data for all the thousands of reads.

How can I subset this object by qname preserving the structure? I was not able to find anything in the manual or online.


Solution

  • It's probably more convenient to input the data with GenomicAlignments::readGAlignments, including the qname by specifying param=ScanBamParam(what="qname") as an argument. You could then subset with %in%. Here's a more complete example, using one of the ExperimentData packages

    library(GenomicAlignments)
    library(RNAseqData.HNRNPC.bam.chr14)
    
    fname <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]    
    want <- c("ERR127306.11930623", "ERR127306.24720935",
        "ERR127306.23706011", "ERR127306.22418829", "ERR127306.13372247",
        "ERR127306.20686334", "ERR127306.11412145", "ERR127306.4711647",
        "ERR127306.7479582", "ERR127306.12737243")
    aln <- readGAlignments(fname, param=ScanBamParam(what="qname"))
    aln[mcols(aln)$qname %in% want]
    

    BAM files are of course big, and the qnames are a big part of that; it often makes sense to iterate through the file in chunks. This is enabled (in the current Rsamtools) with yieldReduce, where one provides BamFile with yieldSize set to a reasonable (e.g., 1M) number of reads, a MAP function to input a chunk of data and process it (e.g., filtering the unwanted reads), an (optional) REDUCE function to concatenate results, and an (optional) DONE function to indicate when the iteration is done. The solution looks like (yieldSize is artificially small, to allow illustration with the sample data):

    bfl <- BamFile(fname, yieldSize=100000)  ## larger, e.g., 1M-5M
    MAP <- function(bfl, want) {
        ## message("tick")
        aln <- readGAlignments(bfl, param=ScanBamParam(what="qname"))
        if (length(aln) == 0)
            NA                          # semaphore -- DONE
        else
            aln[mcols(aln)$qname %in% want]
    }
    REDUCE <- c
    DONE <- function(x) identical(x, NA)
    result <- yieldReduce(bfl, MAP, REDUCE, DONE, want=want)
    

    One could adopt a similar approach using scanBam, but the data structure (list-of-lists) is more complicated to deal with:

    x <- scanBam(fname, param=ScanBamParam(what=c("qname", "pos")))
    keep <- lapply(lapply(x, "[[", "qname"), "%in%", want)
    result <- Map(function(elts, keep) {
        lapply(elts, "[", keep)
    }, x, keep)
    

    This could also be used with yieldReduce.

    If you're interested in creating a new bam file with the filtered reads, then

    filter_factory <- function(want) {
        list(KeepQname = function(x) x$qname %in% want)
    }
    filter <- FilterRules(filter_factory(want))
    dest <- filterBam(fname, tempfile(), filter=filter,
                      param=ScanBamParam(what="qname"))
    readGAlignments(dest)