I have a data.frame
of linear intervals (genomic coordinates of mapped RNA-seq reads), for example:
df <- data.frame(seqnames = c(rep("chr10",2),rep("chr5",8)),
start = c(12255935,12257004,12243635,12244009,12253879,12254395,12254506,12255142,12255229,12258719),
end = c(12257002,12258512,12243764,12244291,12254107,12254501,12254515,12255535,12255312,12258764),
read_id = c(rep("R9",2),rep("R10",8)),
stringsAsFactors = F)
For some reads there are intervals that are contained or are intersected within others of the same read, and I'd like to merge them. In the example above for read_id = "R10"
, interval: chr5 12255229 12255312
is contained within interval chr5 12255142 12255535
.
For a single read data.frame
, I use this procedure:
#defining helper functions
clusterHits <- function(overlap.hits)
{
overlap.hits <- GenomicRanges::union(overlap.hits,t(overlap.hits))
query.hits <- S4Vectors::queryHits(overlap.hits)
search.hits <- S4Vectors::subjectHits(overlap.hits)
cluster.ids <- seq_len(S4Vectors::queryLength(overlap.hits))
while(TRUE){
hit <- S4Vectors::Hits(query.hits,cluster.ids[search.hits],S4Vectors::queryLength(overlap.hits),S4Vectors::subjectLength(overlap.hits))
tmp.cluster.ids <- pmin(cluster.ids,S4Vectors::selectHits(hit,"first"))
if(identical(tmp.cluster.ids,cluster.ids))
break
cluster.ids <- tmp.cluster.ids
}
unname(S4Vectors::splitAsList(seq_len(S4Vectors::queryLength(overlap.hits)),cluster.ids))
}
mergeConnectedRanges <- function(x.gr,overlap.hits)
{
cluster.ids <- clusterHits(overlap.hits)
merged.gr <- range(IRanges::extractList(x.gr,cluster.ids))
merged.gr <- unlist(merged.gr)
S4Vectors::mcols(merged.gr)$merged.idx <- cluster.ids
return(merged.gr)
}
#Now separate R10 and merge its intervals
df1 <- dplyr::filter(df, read_id == "R10")
gr <- GenomicRanges::GRanges(dplyr::select(df1,seqnames,start,end))
redundant.intervals <- GenomicRanges::findOverlaps(gr,ignore.strand=T)
query.gr <- redundant.intervals[S4Vectors::queryHits(redundant.intervals)]
subject.gr <- redundant.intervals[S4Vectors::subjectHits(redundant.intervals)]
as.data.frame(mergeConnectedRanges(x.gr=gr,overlap.hits=redundant.intervals))
Which gives:
seqnames start end width strand merged.idx
1 chr5 12243635 12243764 130 * 1
2 chr5 12244009 12244291 283 * 2
3 chr5 12253879 12254107 229 * 3
4 chr5 12254395 12254501 107 * 4
5 chr5 12254506 12254515 10 * 5
6 chr5 12255142 12255535 394 * 6, 7
7 chr5 12258719 12258764 46 * 8
So the merged.idx
shows that intervals 6 and 7 in df1
have been merged.
I'm looking for a fast way of doing this across thousands of reads. The obvious way is to use do.call
across the unique reads in df
:
library(dplyr)
do.call(rbind, lapply(unique(df$read_id), function(r){
read.df <- dplyr::filter(df, read_id == r)
gr <- GenomicRanges::GRanges(dplyr::select(read.df,seqnames,start,end))
redundant.intervals <- GenomicRanges::findOverlaps(gr,ignore.strand=T)
query.gr <- redundant.intervals[S4Vectors::queryHits(redundant.intervals)]
subject.gr <- redundant.intervals[S4Vectors::subjectHits(redundant.intervals)]
as.data.frame(mergeConnectedRanges(x.gr=gr,overlap.hits=redundant.intervals)) %>%
dplyr::mutate(read_id = r)
}))
But I'm wondering if there's any faster way. Note that the fraction of reads that actually have such intersecting intervals is relatively small.
Using the GenomicRanges
package from the Bioconductor repository, the task can be accomplished by a few lines of code:
library(GenomicRanges)
makeGRangesListFromDataFrame(df, split.field = "read_id") |>
reduce(with.revmap = TRUE) |>
as.data.frame()
group group_name seqnames start end width strand revmap 1 1 R10 chr5 12243635 12243764 130 * 1 2 1 R10 chr5 12244009 12244291 283 * 2 3 1 R10 chr5 12253879 12254107 229 * 3 4 1 R10 chr5 12254395 12254501 107 * 4 5 1 R10 chr5 12254506 12254515 10 * 5 6 1 R10 chr5 12255142 12255535 394 * 6, 7 7 1 R10 chr5 12258719 12258764 46 * 8 8 2 R9 chr10 12255935 12257002 1068 * 1 9 2 R9 chr10 12257004 12258512 1509 * 2
As the GenomeRanges
package is not on CRAN, please, see the vignette Installing and Managing Bioconductor Packages or run
install.packages("BiocManager")
BiocManager::install("GenomicRanges")
df <- data.frame(seqnames = c(rep("chr10", 2), rep("chr5", 8)),
start = c(12255935, 12257004, 12243635, 12244009, 12253879, 12254395, 12254506, 12255142, 12255229, 12258719),
end = c(12257002, 12258512, 12243764, 12244291, 12254107, 12254501, 12254515, 12255535, 12255312, 12258764),
read_id = c(rep("R9", 2), rep("R10", 8)),
stringsAsFactors = FALSE)
As the