I posted a question here: Matched Range Merge in R about merging two files based on a number in one file falling into a range in the second file. Thus far, I have been unsuccessful in piecing together code to accomplish this. The issue I am having is that the code I'm using compares the files line by line. This is a problem because 1.) One file is much longer than the other file, and 2.) I need the lines in the shorter file to be scanned through every range pair in the longer file - not just the range in the same row.
I have been working with the functions posted in the original question, and I feel like there should be a way to apply it to a more general loop that compares every line in the first file to each line in the second file, but I haven't figured it out yet. If anyone has any suggestions, I would appreciate it.
**** EDITED.
The nature of the data is this: Each range is not necessarily unique, although most are. They are also not of equal size, and some fall completely within others. findInterval
therefore produces an error, because the ranges cannot be sorted in order to fall in "non-decending" order.
Here are first 6 lines of each data frame:
file1test <- data.frame(SNP=c("rs2343", "rs211", "rs754", "rs854", "rs343", "rs626"), BP=c(860269, 369640, 861822, 367934, 706940, 717244))
file2 <- data.frame(Gene=c("E613", "E92", "E49", "E3543", "E11", "E233"), BP_start=c(367640, 621059, 721320, 860260, 861322, 879584), BP_end = c(368634, 622053, 722513, 879955, 879533, 894689))
So, as you can see, the range on the 5th line lies within the range on the 4th line, and two SNPs from the first file fall within the range on the 4th line, but only one falls within the range on the second line.
The first file, which contains the SNPs, has only ~400 lines. However the second file, containing the ranges, has about 20K. What I would like to produce as an output is a data frame containing the lines from the first file (the SNPs) with BPs that fall into the BP range in the second file. If a SNP falls into two ranges, then it would appear twice, etc.
The GenomicRanges package in Bioconductor is designed for this type of operation. Read your data in with, e.g., read.delim so that
con <- textConnection("SNP BP
rs064 12292
rs319 345367
rs285 700042")
snps <- read.delim(con, head=TRUE, sep="")
con <- textConnection("Gene BP_start BP_end
E613 345344 363401
E92 694501 705408
E49 362370 368340") ## missing trailing digit on BP_end??
genes <- read.delim(con, head=TRUE, sep="")
then create 'IRanges' out of each
library(IRanges)
isnps <- with(snps, IRanges(BP, width=1, names=SNP))
igenes <- with(genes, IRanges(BP_start, BP_end, names=Gene))
(pay attention to coordinate systems, IRanges expects start and end to be included in the range; also, end >= start expect for 0-width ranges when end = start - 1). Then find the SNPs ('query' in IRanges terminology) that overlap the genes ('subject')
olaps <- findOverlaps(isnps, igenes)
two of the SNPs overlap
> queryHits(olaps)
[1] 2 3
and they overlap the first and second genes
> subjectHits(olaps)
[1] 1 2
If a query overlapped multiple genes, it would have been repeated in queryHits (and vice versa). You could then merge your data frames as
> cbind(snps[queryHits(olaps),], genes[subjectHits(olaps),])
SNP BP Gene BP_start BP_end
2 rs319 345367 E613 345344 363401
3 rs285 700042 E92 694501 705408
Usually genes and SNPs have chromosome and strand ('+', '-', or '*' to indicate that strand isn't important) information, and you'd want to do overlaps in the context of these; instead of creating 'IRanges' instances, you'd create 'GRanges' (genomic ranges) and the subsequent book-keeping would be taken care of for you
library(GenomicRanges)
isnps <-
with(snps, GRanges("chrA", IRanges(BP, width=1, names=SNP), "*")
igenes <-
with(genes, GRanges("chrA", IRanges(BP_start, BP_end, names=Gene), "+"))