Merge by Range in R - Applying Loops

I posted a question here: Matched Range Merge in R about merging two files based on a number in one file falling into a range in the second file. Thus far, I have been unsuccessful in piecing together code to accomplish this. The issue I am having is that the code I'm using compares the files line by line. This is a problem because 1.) One file is much longer than the other file, and 2.) I need the lines in the shorter file to be scanned through every range pair in the longer file - not just the range in the same row.

I have been working with the functions posted in the original question, and I feel like there should be a way to apply it to a more general loop that compares every line in the first file to each line in the second file, but I haven't figured it out yet. If anyone has any suggestions, I would appreciate it.

**** EDITED.

The nature of the data is this: Each range is not necessarily unique, although most are. They are also not of equal size, and some fall completely within others. findInterval therefore produces an error, because the ranges cannot be sorted in order to fall in "non-decending" order.

Here are first 6 lines of each data frame:

file1test <- data.frame(SNP=c("rs2343", "rs211", "rs754", "rs854", "rs343", "rs626"), BP=c(860269, 369640, 861822, 367934, 706940, 717244))


file2 <- data.frame(Gene=c("E613", "E92", "E49", "E3543", "E11", "E233"), BP_start=c(367640, 621059, 721320, 860260, 861322, 879584), BP_end = c(368634, 622053, 722513, 879955, 879533, 894689))

So, as you can see, the range on the 5th line lies within the range on the 4th line, and two SNPs from the first file fall within the range on the 4th line, but only one falls within the range on the second line.

The first file, which contains the SNPs, has only ~400 lines. However the second file, containing the ranges, has about 20K. What I would like to produce as an output is a data frame containing the lines from the first file (the SNPs) with BPs that fall into the BP range in the second file. If a SNP falls into two ranges, then it would appear twice, etc.

Solution

The GenomicRanges package in Bioconductor is designed for this type of operation. Read your data in with, e.g., read.delim so that

con <- textConnection("SNP     BP
rs064   12292
rs319   345367
rs285   700042")
snps <- read.delim(con, head=TRUE, sep="")

con <- textConnection("Gene    BP_start    BP_end
E613    345344      363401
E92     694501      705408
E49     362370      368340") ## missing trailing digit on BP_end??
genes <- read.delim(con, head=TRUE, sep="")

then create 'IRanges' out of each

library(IRanges)
isnps <- with(snps, IRanges(BP, width=1, names=SNP))
igenes <- with(genes, IRanges(BP_start, BP_end, names=Gene))

(pay attention to coordinate systems, IRanges expects start and end to be included in the range; also, end >= start expect for 0-width ranges when end = start - 1). Then find the SNPs ('query' in IRanges terminology) that overlap the genes ('subject')

olaps <- findOverlaps(isnps, igenes)

two of the SNPs overlap

> queryHits(olaps)
[1] 2 3

and they overlap the first and second genes

> subjectHits(olaps)
[1] 1 2

If a query overlapped multiple genes, it would have been repeated in queryHits (and vice versa). You could then merge your data frames as

> cbind(snps[queryHits(olaps),], genes[subjectHits(olaps),])
    SNP     BP Gene BP_start BP_end
2 rs319 345367 E613   345344 363401
3 rs285 700042  E92   694501 705408

Usually genes and SNPs have chromosome and strand ('+', '-', or '*' to indicate that strand isn't important) information, and you'd want to do overlaps in the context of these; instead of creating 'IRanges' instances, you'd create 'GRanges' (genomic ranges) and the subsequent book-keeping would be taken care of for you

library(GenomicRanges)
isnps <-
    with(snps, GRanges("chrA", IRanges(BP, width=1, names=SNP), "*")
igenes <-
    with(genes, GRanges("chrA", IRanges(BP_start, BP_end, names=Gene), "+"))