I am trying to develop a function in R that outputs random positions within a list of given intervals. My intervals file (14,600 lines) is a tab-delimited bed file (columns: chromosome, start, end, name) that looks like this:
1 4953 16204 1
1 16284 16612 1
1 16805 17086 1
1 18561 18757 1
1 18758 19040 1
1 19120 19445 1
Currently my function will generate N random positions within these intervals:
sim_dat <- bpSim(N=10)
head(sim_dat)
  seqnames    start      end width strand
1        1 22686939 22686939     1      *
2        1 14467770 14467770     1      *
3        2 10955472 10955472     1      *
4        X   823201   823201     1      *
5        6 10421738 10421738     1      *
6       17 21827745 21827745     1      *
library(GenomicRanges)
library(rtracklayer)
bpSim <- function(intervals = "intervals.bed", N = 100, write = FALSE) {
  intFile <- import.bed(intervals)
  # total number of bases covered by the intervals
  space <- sum(width(intFile))
  positions <- sample(c(1:space), N)
  cat("Simulating", N, "breakpoints", sep = " ", "\n")
  # expand every interval into one width-1 range per base
  new_b <- GRanges(
    seqnames = as.character(rep(seqnames(intFile), width(intFile))),
    ranges = IRanges(start = unlist(mapply(seq, from = start(intFile), to = end(intFile))), width = 1)
  )
  # keep only the sampled positions
  bedOut <- new_b[positions]
  if (write) {
    export.bed(new_b[positions], "simulatedBPs.bed")
  }
  remove(new_b)
  return(data.frame(bedOut))
}
This works; however, as I'm not particularly familiar with the GenomicRanges package, it's something I've rather hacked together. I would much prefer to rewrite it using base R or packages from the tidyverse, so that I can adjust it to, for example, allow the user to specify the chromosome. It also takes a long time, even for N=10:
system.time(sim_dat <- bpSim(N=10))
Simulating 10 breakpoints
user system elapsed
10.689 3.267 13.970
Ultimately, I'm trying to simulate random positions in a genome, and would therefore need to simulate data hundreds of times for each N. I would greatly appreciate any advice on how I can speed this up and simplify it so that it doesn't depend on GenomicRanges.
In addition, if anyone knows of a package that does this already, I would much rather use an existing package than re-invent the wheel.
With ranges being different lengths, I'm assuming you want the chance of a position falling in a given range to be proportional to the length of that range. In other words, selection is uniform over the actual base pairs covered by the ranges. Otherwise you'll be over-representing small ranges (higher marker density) and under-representing large ranges (lower marker density).
Here's a data.table solution that can do one thousand sites pretty much instantly, and one million random sites in about 10 seconds on my machine. It randomly samples the number of sites you want, first by sampling rows (weighted by each row's range size), then sampling uniformly within that range.
library(data.table)
nSites <- 1e4
bed <- data.table(chromosome=1, start=c(100,1050,3600,4000,9050), end=c(1000,3000,3700,8000,20000))
# calculate size of range
bed[, size := 1 + end-start]
# Randomly sample bed file rows, proportional to the length of each range
simulated.sites <- bed[sample(.N, size=nSites, replace=TRUE, prob=bed$size)]
# Randomly sample uniformly within each chosen range
simulated.sites[, position := sample(start:end, size=1), by=1:nrow(simulated.sites)]
# Remove extra columns and format as needed
simulated.sites[, start := position]
simulated.sites[, end := position]
simulated.sites[, c("size", "position") := NULL]
This starts with a table such as:
 chromosome start   end  size
          1   100  1000   901
          1  1050  3000  1951
          1  3600  3700   101
          1  4000  8000  4001
          1  9050 20000 10951
With an output such as:
       chromosome start   end
    1:          1 10309 10309
    2:          1  4578  4578
    3:          1  1984  1984
    4:          1 14703 14703
    5:          1 10090 10090
   ---
 9996:          1  1601  1601
 9997:          1  5317  5317
 9998:          1 18918 18918
 9999:          1  1154  1154
10000:          1  7343  7343
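To address the part of the question about letting the user specify a chromosome, here is a rough sketch of how the above could be wrapped into a function. The function name (bpSimDT), the use of fread, and the column names are my assumptions rather than anything from the original code, and it swaps the per-row sample() call for a single vectorised draw, which should help if you're running this hundreds of times:
library(data.table)

# Sketch only: bpSimDT, the use of fread, and the column names below are
# assumptions, not taken from the original post.
bpSimDT <- function(intervals = "intervals.bed", N = 100, chrom = NULL) {
  bed <- fread(intervals, col.names = c("chromosome", "start", "end", "name"))
  if (!is.null(chrom)) bed <- bed[chromosome == chrom]   # optional chromosome filter
  bed[, size := 1 + end - start]
  # sample rows weighted by range size, then a uniform offset within each range
  sim <- bed[sample(nrow(bed), size = N, replace = TRUE, prob = bed$size)]
  sim[, position := start + as.integer(floor(runif(.N) * size))]  # vectorised; no per-row sample()
  sim[, c("start", "end") := .(position, position)]
  sim[, c("size", "position", "name") := NULL]
  sim[]
}

# e.g. one million positions restricted to chromosome 1
# sim_dat <- bpSimDT("intervals.bed", N = 1e6, chrom = 1)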