I apologize if this is formatted incorrectly or if I am missing any information that would be helpful. I am attempting to run a for loop with a nested if statement for a couple of large datasets. The code is:
ID_index <- data.frame()
for (x in 1:length(peaks)) {
  # signed error (ppm) of this peak against every theoretical mass
  z <- ((mass_combo - peaks[x]) / mass_combo) * 10^6
  if (length(which(abs(z) < 1)) > 0) {
    ID_index[nrow(ID_index) + 1, 1] <- peaks[x]
    ID_index[nrow(ID_index), 2] <- toString(which(abs(z) < 1))
  }
}
The vector peaks is a numeric vector of ~130,000 values, and mass_combo is a numeric vector of over 150,000,000 values. What I am trying to do is record, for each experimental value (peaks), which of the possible theoretical values (mass_combo) it matches within a given error margin (z, in ppm). The theoretical values were previously calculated by taking all possible combinations of components using the expand.grid function.
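For illustration, a minimal sketch of how such a theoretical mass list can be built this way (the component masses below are made up; the real ones are not shown here):

# hypothetical component masses, purely for illustration
components <- list(
  frag_a = c(0, 12.0000, 24.0000),
  frag_b = c(0, 1.00794, 2.01588),
  core   = c(154.1161, 170.0530)
)
# expand.grid enumerates every combination; row sums give the theoretical masses
mass_combo <- rowSums(expand.grid(components))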
I had this loop running for approximately 2 hours and it only completed about 5,700 iterations. An example of the input and expected output:
peaks <- c(178.1161, 182.0530, 186.1223)
mass_combo <- c(154.1161, 166.1161, 178.1161, 190.1161, 202.1161, 214.1161, 226.1161, 170.053,
                182.053, 194.053, 206.053, 709.0452, 721.0452, 182.0530, 194.0530, 186.1223,
                198.1223, 210.1223)
ID_index
        V1    V2
1 178.1161     3
2 182.0530 9, 14
3 186.1223    16
Is there a better way to index and identify the theoretical-value matches for each experimental result? Any assistance or suggestions would be greatly appreciated.
After some algebra to rearrange the comparison, the operation can be implemented as a data.table non-equi join, which will be much faster than looping: for a positive theoretical mass m and peak p, abs((m - p) / m) * 1e6 < 1 is equivalent to m - m/1e6 < p < m + m/1e6, so each theoretical mass defines an interval that the peaks can be joined against.
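A quick sanity check of that rearrangement on a pair of values (just a sketch, assuming positive masses as here):

m <- 178.1161
p <- c(178.11615, 178.1180)        # ~0.3 ppm and ~10.7 ppm away from m
abs((m - p) / m) * 1e6 < 1         # the original ppm condition
#> [1]  TRUE FALSE
(m - m/1e6 < p) & (p < m + m/1e6)  # the rearranged interval condition
#> [1]  TRUE FALSE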
Demonstrating on the example data:
library(data.table)

data.table(x = mass_combo)[,`:=`(min = x - x/1e6, max = x + x/1e6, r = .I)][
  data.table(y = peaks), .(peak = y, r), on = .(min < y, max > y)
][,.(idx = .(r)), peak]
#> peak idx
#> <num> <list>
#> 1: 178.1161 3
#> 2: 182.0530 9,14
#> 3: 186.1223 16
Or, to get the mass_combo values instead of the indices, per the comments:
data.table(x = mass_combo)[,`:=`(min = x - x/1e6, max = x + x/1e6)][
  data.table(y = peaks), .(peak = y, x), on = .(min < y, max > y)
][,.(mass_combo = .(x)), peak]
#> peak mass_combo
#> <num> <list>
#> 1: 178.1161 178.1161
#> 2: 182.0530 182.053,182.053
#> 3: 186.1223 186.1223
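As an aside (not part of the original answer): if one row per match is easier to work with downstream than a list column, the final aggregation step can simply be dropped, renaming the joined column instead:

long <- data.table(x = mass_combo)[,`:=`(min = x - x/1e6, max = x + x/1e6)][
  data.table(y = peaks), .(peak = y, mass_combo = x), on = .(min < y, max > y)
]
# one row per (peak, matched mass) pair; unmatched peaks appear with
# NA in mass_combo (add nomatch = NULL to the join to drop them)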
Timing on simulated data with the approximate sizes of your actual vectors:
peaks <- runif(13e4)
mass_combo <- runif(15e7)
system.time(
  ID_index <-
    data.table(x = mass_combo)[,`:=`(min = x - x/1e6, max = x + x/1e6, r = .I)][
      data.table(y = peaks), .(peak = y, r), on = .(min < y, max > y)
    ][,.(idx = .(r)), peak]
)
#> user system elapsed
#> 71.22 4.47 25.64
The first 10 rows:
ID_index[1:10]
#> peak idx
#> <num> <list>
#> 1: 0.54024270 267797,2733918,3335681,3499464,3800516,4405793,...
#> 2: 0.52608405 263261, 383398,1462884,1510015,2504774,2521597,...
#> 3: 0.41793366 847991,1156169,1213597,1489392,1680001,3838945,...
#> 4: 0.24143285 512967,1296863,3515676,4526957,5952832,6020254,...
#> 5: 0.89169524 285889, 785031, 829258, 966923,1044726,2564670,...
#> 6: 0.09785744 20626427,23309899,23498292,23712990,23870867,35958767,...
#> 7: 0.07118160 681958, 7164888,10311600,11386095,12753834,34130008,...
#> 8: 0.27893975 1378505,2577870,3153625,3320806,5095727,9098257,...
#> 9: 0.01651383 1503323, 5273696,27981236,30451611,42380615,42807730,...
#> 10: 0.61528940 2191889,2570098,2979545,3398146,4041039,4652609,...
Compare against a looping approach on vectors that are 10x smaller:
peaks <- runif(13e3)
mass_combo <- runif(15e6)

system.time({ # from @ThomasIsCoding
  subset(
    data.frame(
      V1 = peaks,
      V2 = unlist(
        lapply(
          peaks,
          function(x) toString(which(abs(1 - x / mass_combo) < 1e-6))
        ),
        recursive = FALSE, use.names = FALSE
      )
    ),
    nzchar(V2) # keep only the peaks with at least one match
  )
})
#> user system elapsed
#> 920.64 542.20 1463.61
system.time(
  data.table(x = mass_combo)[,`:=`(min = x - x/1e6, max = x + x/1e6, r = .I)][
    data.table(y = peaks), .(peak = y, r), on = .(min < y, max > y)
  ][,.(idx = .(r)), peak]
)
#> user system elapsed
#> 5.15 0.26 2.14
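For completeness, the same interval idea can also be expressed in dependency-free base R by sorting mass_combo once and binary-searching each peak's candidate window with findInterval. This is a sketch under the same positive-mass assumption, not benchmarked here:

# m - m/1e6 < p < m + m/1e6 rearranges (for positive m) to
# p / (1 + 1e-6) < m < p / (1 - 1e-6), a narrow window in sorted order
o <- order(mass_combo)
s <- mass_combo[o]
ppm_match <- function(p) {
  lo <- findInterval(p / (1 + 1e-6), s) + 1L # first sorted mass above the lower bound
  hi <- findInterval(p / (1 - 1e-6), s)      # last sorted mass not above the upper bound
  if (lo > hi) integer(0) else o[lo:hi]      # map back to original positions
}
idx_list <- lapply(peaks, ppm_match)

Each lookup is O(log n) after the one-off sort, so the run time is dominated by sorting the masses. (Equality at the upper bound counts as a match here, where the strict inequality would exclude it; for floating-point measured data the difference is immaterial.)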