r, performance, for-loop, large-data

How can I improve this for loop to index specific elements of a vector with a large dataset?


I apologize if this is formatted incorrectly or if I am missing any information that would be helpful. I am attempting to run a for loop with a nested if statement over a couple of large datasets. The code is:

ID_index <- data.frame()
for (x in 1:length(peaks)) {
  # ppm difference between every theoretical mass and the current peak
  z <- ((mass_combo - peaks[x])/mass_combo)*10^6
  # record the peak and the indices of theoretical masses within 1 ppm
  if (length(which(abs(z) < 1)) > 0) {
    ID_index[nrow(ID_index)+1, 1] <- peaks[x]
    ID_index[nrow(ID_index), 2] <- toString(which(abs(z) < 1))
  }
}

The vector peaks is a numeric vector of ~130,000 values. The vector mass_combo is a numeric vector with over 150,000,000 values. What I am trying to do is record, for each experimental value in peaks, which theoretical values in mass_combo it matches within a given error margin (z, in ppm). The theoretical values were previously calculated by taking all possible combinations of components using the expand.grid function.
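For context, the theoretical masses were generated along these lines (the component names and masses below are just placeholders to illustrate the idea, not my actual values):

# hypothetical illustration of how mass_combo was built with expand.grid
components <- list(
  a = c(0, 18.0106, 36.0212),  # placeholder masses for component a
  b = c(0, 44.0262, 88.0524),  # placeholder masses for component b
  c = c(0, 162.0528)           # placeholder masses for component c
)
combos <- expand.grid(components)  # every combination of component masses
mass_combo <- rowSums(combos)      # total theoretical mass of each combination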

I had this loop running for approximately 2 hours and it only got through about 5,700 iterations. An example of the input and output would be:

peaks <- c(178.1161, 182.0530, 186.1223)
mass_combo <- c(154.1161, 166.1161, 178.1161, 190.1161, 202.1161, 214.1161, 226.1161, 170.053, 
                182.053, 194.053, 206.053, 709.0452, 721.0452, 182.0530, 194.0530, 186.1223, 
                198.1223, 210.1223)

ID_index
V1          V2
178.1161    3
182.0530    9, 14
186.1223    16

Is there a better way to perform this indexing and identify the theoretical value matches for each experimental result? Any assistance or suggestions would be greatly appreciated.


Solution

  • After some algebra to rearrange the comparison (abs((mass_combo - peaks[x])/mass_combo)*1e6 < 1 is equivalent to mass_combo - mass_combo/1e6 < peaks[x] < mass_combo + mass_combo/1e6), the operation can be implemented as a data.table non-equi join on those lower and upper bounds, which will be much faster than looping.

    Demonstrating on the example data:

    library(data.table)
    # add ppm bounds and a row index to each theoretical mass, then non-equi join the peaks
    data.table(x = mass_combo)[,`:=`(min = x - x/1e6, max = x + x/1e6, r = .I)][
      data.table(y = peaks), .(peak = y, r), on = .(min < y, max > y)
    ][,.(idx = .(r)), peak]
    #>        peak    idx
    #>       <num> <list>
    #> 1: 178.1161      3
    #> 2: 182.0530   9,14
    #> 3: 186.1223     16
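
    If the original two-column layout of ID_index (the peak plus a comma-separated string of matching indices) is preferred, the list column can be flattened afterwards; a minimal sketch, assuming the join result above has been assigned to a data.table called res (a name used here only for illustration):

    # collapse the list column of indices into a comma-separated string,
    # matching the V1/V2 layout produced by the original loop
    res[, .(V1 = peak, V2 = sapply(idx, toString))]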
    

    Or to get the mass_combo values instead of indices, per the comments:

    data.table(x = mass_combo)[,`:=`(min = x - x/1e6, max = x + x/1e6)][
      data.table(y = peaks), .(peak = y, x), on = .(min < y, max > y)
    ][,.(mass_combo = .(x)), peak]
    #>        peak      mass_combo
    #>       <num>          <list>
    #> 1: 178.1161        178.1161
    #> 2: 182.0530 182.053,182.053
    #> 3: 186.1223        186.1223
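
    If both the row indices and the matching masses are wanted at once, the two can be combined in a single join; a sketch using the same bounds as above:

    # keep both the row index (r) and the theoretical mass (x) for each match
    data.table(x = mass_combo)[,`:=`(min = x - x/1e6, max = x + x/1e6, r = .I)][
      data.table(y = peaks), .(peak = y, r, x), on = .(min < y, max > y)
    ][,.(idx = .(r), mass_combo = .(x)), peak]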
    

    Timing on a dataset with the approximate sizes of your actual data:

    peaks <- runif(13e4)
    mass_combo <- runif(15e7)
    
    system.time(
      ID_index <-
        data.table(x = mass_combo)[,`:=`(min = x - x/1e6, max = x + x/1e6, r = .I)][
          data.table(y = peaks), .(peak = y, r), on = .(min < y, max > y)
        ][,.(idx = .(r)), peak]
    )
    #>    user  system elapsed 
    #>   71.22    4.47   25.64
    

    The first 10 rows:

    ID_index[1:10]
    #>           peak                                                       idx
    #>          <num>                                                    <list>
    #>  1: 0.54024270        267797,2733918,3335681,3499464,3800516,4405793,...
    #>  2: 0.52608405        263261, 383398,1462884,1510015,2504774,2521597,...
    #>  3: 0.41793366        847991,1156169,1213597,1489392,1680001,3838945,...
    #>  4: 0.24143285        512967,1296863,3515676,4526957,5952832,6020254,...
    #>  5: 0.89169524        285889, 785031, 829258, 966923,1044726,2564670,...
    #>  6: 0.09785744 20626427,23309899,23498292,23712990,23870867,35958767,...
    #>  7: 0.07118160   681958, 7164888,10311600,11386095,12753834,34130008,...
    #>  8: 0.27893975       1378505,2577870,3153625,3320806,5095727,9098257,...
    #>  9: 0.01651383  1503323, 5273696,27981236,30451611,42380615,42807730,...
    #> 10: 0.61528940       2191889,2570098,2979545,3398146,4041039,4652609,...
    

    Compare against a looping approach on vectors that are 10x smaller:

    peaks <- runif(13e3)
    mass_combo <- runif(15e6)
    
    system.time({ # from @ThomasIsCoding
      subset(
        data.frame(
          V1 = peaks,
          V2 = unlist(
            lapply(
              peaks,
              function(x) {
                toString(which(abs(1 - x / mass_combo) < 1e-6))
              }
            ), FALSE, FALSE
          )
        ),
        nzchar(V2)
      )
    })
    #>    user  system elapsed 
    #>  920.64  542.20 1463.61
    
    system.time(
      data.table(x = mass_combo)[,`:=`(min = x - x/1e6, max = x + x/1e6, r = .I)][
        data.table(y = peaks), .(peak = y, r), on = .(min < y, max > y)
      ][,.(idx = .(r)), peak]
    )
    #>    user  system elapsed 
    #>    5.15    0.26    2.14