r, performance, for-loop, large-data

How can I improve this for loop to index specific elements of a vector with a large dataset?


I apologize if this is formatted incorrectly or if I am missing any information that would be helpful. I am attempting to run a for loop with a nested if statement over a couple of large datasets. The code is:

ID_index <- data.frame()
for (x in 1:length(peaks)) {
  # ppm difference between every theoretical mass and the current peak
  z <- ((mass_combo - peaks[x])/mass_combo)*10^6
  # record the peak and the indices of theoretical masses within 1 ppm
  if (length(which(abs(z) < 1)) > 0) {
    ID_index[nrow(ID_index)+1, 1] <- peaks[x]
    ID_index[nrow(ID_index), 2] <- toString(which(abs(z) < 1))
  }
}

The vector peaks is a numeric vector of ~130,000 values. The vector mass_combo is a numeric vector with over 150,000,000 values. What I am trying to do is record, for each experimental value in peaks, which theoretical values in mass_combo it matches within a given error margin (z, in ppm). The theoretical values were previously calculated by taking all possible combinations of components using the expand.grid function.
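For context, the theoretical masses were generated along these lines (the component names and masses below are just placeholders to illustrate the idea, not my actual values):

# hypothetical illustration of how mass_combo was built with expand.grid
components <- list(
  a = c(0, 18.0106, 36.0212),  # placeholder masses for component a
  b = c(0, 44.0262, 88.0524),  # placeholder masses for component b
  c = c(0, 162.0528)           # placeholder masses for component c
)
combos <- expand.grid(components)  # every combination of component masses
mass_combo <- rowSums(combos)      # total theoretical mass of each combination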

I had this loop running for approximately 2 hours and it only got through about 5,700 iterations. An example of the input and output would be:

peaks <- c(178.1161, 182.0530, 186.1223)
mass_combo <- c(154.1161, 166.1161, 178.1161, 190.1161, 202.1161, 214.1161, 226.1161, 170.053, 
                182.053, 194.053, 206.053, 709.0452, 721.0452, 182.0530, 194.0530, 186.1223, 
                198.1223, 210.1223)

ID_index
V1          V2
178.1161    3
182.0530    9, 14
186.1223    16

Is there a better way to perform this indexing and identify the theoretical value matches for each experimental result? Any assistance or suggestions would be greatly appreciated.


Solution

  • After some algebra to rearrange the comparison (abs((mass_combo - peaks[x])/mass_combo)*1e6 < 1 is equivalent to mass_combo - mass_combo/1e6 < peaks[x] < mass_combo + mass_combo/1e6), the operation can be implemented as a data.table non-equi join on those lower and upper bounds, which will be much faster than looping.

    Demonstrating on the example data:

    library(data.table)
    # add ppm bounds and a row index to each theoretical mass, then non-equi join the peaks
    data.table(x = mass_combo)[,`:=`(min = x - x/1e6, max = x + x/1e6, r = .I)][
      data.table(y = peaks), .(peak = y, r), on = .(min < y, max > y)
    ][,.(idx = .(r)), peak]
    #>        peak    idx
    #>       <num> <list>
    #> 1: 178.1161      3
    #> 2: 182.0530   9,14
    #> 3: 186.1223     16
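
    If the original two-column layout of ID_index (the peak plus a comma-separated string of matching indices) is preferred, the list column can be flattened afterwards; a minimal sketch, assuming the join result above has been assigned to a data.table called res (a name used here only for illustration):

    # collapse the list column of indices into a comma-separated string,
    # matching the V1/V2 layout produced by the original loop
    res[, .(V1 = peak, V2 = sapply(idx, toString))]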
    

    Or to get the mass_combo values instead of indices, per the comments:

    data.table(x = mass_combo)[,`:=`(min = x - x/1e6, max = x + x/1e6)][
      data.table(y = peaks), .(peak = y, x), on = .(min < y, max > y)
    ][,.(mass_combo = .(x)), peak]
    #>        peak      mass_combo
    #>       <num>          <list>
    #> 1: 178.1161        178.1161
    #> 2: 182.0530 182.053,182.053
    #> 3: 186.1223        186.1223
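
    If both the row indices and the matching masses are wanted at once, the two can be combined in a single join; a sketch using the same bounds as above:

    # keep both the row index (r) and the theoretical mass (x) for each match
    data.table(x = mass_combo)[,`:=`(min = x - x/1e6, max = x + x/1e6, r = .I)][
      data.table(y = peaks), .(peak = y, r, x), on = .(min < y, max > y)
    ][,.(idx = .(r), mass_combo = .(x)), peak]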
    

    Timing on a dataset with the approximate sizes of your actual data:

    peaks <- runif(13e4)
    mass_combo <- runif(15e7)
    
    system.time(
      ID_index <-
        data.table(x = mass_combo)[,`:=`(min = x - x/1e6, max = x + x/1e6, r = .I)][
          data.table(y = peaks), .(peak = y, r), on = .(min < y, max > y)
        ][,.(idx = .(r)), peak]
    )
    #>    user  system elapsed 
    #>   71.22    4.47   25.64
    

    The first 10 rows:

    ID_index[1:10]
    #>           peak                                                       idx
    #>          <num>                                                    <list>
    #>  1: 0.54024270        267797,2733918,3335681,3499464,3800516,4405793,...
    #>  2: 0.52608405        263261, 383398,1462884,1510015,2504774,2521597,...
    #>  3: 0.41793366        847991,1156169,1213597,1489392,1680001,3838945,...
    #>  4: 0.24143285        512967,1296863,3515676,4526957,5952832,6020254,...
    #>  5: 0.89169524        285889, 785031, 829258, 966923,1044726,2564670,...
    #>  6: 0.09785744 20626427,23309899,23498292,23712990,23870867,35958767,...
    #>  7: 0.07118160   681958, 7164888,10311600,11386095,12753834,34130008,...
    #>  8: 0.27893975       1378505,2577870,3153625,3320806,5095727,9098257,...
    #>  9: 0.01651383  1503323, 5273696,27981236,30451611,42380615,42807730,...
    #> 10: 0.61528940       2191889,2570098,2979545,3398146,4041039,4652609,...
    

    Compare against a looping approach on vectors that are 10x smaller:

    peaks <- runif(13e3)
    mass_combo <- runif(15e6)
    
    system.time({ # from @ThomasIsCoding
      subset(
        data.frame(
          V1 = peaks,
          V2 = unlist(
            lapply(
              peaks,
              function(x) {
                toString(which(abs(1 - x / mass_combo) < 1e-6))
              }
            ), FALSE, FALSE
          )
        ),
        nzchar(V2)
      )
    })
    #>    user  system elapsed 
    #>  920.64  542.20 1463.61
    
    system.time(
      data.table(x = mass_combo)[,`:=`(min = x - x/1e6, max = x + x/1e6, r = .I)][
        data.table(y = peaks), .(peak = y, r), on = .(min < y, max > y)
      ][,.(idx = .(r)), peak]
    )
    #>    user  system elapsed 
    #>    5.15    0.26    2.14