rdistancechemistrylist-manipulationsnowfall

R: quickly match records based on custom distance function and multiple criteria


I made some functions in R to match up chemical mass spectra (a matrix with two columns with integer masses and intensities) to libraries of such spectra based on a custom spectral similarity function and matching of the so-called retention index of the compounds (i.e. the elution time) (see here for an example, http://webbook.nist.gov/cgi/cbook.cgi?ID=C630035&Mask=200). For that, the list element "RI" of each record has to be compared to that in the library and when the absolute deviation is less than a given tolerance it should add the best spectral library match to my records. Below is some code that I wrote to do this, but the problem is that it is far too slow for my purposes (I typically have about 1000 sample spectra and 200 000 library spectra). I tried parallelizing it but that also didn't seem to help much. Any thoughts perhaps on how I could make the code below more efficient, e.g. using more vectorization, or using inline C code, or some other R tricks? I am aware of the general advice in this respect, but don't quite see how to implement it easily in this case (and I am not so proficient in C yet unfortunately)... Any thoughts or advice? Oh yes, and how could I add a progress bar when using sfLapply? Would it help perhaps to put my spectra in one big (sparse, as there are lots of zeros) matrix first, to avoid the merge step in the spectral similarity function, or use additional criteria, such as only consider the spectra when the biggest/most intense peak in the query spectrum has the same mass as the library spectrum (or is contained in the set of say the 5 biggest peaks in the library spectrum)? Anyway, any thoughts on how to speed up this task would be much appreciated!

EDIT: one residual query I still have is how I could avoid making a complete copy of sample records recs in function addbestlibmatches1, but rather change only the records in place for which there is a library match? Also, passing the selection of the library records for which there is a retention index match is probably not efficient (in function addbestlibmatch). Any thoughts how I could avoid this?

# EXAMPLE DATA

rec1=list(RI=1100,spectrum=as.matrix(cbind(mz=c(57,43,71,41,85,56,55,70,42,84,98,99,39,69,58,113,156),int=c(999,684,396,281,249,173,122,107,94,73,51,48,47,47,37,33,32))))
randrec=function() list(RI=runif(1,1000,1200),spectrum=as.matrix(cbind(mz=seq(30,600,1),int=round(runif(600-30+1,0,999)))))

# spectral library
libsize=2000 # my real lib has 200 000 recs
lib=lapply(1:libsize,function(i) randrec())
lib=append(list(rec1),lib) 

# sample spectra
ssize=100 # I usually have around 1000
s=lapply(1:ssize,function(i) randrec())
s=append(s,list(rec1)) # we add the first library record as the last sample spectrum, so this should match



# SPECTRAL SIMILARITY FUNCTION

SpecSim=function (ms1,ms2,log=F) { 
  alignment = merge(ms1,ms2,by=1,all=T)
  alignment[is.na(alignment)]=0
  if (nrow(alignment)!=0) {
    alignment[,2]=100*alignment[,2]/max(alignment[,2]) # normalize base peak intensities to 100
    alignment[,3]=100*alignment[,3]/max(alignment[,3])
    if (log==T) {alignment[,2]=log2(alignment[,2]+1);alignment[,3]=log2(alignment[,3]+1)} # work on log2 intensity scale if requested
    u = alignment[,2]; v = alignment[,3]
    similarity_score = as.vector((u %*% v) / (sqrt(sum(u^2)) * sqrt(sum(v^2))))
    similarity_score[is.na(similarity_score)]=-1
    return(similarity_score)} else return(-1) }



# FUNCTION TO CALCULATE SIMILARITY VECTOR OF SPECTRUM WITH LIBRARY SPECTRA

SpecSimLib=function(spec,lib,log=F) {
  sapply(1:length(lib), function(i) SpecSim(spec,lib[[i]]$spectrum,log=log)) }



# FUNCTION TO ADD BEST MATCH OF SAMPLE RECORD rec IN SPECTRAL LIBRARY lib TO ORIGINAL RECORD
# we only compare spectra when list element RI in the sample record is within tol of that in the library
# when there is a spectral match > specsimcut within a RI devation less than tol,
# we add the record nrs in the library with the best spectral matches, the spectral similarity and the RI deviation to recs

addbestlibmatch=function(rec,lib,xvar="RI",tol=10,log=F,specsimcut=0.8) {
    nohit=list(record=-1,specmatch=NA,RIdev=NA)
    selected=abs(sapply(lib, "[[", xvar)-rec[[xvar]])<tol
    if (sum(selected)!=0) {
    specsims=SpecSimLib(rec$spectrum,lib[selected],log) # HOW CAN I AVOID PASSING THE RIGHT LIBRARY SUBSET EACH TIME?
    maxspecsim=max(specsims)
    if (maxspecsim>specsimcut) {besthsel=which(specsims==maxspecsim)[[1]] # nr of best hit among selected elements, in case of ties we just take the 1st hit
                                idbesth=which(selected)[[besthsel]] # record nr of best hit in library lib
                                return(modifyList(rec,list(record=idbesth,specsim=specsims[[besthsel]],RIdev=rec[[xvar]]-lib[[idbesth]][[xvar]])))}
                                else {return(rec)} } else {return(rec)}
}



# FUNCTION TO ADD BEST LIBRARY MATCHES TO RECORDS RECS

library(pbapply)
addbestlibmatches1=function(recs,lib,xvar="RI",tol=10,log=F,specsimcut=0.8) {
  pblapply(1:length(recs), function(i) addbestlibmatch(recs[[i]],lib,xvar,tol,log,specsimcut))
}

# PARALLELIZED VERSION
library(snowfall)
addbestlibmatches2=function(recs,lib,xvar="RI",tol=10,log=F,specsimcut=0.8,cores=4) {
  sfInit(parallel=TRUE,cpus=cores,type="SOCK")
  sfExportAll()
  sfLapply(1:length(recs), function(i) addbestlibmatch(recs[[i]],lib,xvar,tol,log,specsimcut))
  sfStop() 
}



# TEST TIMINGS

system.time(addbestlibmatches1(s,lib))
#|++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
#user  system elapsed 
#83.60    0.06   83.82 

system.time(addbestlibmatches2(s,lib))
#user  system elapsed  - a bit better, but not much
#2.59    0.74   42.37 

Solution

  • Without looking at all of your code in detail, I think there is room for improvement in the SpecSim function without going all C++ yet. You're using merge, which coerces your matrices to data.frames. That's always going to be bad for performance. Most of your code time is probably in merge() and subsetting. data.frame subsetting is slow, matrix or vector is fast.

    SpecSim2 <- function (ms1,ms2,log=F) {
      i <- unique(c(ms1[,1], ms2[,1]))
      y <- x <- numeric(length(i))
      x[match(ms1[,1], i)] <- ms1[, 2]
      y[match(ms2[,1], i)] <- ms2[, 2]
      x <- 100 * x / max(x)
      y <- 100 * y / max(y)
      if (log){
        x <- log2(x + 1)
        y <- log2(y + 1)
      }
      similarity.score <- x %*% y / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
      if (is.na(similarity.score)){
        return(-1)
      } else {
        return(similarity.score)
      }
    }
    

    Here are some timings for the rewrite and the original, respectively:

    > system.time(addbestlibmatches1(s,lib))
      |++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
       user  system elapsed 
       4.16    0.00    4.17
    
    > system.time(addbestlibmatches1(s,lib))
      |++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
       user  system elapsed 
      34.25    0.02   34.34 
    

    May not get you the speed you need, but better than 8x improvement...

    As to how to deal with the addbestlibmatches(), I think you need to rethink the problem as a matrix problem. Rather than using lists to hold your library, use a vector for your RI values for both the library and the samples. Then you can do your initial screen like this:

    selected <- outer(sRI, libRI, FUN = '-') < 10
    

    If your library is one big matrix (masses x spectra), then you can subset the library for selected spectra and calculate the distance between a sample and the entire portion of the library that was selected in one pass like this:

    SpecSimMat <- function(x, lib, log = F){
      stopifnot(require(matrixStats))
      x <- 100 * x / max(x)
      lib <- sweep(lib, 2, colMaxs(lib))
      x %*% lib / (sqrt(sum(x^2)) * sqrt(colSums(lib^2)))
    

    }

    where x is a sample and lib is a matrix of the selected spectra (masses x spectra). In this way, your subsetting a matrix (fast) and then performing a series of matrix operations (fast).