rperformancedistributionkernlab

Make program based on kernlab function run faster


For a range I want to identify where changes in distribution happens and where that value is the maximum. Currently I am using a kernel maximum discrepancy test for every value in the range and I am taking the 200 values before and after that value then I extract the locations where the mmd statistic is maximum. But this is very computationally intensive to calculate in R. Please note that I am using kernlab to calculate kmmd. I want to know if there is a way to do this faster? Or if you have any suggestions. Any help would be appreciated.

My code is:

    cvg<-seq(1,2000)
    cvg<-cvg^3-2*cvg^2+5*cvg
    myRange<-seq(400:(length(cvg)-400))
    kernel<-"splinedot"
    cvg[201:(length(cvg)-200)]->cvg
    myRange<-seq(400:(length(cvg)-400))
    lapply(myRange, function(x) mmdstats(kmmd((as.matrix(cvg[(x+1):(x+400)])), (as.matrix(cvg[(x+801):(x+1200)])), kernel=kernel)))->kmm.ls
    as.data.frame(as.matrix(kmm.ls))->kmm.ls
    lapply(kmm.ls, function(x) which.max(mmdstats(x)))->store.max

Solution

  • I state that I am not an expert on the subject with kernlab so I can not judge the correctness of your analysis or improve your code. However, I can suggest you convert your lapply call to a parallelized version such as sfLapply, parLapply, mclapply future_lapply ecc. Here I post an example with sfLapply from the snowfall package(which is really straightforward imo):

    #your original lapply call took 500 seconds on my PC
    system.time(kmm.ls <- lapply(myRange, function(x) mmdstats(kmmd((as.matrix(cvg[(x+1): 
    (x+400)])), (as.matrix(cvg[(x+801):(x+1200)])), kernel=kernel))))
    
    
    library(kernlab)
    library(snowfall)
    sfInit(parallel=TRUE,cpus = parallel::detectCores()-1)
    # Load the required packages inside the cluster
    sfLibrary(kernlab)
    #export all variable in all the cluster
    sfExportAll()
    # Run parallelized lapply with custom function  
    #sfLapply took 22 second on my 48 cores PC
    system.time(kmm.ls <- sfLapply(myRange, function(x) 
    mmdstats(kmmd((as.matrix(cvg[(x+1):(x+400)])), (as.matrix(cvg[(x+801):(x+1200)])), 
    kernel=kernel))))
    #stop cluster
    sfStop()
    

    This is an example with only the first lapply call of your code, but the same idea can be applied to the second call (when I tried to run your code, the second lapply call give me an error)

    Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘mmdstats’ for signature ‘"list"’

    It doesn't seem a critical error but as I said I don't feel prepared to advise how to fix it.