python r performance rpy2

For loop is several times faster in R than in Python using the rpy2 library


The following simple for loop takes about 3 seconds to complete in R:

library(MASS)
nruns <- 2000
nelems <- 50
maxX <- 1
maxY <- 1
for(i in 1:nruns) {
    dataX <- runif(nelems, 0, maxX)
    dataY <- runif(nelems, 0, maxY)
    kde2d(dataX, dataY, n=50, lims=c(0, maxX, 0, maxY) )
}

The same code run in Python through the rpy2 library takes 4 to 5 times longer:

from rpy2.robjects import r
from rpy2.robjects.packages import importr
importr('MASS')

nruns = 2000
r.assign('nelems', 50)
r.assign('maxX', 1)
r.assign('maxY', 1)
for _ in range(nruns):
    r('dataX <- runif(nelems, 0, maxX)')
    r('dataY <- runif(nelems, 0, maxY)')
    r('kde2dmap <- kde2d(dataX, dataY, n=50, lims=c(0, maxX, 0, maxY))')

Is this just because I'm using the rpy2 library to communicate with R or is there something else at play? Can this be improved in any way (while still running the code in Python)?


Solution

  • A 4 to 5 times slowdown seems like a lot, but it could happen if you are using custom conversion (rpy2 can convert R objects to arbitrary Python objects on the fly; see the documentation).

    Or maybe you are on an HPC system where Python and its packages are installed on slow-ish NFS storage while R is on faster local disks (this could make a big difference to the startup time).

    Otherwise, you can keep the loop in R to check whether this changes the running time:

    from rpy2.robjects import r
    from rpy2.robjects.packages import importr
    
    # importr('MASS')
    # Calling 'importr' performs quite a bit of work behind the
    # scenes. That work allows a more intuitive/pythonic use of the
    # content of the R library "MASS", but if you are just passing
    # a string to R for evaluation you can skip it and
    # replace it with the following:
    r('library("MASS")')
    
    nruns = 2000
    r.assign('nelems', 50)
    r.assign('maxX', 1)
    r.assign('maxY', 1)
    r.assign('nruns', nruns)
    r("""
    for(i in 1:nruns) {
      dataX <- runif(nelems, 0, maxX)
      dataY <- runif(nelems, 0, maxY)
      kde2dmap <- kde2d(dataX, dataY, n=50, lims=c(0, maxX, 0, maxY) )
    }
    """)
    

    Speed improvements will come from the following:

    An additional comment about performance: rpy2's transition from a C extension to cffi has led to significant improvements in the structure of the code managing the dialog with R's C API (and with that a number of tricky bugs were fixed), but at the temporary cost of performance here and there. Optimizations for speed are being progressively reintroduced.
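To check whether keeping the loop in R changes the running time, as suggested above, the two variants can be timed side by side. The following is a minimal sketch, not part of rpy2: the helper names `time_it`, `loop_in_python`, `loop_in_r` and `compare` are hypothetical, and the `r` argument is assumed to be the `rpy2.robjects.r` object (any callable that evaluates a string of R code would do).

```python
import time

# The three R statements from the question, evaluated once per iteration.
R_STATEMENTS = [
    "dataX <- runif(nelems, 0, maxX)",
    "dataY <- runif(nelems, 0, maxY)",
    "kde2dmap <- kde2d(dataX, dataY, n=50, lims=c(0, maxX, 0, maxY))",
]


def time_it(fn, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls to fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def loop_in_python(r, nruns):
    """Loop in Python: three r() calls per iteration, as in the question."""
    for _ in range(nruns):
        for statement in R_STATEMENTS:
            r(statement)


def loop_in_r(r, nruns):
    """Loop in R: a single r() call evaluates the whole for loop."""
    body = "\n".join(R_STATEMENTS)
    r("for (i in 1:%d) {\n%s\n}" % (nruns, body))


def compare(r, nruns=2000):
    """Set up the R session, time both variants, and print the results."""
    r('library("MASS")')
    r("nelems <- 50; maxX <- 1; maxY <- 1")
    t_python = time_it(lambda: loop_in_python(r, nruns))
    t_r = time_it(lambda: loop_in_r(r, nruns))
    print(f"loop in Python: {t_python:.2f} s, loop in R: {t_r:.2f} s")
    return t_python, t_r
```

With rpy2 and the MASS package installed, running `from rpy2.robjects import r` followed by `compare(r)` prints both timings. If the two numbers are close, the per-call overhead of `r()` is probably not the main cost, and startup time or conversion settings are more likely candidates.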