rfor-loopr-rasterpairwise-distancesdmtools

Pairwise raster comparison in R: alternative to for-loop?


How to efficiently compare pairs of distribution rasters (raster layers containing only 0 and 1)? I need to get a measure of the similarity among ~6500 individual global rasters. Istat from SDMTools should do the job.

Here is my code:

library(raster)
library(SDMTools)

Create reproducible example data: rasters with values of 0 and 1

# first raster
r1 <- raster(nrow=1800, ncol=3600, xmn=-18000000, xmx=18000000, ymn=-9000000, ymx=9000000, 
             crs="+proj=moll +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +units=m +no_defs", 
             resolution=10000, vals=0)
r2 <- raster(nrow=1800, ncol=3600, xmn=-18000000, xmx=0, ymn=0, ymx=9000000, 
             crs="+proj=moll +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +units=m +no_defs", 
             resolution=10000, vals=2)
r12 <- mosaic(r1, r2, fun=mean)

# second raster
r3 <- raster(nrow=1800, ncol=3600, xmn=-18000000, xmx=18000000, ymn=-9000000, ymx=9000000, 
             crs="+proj=moll +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +units=m +no_defs", 
             resolution=10000, vals=0)
r4 <- raster(nrow=1800, ncol=3600, xmn=-12000000, xmx=15000000, ymn=2000000, ymx=3000000, 
             crs="+proj=moll +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +units=m +no_defs", 
             resolution=10000, vals=2)
r34 <- mosaic(r3, r4, fun=mean)

List rasters

files_list <- list(r12, r34)

Create empty matrix to fill with data from loop

ras_comp <- matrix(NA, nrow=length(files_list), ncol=length(files_list))
ras_comp
# label rows and columns of matrix
rownames(ras_comp) <- c("r12", "r34")
colnames(ras_comp) <- c("r12", "r34")
ras_comp

Loop to compare all possible pairs of matrices/rasters

for (i in 1:length(files_list)) {
  # load raster i
  ras_i <- as.matrix(files_list[[i]])

  for (j in 1:length(files_list)) {
    # load raster j
    ras_j <- as.matrix(files_list[[j]])

    # compare both rasters
    ras_Istat <- Istat(ras_i, ras_j, old=F)

    # write value into matrix
    ras_comp[i,j] <- ras_Istat
  }
}

Check final matrix

ras_comp
> ras_comp
          r12       r34
r12 1.0000000 0.1814437
r34 0.1814437 1.0000000

Converting the rasters to matrices with as.matrix reduces the calculation time significantly and the resulting final table is what I need but doing this for thousands of rasters takes forever to complete. How can the code be optimized in order to compare the rasters in a more efficient way?


Solution

  • Istat does a bunch of tests and scalings before a simple computation. If you know those tests pass you can do the scalings as a one off and work on the scaled values. It does:

    if (length(which(dim(x) == dim(y))) != 2) 
        stop("matrix / raster objects must be of the same extent")
    if (min(c(x, y), na.rm = T) < 0) 
        stop("all values must be positive")
    

    Then it checks where both rasters are "finite", which includes NA values:

    pos = which(is.finite(x) & is.finite(y))
    

    It then computes the scaled values of the rasters:

    px = x[pos]/sum(x[pos])
    py = y[pos]/sum(y[pos])
    H = sqrt(sum((sqrt(px) - sqrt(py))^2))
    

    If old=FALSE like you have then it returns:

        return(1 - (H^2)/2)
    
    > Istat(r12,r34)
    [1] 0.1814437
    

    If I cut out the tests and write a function that works on the scaled values I can get it down to:

    fIstat = function(px,py){
        1 - (sum((sqrt(px) - sqrt(py))^2))/2
    }
    

    Test by scaling the rasters and running:

    r12px = r12[]/sum(r12[])
    r34px = r34[]/sum(r34[])
    fIstat(r12px, r34px)
    # [1] 0.1814437
    

    Same value. Great, but is it quicker?

    > microbenchmark(fIstat(r12px, r34px), Istat(r12,r34))
    Unit: milliseconds
                     expr        min         lq       mean     median         uq
     fIstat(r12px, r34px)   49.95867   78.28649   78.10863   79.45235   80.85234
          Istat(r12, r34) 1084.84825 1181.31116 1217.64122 1212.93180 1263.50811
           max neval
      106.6803   100
     1349.0239   100
    

    Yes, by a large factor.

    So... if your data have no missing values or infinities, create a files_list of these scaled raster values, call my fIstat, only loop over the upper triangle, and you should speed this up by a factor of 10.