How can I efficiently compare pairs of distribution rasters (raster layers containing only 0 and 1)? I need a measure of the similarity among ~6500 individual global rasters. Istat from SDMTools should do the job.
Here is my code:
library(raster)
library(SDMTools)
Create reproducible example data: rasters with values of 0 and 1
# first raster
r1 <- raster(nrow=1800, ncol=3600, xmn=-18000000, xmx=18000000, ymn=-9000000, ymx=9000000,
             crs="+proj=moll +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +units=m +no_defs",
             resolution=10000, vals=0)
r2 <- raster(nrow=1800, ncol=3600, xmn=-18000000, xmx=0, ymn=0, ymx=9000000,
             crs="+proj=moll +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +units=m +no_defs",
             resolution=10000, vals=2)
r12 <- mosaic(r1, r2, fun=mean)
# second raster
r3 <- raster(nrow=1800, ncol=3600, xmn=-18000000, xmx=18000000, ymn=-9000000, ymx=9000000,
             crs="+proj=moll +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +units=m +no_defs",
             resolution=10000, vals=0)
r4 <- raster(nrow=1800, ncol=3600, xmn=-12000000, xmx=15000000, ymn=2000000, ymx=3000000,
             crs="+proj=moll +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +units=m +no_defs",
             resolution=10000, vals=2)
r34 <- mosaic(r3, r4, fun=mean)
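A quick check confirms that the mosaics contain only 0s and 1s (in the overlap, fun=mean averages 0 and 2 to 1):
sort(unique(values(r12)))
# [1] 0 1
sort(unique(values(r34)))
# [1] 0 1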
List rasters
files_list <- list(r12, r34)
Create empty matrix to fill with data from loop
ras_comp <- matrix(NA, nrow=length(files_list), ncol=length(files_list))
ras_comp
# label rows and columns of matrix
rownames(ras_comp) <- c("r12", "r34")
colnames(ras_comp) <- c("r12", "r34")
ras_comp
Loop to compare all possible pairs of matrices/rasters
for (i in 1:length(files_list)) {
  # convert raster i to a matrix
  ras_i <- as.matrix(files_list[[i]])
  for (j in 1:length(files_list)) {
    # convert raster j to a matrix
    ras_j <- as.matrix(files_list[[j]])
    # compare both rasters
    ras_Istat <- Istat(ras_i, ras_j, old=FALSE)
    # write the value into the matrix
    ras_comp[i,j] <- ras_Istat
  }
}
Check final matrix
ras_comp
> ras_comp
          r12       r34
r12 1.0000000 0.1814437
r34 0.1814437 1.0000000
Converting the rasters to matrices with as.matrix reduces the calculation time significantly, and the resulting final table is what I need. But doing this for thousands of rasters takes forever to complete. How can the code be optimized to compare the rasters more efficiently?
Istat does a bunch of tests and scalings before a simple computation. If you know those tests will pass, you can do the scalings as a one-off and work on the scaled values. It does:
if (length(which(dim(x) == dim(y))) != 2)
    stop("matrix / raster objects must be of the same extent")
if (min(c(x, y), na.rm = T) < 0)
    stop("all values must be positive")
Then it finds the cells where both rasters are "finite", which also filters out NA values:
pos = which(is.finite(x) & is.finite(y))
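is.finite() is FALSE for NA as well as NaN and ±Inf, so pos keeps only the cells that hold a usable number in both rasters. A minimal illustration on made-up vectors:
x <- c(0, 1, NA, 2)
y <- c(1, NaN, 0, 3)
which(is.finite(x) & is.finite(y))
# [1] 1 4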
It then computes the scaled values of the rasters:
px = x[pos]/sum(x[pos])
py = y[pos]/sum(y[pos])
H = sqrt(sum((sqrt(px) - sqrt(py))^2))
With old=FALSE, as in your call, it returns:
return(1 - (H^2)/2)
H here is the Hellinger distance between the two scaled distributions, so the returned statistic runs from 0 (no overlap) to 1 (identical distributions).
> Istat(r12,r34)
[1] 0.1814437
If I cut out the tests and write a function that works on pre-scaled values, I can get it down to:
fIstat = function(px, py) {
  # px and py must already be scaled to sum to 1
  1 - (sum((sqrt(px) - sqrt(py))^2))/2
}
Test by scaling the rasters and running:
r12px = r12[]/sum(r12[])   # r12[] extracts the cell values as a plain vector
r34px = r34[]/sum(r34[])
fIstat(r12px, r34px)
# [1] 0.1814437
Same value. Great, but is it quicker? Checking with the microbenchmark package:
> microbenchmark(fIstat(r12px, r34px), Istat(r12, r34))
Unit: milliseconds
                 expr        min         lq       mean     median         uq       max neval
 fIstat(r12px, r34px)   49.95867   78.28649   78.10863   79.45235   80.85234  106.6803   100
      Istat(r12, r34) 1084.84825 1181.31116 1217.64122 1212.93180 1263.50811 1349.0239   100
Yes, by a large factor.
So... if your data have no missing values or infinities: create a files_list of these scaled raster values, call my fIstat on each pair, loop over the upper triangle only (the result matrix is symmetric, with 1s on the diagonal), and you should speed this up by a factor of 10. A sketch is below.
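A minimal sketch of that workflow, assuming all rasters fit in memory as plain numeric vectors and contain no NA/NaN/Inf values (the list names are only there to label the output):
# one-off: extract cell values and scale each raster to sum to 1
files_list <- list(r12 = r12, r34 = r34)
scaled_list <- lapply(files_list, function(r) { v <- r[]; v / sum(v) })

n <- length(scaled_list)
ras_comp <- matrix(NA_real_, nrow = n, ncol = n,
                   dimnames = list(names(scaled_list), names(scaled_list)))
diag(ras_comp) <- 1  # every raster is identical to itself

# fill the upper triangle only and mirror it: roughly halves the comparisons
for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    ras_comp[i, j] <- ras_comp[j, i] <- fIstat(scaled_list[[i]], scaled_list[[j]])
  }
}
ras_comp
#           r12       r34
# r12 1.0000000 0.1814437
# r34 0.1814437 1.0000000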