Tags: r, linux, cluster-computing, raster, rasterize

Rasterize polygons in R using snowfall & sfLapply


I would like to rasterize a very large vector file to 25m and have had some success with the 'cluster' package, adapting the questions here and here, which worked nicely for that particular data.

However, I now have a larger vector file that needs rasterizing and have access to a cluster that uses snowfall. I'm not used to cluster functions and I'm just not sure how to set up sfLapply. I am consistently getting the following sort of error as sfLapply is called in the cluster:

Error in checkForRemoteErrors(val) : 
  one node produced an error: 'quote(96)' is not a function, character or symbol
Calls: sfLapply ... clusterApply -> staticClusterApply -> checkForRemoteErrors

My full code:

library(snowfall)
library(rgeos)
library(maptools)
library(raster)
library(sp)

setwd("/home/dir/")

# Initialise the cluster...
hosts = as.character(read.table(Sys.getenv('PBS_NODEFILE'),header=FALSE)[,1]) # read the nodes to use
sfSetMaxCPUs(length(hosts)) # make sure the maximum allowed number of CPUs matches the number of hosts
sfInit(parallel=TRUE, type="SOCK", socketHosts=hosts, cpus=length(hosts), useRscript=TRUE) # initialise a socket cluster session with the named nodes
sfLibrary(snowfall)

# read in required data

shp <- readShapePoly("my_data.shp")
BNG <- "+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +datum=OSGB36 +units=m +no_defs"
crs(shp) <- BNG

### rasterize the uniques to 25m and write (GB and clipped) ###
rw <- raster(res=c(25,25), xmn=0, xmx=600000, ymn=0, ymx=1000000, crs=BNG)

# Number of polygons features in SPDF
features <- 1:nrow(shp[,])

# Split features in n parts
n <- 96
parts <- split(features, cut(features, n))

rasFunction = function(X, shape, raster, nparts){
    ras = rasterize(shape[nparts[[X]],], raster, 'CODE')
    return(ras)
}

# Export everything in the workspace onto the cluster...
sfExportAll()

# Distribute calculation across the cluster nodes...
rDis = sfLapply(n, fun=rasFunction,X=n, shape=shp, raster=rw, nparts=parts) # equivalent of sapply
rMerge <- do.call(merge, rDis)

writeRaster(rMerge, filename="my_data_25m",  format="GTiff", overwrite=TRUE)

# Stop the cluster...
sfStop()

I've tried a number of things, changing the function and the sfLapply call, but I just can't get this to run. Thanks.


Solution

  • OK, so I abandoned snowfall and looked into gdalUtils::gdal_rasterize instead, and found a lot of benefits to using it (with one downside that someone might be able to answer?).

    Context & Issue: My vector data exist inside an ESRI file geodatabase and require some processing before rasterization. No problem, rgdal::readOGR handles that fine. However, gdal_rasterize requires a pathname to the vector data, and this is where I had trouble: I could not write out my processed vector data, because they exceed the maximum file size for a shapefile outside of a geodatabase, and gdal_rasterize will not accept objects, paths to .gdbs, or .Rdata/.rds files. How do I pass an object to gdal_rasterize?

    So I wrote out the large shapefile in segments, with the number of segments equal to the number of processors.

    Originally I used raster::rasterize, as I could simply pass the vector object held in memory to rasterize without the writing problem (though I would have liked to have it written out), and rasterized this data to 25m that way. It took a pretty long time, even in parallel; a rough sketch of that approach is below.
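
    (For comparison, a minimal sketch of the sort of parallel raster::rasterize approach this replaced, assuming objects along the lines of shp, rw and parts from the question above; this is illustrative only, not the exact code that was run, and the output file name is a placeholder.)

    # sketch only: raster::rasterize on in-memory vector chunks, in parallel
    require(parallel)
    require(raster)
    
    cl <- makeCluster(detectCores() - 1)
    clusterEvalQ(cl, require(raster))
    clusterExport(cl, c("shp", "rw", "parts"))   # exporting a large vector object to every node can itself be slow
    
    # each node rasterizes one chunk of the polygons held in memory
    rasParts <- parLapply(cl, seq_along(parts), function(x) rasterize(shp[parts[[x]],], rw, 'CODE'))
    stopCluster(cl)
    
    # merge the chunk rasters and write once (writeRaster as a separate step - see the caveat further down)
    rasAll <- do.call(merge, rasParts)
    writeRaster(rasAll, filename="my_data_25m_rasterize.tif", format="GTiff", overwrite=TRUE)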

    Solution: gdal_rasterize in parallel.

    # gdal_rasterize in parallel
    require(gdalUtils)
    require(rgdal)
    require(rgeos)
    require(cluster)
    require(parallel)
    require(raster)
    
    # read in vector data
    shape <- readOGR("./mygdb.gdb", layer="mydata",stringsAsFactors=F)
    
    ## do all the vector processing etc ##
    
    # split vector data into n parts, the same as number of processors (minus 1)
    npar <- detectCores() - 1
    features <- 1:nrow(shape[,])
    parts <- split(features, cut(features, npar))
    
    # write the vector parts out
    for(n in 1:npar){
      writeOGR(shape[parts[[n]],], ".\\parts", paste0("mydata_p",n), driver="ESRI Shapefile")
    }
    
    # set up and write a blank raster for gdal_rasterize for EACH vector segment created above
    r <- raster(res=c(25,25), xmn=234000, xmx=261000, ymn=229000, ymx=256000, crs=projection(shape))    
    for(n in 1:npar){
      writeRaster(r, filename=paste0(".\\gdal_p",n,".tif"), format="GTiff", overwrite=TRUE)
    }
    
    # set up cluster and pass required packages and objects to cluster
    cl <- makeCluster(npar)
    clusterEvalQ(cl, sapply(c('raster', 'gdalUtils',"rgdal"), require, char=TRUE))
    clusterExport(cl, list("r","npar"))
    
    # parallel apply the gdal_rasterize function against the vector parts that were written, 
    # same number as processors, against the pre-prepared rasters
    parLapply(cl = cl, X = 1:npar, fun = function(x)
      gdal_rasterize(src_datasource=paste0(".\\parts\\mydata_p",x,".shp"),
                     dst_filename=paste0(".\\gdal_p",x,".tif"),
                     b=1, a="code", verbose=F, output_Raster=T))
    
    # There are now n rasters representing the n segments of the original vector file
    # read in the rasters as a list, merge and write to a new tif. 
    s <- lapply(X=1:npar, function(x) raster(paste0(".\\gdal_p",x,".tif")))
    s$filename <- "myras_final.tif"
    do.call(merge,s)
    stopCluster(cl)
    

    The time for the entire job with this code (split roughly 60% for vector reading/processing/writing and 40% for raster generation and rasterization) was around 9 times quicker than raster::rasterize in parallel.

    Note: I initially tried this by splitting the vectors into n parts but creating only 1 blank raster. I then wrote to the same blank raster from all cluster nodes simultaneously, but this corrupted the raster and made it unusable in R/Arc/anything (despite the function completing without error). The approach above is more stable, but n blank rasters have to be made instead of 1, which increases processing time, and merging n rasters afterwards is extra processing.

    Caveat: in the parallel raster::rasterize run, writeRaster was not called inside the rasterize function but as a separate line, which will have increased processing time in the original run due to storage to temp files etc. A sketch of the inside-rasterize alternative follows.
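
    (To illustrate that caveat: raster::rasterize can also write directly to disk via its filename argument, which avoids the separate writeRaster step and the temp-file overhead; a minimal sketch for one chunk, assuming the shp, rw and parts objects from the question and a placeholder output name.)

    # sketch: write to disk inside rasterize() instead of a separate writeRaster() call
    ras1 <- rasterize(shp[parts[[1]],], rw, 'CODE',
                      filename="my_data_25m_part1.tif", format="GTiff", overwrite=TRUE)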

    EDIT: Why are the frequency tables of the raster from gdal_rasterize not the same as those from raster::rasterize? With 100 million cells I expect a bit of difference, but for some codes it was a few thousand cells different. I thought they both rasterized by centroid?
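
    (A minimal sketch of one way to tabulate and compare per-code cell counts from the two outputs, assuming both rasters have been written to disk; the file names here are placeholders for the merged gdal_rasterize result and the raster::rasterize result.)

    # sketch: compare per-code cell counts between the two rasterization methods
    r_gdal <- raster("myras_final.tif")
    r_rast <- raster("my_data_25m.tif")
    f_gdal <- as.data.frame(freq(r_gdal, useNA="no"))
    f_rast <- as.data.frame(freq(r_rast, useNA="no"))
    comp <- merge(f_gdal, f_rast, by="value", suffixes=c("_gdal", "_rasterize"))
    comp$diff <- comp$count_gdal - comp$count_rasterize
    comp[order(-abs(comp$diff)), ]   # codes with the largest discrepancies first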