r parallel-processing multiprocessing tidyverse rowwise

How to do parallel processing with rowwise


I am using rowwise to perform a function on each row. This takes a long time. In order to speed things up, is there a way to use parallel processing so that multiple cores are working on different rows concurrently?

As an example, I am aggregating PRISM weather data (https://prism.oregonstate.edu/) to the state level while weighting by population. This is based on https://www.patrickbaylis.com/blog/2021-08-15-pop-weighted-weather/.

Note that the below code requires downloads of daily weather data as well as the shapefile with population estimates at a very small geography.

library(prism)
library(tidyverse) 
library(sf)
library(exactextractr)
library(tigris)
library(terra)
library(raster)
library(ggthemes)

################################################################################
#get daily PRISM data
prism_set_dl_dir("/prism/daily/")
get_prism_dailys(type = "tmean", minDate = "2012-01-01", maxDate = "2021-07-31", keepZip=FALSE) 

#get states shape file and limit to lower 48
states = tigris::states(cb = TRUE, resolution = "20m") %>%
    filter(!NAME %in% c("Alaska", "Hawaii", "Puerto Rico"))

setwd("/prism/daily")

################################################################################
#get list of files in the directory, and extract date
##see if it is stable (TRUE) or provisional data (FALSE)
list <- ls_prism_data(name=TRUE) %>% mutate(date1=substr(files, nchar(files)-11, nchar(files)-4), 
            date2=substr(product_name, 1, 11),
            year = substr(date2, 8, 11), month=substr(date2, 1, 3), 
            month2=substr(date1, 5, 6), day=substr(date2, 5, 6),
            stable = str_detect(files, "stable"))

################################################################################
#function to get population weighted weather by state

#run the population raster outside of the loop
# SOURCE: https://sedac.ciesin.columbia.edu/data/set/usgrid-summary-file1-2000/data-download - Census 2000, population counts for continental US
pop_rast = raster("/population/usgrid_data_2000/geotiff/uspop00.tif")
pop_crop = crop(pop_rast, states)

states = tigris::states(cb = TRUE, resolution = "20m") %>%
    filter(!NAME %in% c("Alaska", "Hawaii", "Puerto Rico"))

daily_weather <- function(varname, filename, date) {
    weather_rast = raster(paste0(filename, "/", filename, ".bil"))
    
    weather_crop = crop(weather_rast, states)
    
    pop_rs = raster::resample(pop_crop, weather_crop)
    
    states$value <- exact_extract(weather_crop, states, fun = "weighted_mean", weights=pop_rs)
    names(states)[11] <- varname
    
    states <- data.frame(states) %>% arrange(NAME) %>% dplyr::select(c(6,11))
    states
}

################################################################################
days <- list %>% rowwise() %>% mutate(states = list(daily_weather("tmean", files, date1)))

As is, each row takes about 7 seconds, which adds up over 3,500 rows. I also want to get other variables besides tmean, so it will take a day or more to do everything unless I can speed it up.

I am mainly interested in solutions that use parallel processing with rowwise, but I also welcome other suggestions for speeding up the code.


Solution

  • You could try either purrr or its multiprocess equivalent furrr (map() or pmap() in purrr; future_map() or future_pmap() in furrr). The quickest method would be to use data.table. See this blog post that gives some benchmarks behind my recommendation.
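
As a rough illustration of the furrr route, here is a minimal sketch that replaces the rowwise() call with future_pmap() so that rows are spread across worker processes. The function slow_row_fn() and the table files_tbl are hypothetical stand-ins for daily_weather() and the file list in the question (they are not part of the original code), and the worker count of 2 is an arbitrary assumption. With the real raster workflow it would be worth checking that the objects the function relies on (states, pop_crop) export cleanly to the workers.

```r
library(tibble)
library(future)  # provides plan()
library(furrr)   # provides future_pmap()

# Hypothetical stand-in for daily_weather(): sleeps briefly to mimic slow
# raster work, then returns a small data frame for each input row.
slow_row_fn <- function(varname, filename, date) {
  Sys.sleep(0.2)
  out <- data.frame(NAME = c("Alabama", "Arizona"),
                    value = c(nchar(filename), nchar(date)))
  names(out)[2] <- varname
  out
}

# Hypothetical stand-in for the table built from ls_prism_data().
files_tbl <- tibble(
  files = c("file_20120101", "file_20120102", "file_20120103", "file_20120104"),
  date1 = c("20120101", "20120102", "20120103", "20120104")
)

# Start background R sessions; the number of workers is an assumption.
plan(multisession, workers = 2)

# One call per row, like rowwise(), but distributed over the workers.
# The result is a list column holding one data frame per row.
files_tbl$states <- future_pmap(
  list(files_tbl$files, files_tbl$date1),
  function(filename, date) slow_row_fn("tmean", filename, date)
)

# Return to sequential processing and shut the workers down.
plan(sequential)
```

Each worker is a separate R session, so there is some start-up and data-transfer overhead; the gain should be largest when each row is slow, as with the roughly 7 seconds per row described in the question.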