
R Markdown generates duplicate reports when rendered in parallel


I am generating several reports with R Markdown. Rendering them one by one works fine, and so does foreach with %do%. With %dopar%, one of three things happens:

  1. Sometimes everything works.
  2. Sometimes the reports have different file names but the same content.
  3. Sometimes pandoc fails with the error: pandoc document conversion failed with error 1

How can I fix this?

Here is the code that works in 100% of cases:

library(tidyverse)
library(parallel)
library(doParallel)



OutputFolder <- "c:\\temp\\test\\out"
result_foldername <- "Now"

ServersInDB <<- c("server1.ru", "server2.ru")

cores=detectCores(logical = FALSE)

cl <- parallel::makeCluster(cores-1) #not to overload your computer

registerDoParallel(cl)

render_all_obj <- function  (MachineName, OutputFolder, result_foldername)
{
  
  library(rmarkdown)
  render(input = "c:\\temp\\test\\proj\\Report.RMD",
         output_file = paste0(MachineName, ".html"),
         output_dir = file.path (OutputFolder, result_foldername  ),
         params = list(MachineName = MachineName)
  )
  
}

foreach (MachineName = ServersInDB) %do% {
  
  render_all_obj(MachineName, OutputFolder, result_foldername)
}

parallel::stopCluster(cl)

Here is the code that fails; it uses %dopar% instead of %do%:

library(tidyverse)
library(parallel)
library(doParallel)



OutputFolder <- "c:\\temp\\test\\out"
result_foldername <- "Now"

ServersInDB <<- c("server1.ru", "server2.ru")

cores=detectCores(logical = FALSE)

cl <- parallel::makeCluster(cores[1]-1) #not to overload your computer

registerDoParallel(cl)

render_all_obj <- function  (MachineName, OutputFolder, result_foldername)
{
  
  library(rmarkdown)
  render(input = "c:\\temp\\test\\proj\\Report.RMD",
         output_file = paste0(MachineName, ".html"),
         output_dir = file.path (OutputFolder, result_foldername  ),
         params = list(MachineName = MachineName)
  )
  
}

foreach (MachineName = ServersInDB) %dopar% {
  
  render_all_obj(MachineName, OutputFolder, result_foldername)
}

parallel::stopCluster(cl)

Here is my Rmd file:


---
output:
  html_document:
    toc: true
    dev: 'svg'
    number_sections: true
    toc_depth: 2
    toc_float: true
    theme: cerulean
    toc_collapsed: true
    self_contained: true
    mathjax: NULL

params: 
  MachineName: "ServerName" #name of server to analyze

---



```{r , echo=FALSE, include=FALSE, results='hide'}

MachineName <- params$MachineName

```



---
title: "My report is about: `r MachineName`"

---


Solution

  • The problem was the intermediate file Report.knit.md. By default, rmarkdown::render creates it in the directory of the file passed as input, which is the same directory for every parallel process. All processes therefore try to create, read, and write the same file at the same time.

    The workaround is to pass the intermediates_dir parameter and give every process its own unique temporary directory.

    Working solution:

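    library(future)
    library(furrr)
    
    # Use the physical cores, leaving one free, but always at least one worker
    workers <- max(1, parallel::detectCores(logical = FALSE) - 1)
    future::plan(multisession, workers = workers)
    
    
    ServersInDB <- c("server1.ru", "server2.ru")
    
    render_all_obj <- function(MachineName)
    {
      
      OutputFolder <- "c:/temp/test/out"
      result_foldername <- "Now"
      
      # Unique intermediates directory per call, so Report.knit.md is never shared
      tf <- tempfile()
      dir.create(tf)
      # unlink() only removes a directory when recursive = TRUE
      on.exit(unlink(tf, recursive = TRUE), add = TRUE)
      
      rmarkdown::render(input = "c:/temp/test/proj/Report.RMD",
                        output_file = paste0(MachineName, ".html"),
                        intermediates_dir = tf,
                        output_dir = file.path(OutputFolder, result_foldername),
                        params = list(MachineName = MachineName)
      )
      
    }
    
    
    # Render one report per server, each in its own background R session
    furrr::future_map(ServersInDB, render_all_obj)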
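
The same idea should also carry over to the original foreach / %dopar% setup from the question. Below is a minimal sketch, not a tested drop-in: new_scratch_dir and render_isolated are hypothetical helper names introduced here for illustration, and the paths in the commented usage are the ones from the question. Each call gets its own intermediates_dir, so no two workers touch the same Report.knit.md.

```r
# Hypothetical helper: create a fresh, uniquely named scratch directory.
new_scratch_dir <- function(prefix = "render_") {
  path <- tempfile(pattern = prefix)
  dir.create(path, recursive = TRUE)
  path
}

# Hypothetical wrapper: render one report with its own intermediates_dir.
render_isolated <- function(MachineName, input, output_dir) {
  scratch <- new_scratch_dir()
  # Clean up the scratch directory even if render() fails
  on.exit(unlink(scratch, recursive = TRUE), add = TRUE)

  rmarkdown::render(
    input = input,
    output_file = paste0(MachineName, ".html"),
    output_dir = output_dir,
    intermediates_dir = scratch,
    params = list(MachineName = MachineName)
  )
}

# Usage with the question's doParallel setup (not run here):
# library(doParallel)
# cl <- parallel::makeCluster(2)
# registerDoParallel(cl)
# foreach(MachineName = ServersInDB, .export = "new_scratch_dir") %dopar% {
#   render_isolated(MachineName,
#                   input = "c:/temp/test/proj/Report.RMD",
#                   output_dir = "c:/temp/test/out/Now")
# }
# parallel::stopCluster(cl)
```

The .export argument is there because foreach only auto-exports objects referenced directly in the loop body; a helper called from inside another function may otherwise be missing on the workers.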