Tags: r, purrr, checkpointing

purrr and map: how to save intermediate computations?


Please consider the snippet below. I would like to save the results of the computation (possibly as RDS files) while it progresses, e.g. every time another 10% of the list has been processed. How can I do that?

library(tidyverse)
ll <- 1:1000
res <- map(ll, \(x) cos(x))
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Debian GNU/Linux 12 (bookworm)
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
#>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
#>  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Europe/Brussels
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4    
#>  [5] purrr_1.0.2     readr_2.1.5     tidyr_1.3.1     tibble_3.2.1   
#>  [9] ggplot2_3.5.1   tidyverse_2.0.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.5      compiler_4.4.1    reprex_2.1.0      tidyselect_1.2.1 
#>  [5] scales_1.3.0      yaml_2.3.8        fastmap_1.1.1     R6_2.5.1         
#>  [9] generics_0.1.3    knitr_1.46        munsell_0.5.1     R.cache_0.16.0   
#> [13] tzdb_0.4.0        pillar_1.9.0      R.utils_2.12.3    rlang_1.1.3      
#> [17] utf8_1.2.4        stringi_1.8.4     xfun_0.43         fs_1.6.4         
#> [21] timechange_0.3.0  cli_3.6.2         withr_3.0.0       magrittr_2.0.3   
#> [25] digest_0.6.35     grid_4.4.1        hms_1.1.3         lifecycle_1.0.4  
#> [29] R.methodsS3_1.8.2 R.oo_1.26.0       vctrs_0.6.5       evaluate_0.23    
#> [33] glue_1.7.0        styler_1.10.3     fansi_1.0.6       colorspace_2.1-0 
#> [37] rmarkdown_2.26    tools_4.4.1       pkgconfig_2.0.3   htmltools_0.5.8.1

Created on 2024-06-27 with reprex v2.1.0


Solution

  • It turns out there is a package for that: currr ("checkpoint" + purrr). It doesn't save results in precisely the form you specified (see below for how to access intermediate results), but its functions (cp_map(), for example)

    create a secret folder in your current working directory and save the results if they reach a given checkpoint. This way if you rerun the code, it reads the result from the cache folder and starts to evaluate where you finished. [slightly edited from original]

    cp_map() has a cp_options= argument that allows you to specify how often to checkpoint (i.e., how many checkpoints per job) and where to store the results. The example below sets these through global options() instead.

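    library(currr)
    # Ask for 10 checkpoints per job, stored under ./checkpoints
    options(currr.n_checkpoint = 10, currr.folder = "checkpoints")
    # `name` identifies this job's cache folder
    cc <- cp_map(1:1000, cos, name = "cos_results")
    # Inspect the files written for this job
    list.files("checkpoints/cos_results")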
    

    If you want to look at the intermediate outputs directly (rather than using them through the package as an automated checkpointing system), you'll have to work out what the files listed below are. It looks like the out_* files store chunks of output (e.g. out_301.rds appears to hold the results of cos(301:400)).

     [1] "et_1.rds"    "et_101.rds"  "et_201.rds"  "et_301.rds"  "et_401.rds" 
     [6] "et_501.rds"  "et_601.rds"  "et_701.rds"  "et_801.rds"  "et_901.rds" 
    [11] "f.rds"       "id_1.rds"    "id_101.rds"  "id_201.rds"  "id_301.rds" 
    [16] "id_401.rds"  "id_501.rds"  "id_601.rds"  "id_701.rds"  "id_801.rds" 
    [21] "id_901.rds"  "meta.rds"    "out_1.rds"   "out_101.rds" "out_201.rds"
    [26] "out_301.rds" "out_401.rds" "out_501.rds" "out_601.rds" "out_701.rds"
    [31] "out_801.rds" "out_901.rds" "st_1.rds"    "st_101.rds"  "st_201.rds" 
    [36] "st_301.rds"  "st_401.rds"  "st_501.rds"  "st_601.rds"  "st_701.rds" 
    [41] "st_801.rds"  "st_901.rds"  "x.rds"
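
To make the chunk layout concrete, here is a minimal hand-rolled sketch of the same idea using only purrr and base R. It is not currr's implementation: checkpointed_map() is a hypothetical helper, and the out_<first index>.rds naming merely imitates the files listed above. It assumes a non-empty input, processes it in ten chunks, writes each finished chunk to its own RDS file, and skips chunks that are already on disk when rerun.

```r
library(purrr)

# Hypothetical helper (not part of purrr or currr): map `f` over `x` in
# `n_chunks` chunks, saving each finished chunk as <dir>/out_<first index>.rds.
# Assumes length(x) > 0.
checkpointed_map <- function(x, f, dir, n_chunks = 10) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)

  # Split positions 1..length(x) into contiguous chunks of (almost) equal size
  chunk_id <- ceiling(seq_along(x) * n_chunks / length(x))
  chunks <- split(seq_along(x), chunk_id)

  for (idx in chunks) {
    path <- file.path(dir, sprintf("out_%d.rds", idx[1]))
    # Skip chunks that are already on disk, so a rerun resumes where it stopped
    if (!file.exists(path)) {
      saveRDS(map(x[idx], f), path)
    }
  }

  # Read the chunks back in index order and concatenate them into one list
  paths <- file.path(dir, sprintf("out_%d.rds", map_int(chunks, 1L)))
  do.call(c, map(paths, readRDS))
}

# Usage: a temporary folder stands in for a real checkpoint directory
dir <- file.path(tempdir(), "cos_results")
res <- checkpointed_map(1:1000, cos, dir)
list.files(dir)
```

Because each chunk is a plain RDS file, a partial run can be inspected with readRDS() at any time. Note that this sketch resumes at chunk granularity and reuses cached chunks even if the function has changed, so the folder should be cleared when the computation itself changes.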