
Problems downloading a dataset for Torch for R


I'm trying to load the MNIST dataset in Torch for R, following https://skeydan.github.io/Deep-Learning-and-Scientific-Computing-with-R-torch/overfitting.html#classic-data-augmentation:

library(torch)
library(torchvision)
library(luz)

dir <- "~/.torch-datasets"

valid_ds <- mnist_dataset(
  dir,
  download = TRUE,
  train = FALSE,
  transform = transform_to_tensor
)

I may have done something stupid before the process completed the first time I ran the above code, since every time I try again, I get this error:

Error: [EEXIST] Failed to copy 'C:/Users/Admin/AppData/Local/torch/torch/Cache/mnist/train-images-idx3-ubyte.gz' to 'C:/Users/Admin/.torch-datasets/mnist/raw/train-images-idx3-ubyte.gz': file already exists

I've tried closing the R session (via the R GUI) and manually deleting both copies of the "mnist" folder containing the errant file, but all that changes is that when I restart R and rerun the above code, it downloads the .gz file, but doesn't open it, instead giving a different error:

Error in if (!tools::md5sum(destpath) == r[2]) runtime_error("MD5 sums are not identical for file: {r[1]}.") : 
  missing value where TRUE/FALSE needed

Then if I try again, I get the [EEXIST] error again.

How do I totally clear everything out so this file can be properly downloaded and put into action? I'm on Windows 11, using the currently newest versions of R and those three packages.


Solution

  • Downloading the dataset archives (*.gz) is handled by torchvision:::download_and_cache(), which stores them in a separate, persistent cache location, so restarting R will not clear it. mnist_dataset() uses those cached files as long as they are present; counterintuitively, setting download = TRUE does not change that.

    You can get the cache path with rappdirs::user_cache_dir("torch"). In your case it’s C:/Users/Admin/AppData/Local/torch/torch/Cache/ (taken from the error message).

    The location where torchvision::mnist_dataset() stores processed datasets and copies of the downloaded files is its first argument, the one you set with dir <- "~/.torch-datasets". The tilde is expanded to your R user “home” directory; on Windows this usually points to C:/Users/YourUsername/Documents, but in your case it’s apparently C:/Users/Admin/, so the final root path for mnist_dataset() is C:/Users/Admin/.torch-datasets/.
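
    To see where the tilde actually points in your own session, a minimal base R sketch like the one below may help. The variable names home and dataset_root are only illustrative, and the mnist/raw subfolder is taken from the paths in your error message.

```r
# Resolve "~" the same way R does when it is handed a path starting with a tilde
home <- path.expand("~")
dataset_root <- path.expand("~/.torch-datasets")

print(home)          # your R "home" directory
print(dataset_root)  # the dataset root that mnist_dataset() would receive

# Raw archives are copied below <dataset_root>/mnist/raw (per the error message)
print(file.path(dataset_root, "mnist", "raw"))
```

    If the printed dataset root is on a synced or network location, that is worth knowing before you retry.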

    The missing value where TRUE/FALSE needed part of your second error hints that tools::md5sum(destpath) was called with a non-existent destpath (the gz file location in the dataset folder): tools::md5sum() returns NA for a file it cannot read, and if() cannot evaluate NA. Perhaps copying the files from the cache location to the dataset location failed (silently, without an error) for some reason. This might be related to the source (the cache), the destination (the dataset root folder), your system state and setup (paths that are not local but mapped from cloud or network drives, antivirus, active policies, …), or your exact trail of actions since first calling mnist_dataset().
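
    The NA behaviour can be reproduced with base R alone. This is only a sketch of the mechanism, not torchvision's actual code: missing_path and the expected hash string are made-up placeholders, and the condition merely mimics the shape of the one shown in your error.

```r
# A path that does not exist, standing in for the missing .gz in the dataset folder
missing_path <- file.path(tempdir(), "no-such-file.gz")

# md5sum() yields NA for a file it cannot read
checksum <- tools::md5sum(missing_path)
print(is.na(checksum))

# Same shape as the failing check: `!NA == "..."` is NA, and if (NA) is an error
result <- tryCatch(
  {
    if (!checksum == "expected-md5-placeholder") stop("MD5 sums are not identical")
    "checksum ok"
  },
  error = function(e) conditionMessage(e)
)
print(result)
```

    So the MD5 error says less about a wrong checksum than about the file not being where the check expects it.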

    As there are too many variables, I’d just start over by removing (as in deleting or renaming, not just emptying) both the cache (C:/Users/Admin/AppData/Local/torch/torch/Cache/) and the dataset root folder (C:/Users/Admin/.torch-datasets/), so that mnist_dataset() can re-create them from scratch.
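
    If you prefer base R over fs for the removal, something along these lines should work. It is a sketch that runs against a throwaway folder in tempdir() (demo_root is a stand-in name), so nothing real is deleted until you swap in your own paths.

```r
# Build a throwaway folder that mimics the dataset layout
demo_root <- file.path(tempdir(), "torch-datasets-demo")
dir.create(file.path(demo_root, "mnist", "raw"), recursive = TRUE)
file.create(file.path(demo_root, "mnist", "raw", "train-images-idx3-ubyte.gz"))

# recursive = TRUE removes the folder itself, not just its content
status <- unlink(demo_root, recursive = TRUE)

print(status)                 # 0 means success, 1 means failure
print(dir.exists(demo_root))  # should be FALSE after a successful removal
```

    A non-zero status, or a folder that still exists afterwards, would suggest something (an open R session, antivirus, a sync client) is holding on to the files.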

    You could also use as many dataset locations and copies as you like, e.g. mnist_dataset("datasets", ...) would set up, populate and use a new datasets/ folder in your current working directory. Though if it’s the corrupted cache folder that’s causing your issues, you’d probably end up in a similar state, so it’s better to start with no cache.


    Checking the torch package cache and the existing dataset, removing both, and creating a new MNIST dataset in the current working directory (D:/r/r-torch/):

    library(torchvision)
    
    dir <- "torch-datasets" # no tilde, torch-datasets/ in current wd
    
    # package cache content
    torch_cache <- rappdirs::user_cache_dir("torch")
    fs::dir_info(torch_cache, recurse = T) |> _[,c(1:3,5)]
    #> # A tibble: 5 × 4
    #>   path                                                                            type          size modification_time  
    #>   <fs::path>                                                                      <fct>     <fs::by> <dttm>             
    #> 1 C:/Users/margu/AppData/Local/torch/torch/Cache/mnist                            directory        0 2025-04-13 13:29:16
    #> 2 C:/Users/margu/AppData/Local/torch/torch/Cache/mnist/t10k-images-idx3-ubyte.gz  file         1.57M 2025-04-13 13:29:15
    #> 3 C:/Users/margu/AppData/Local/torch/torch/Cache/mnist/t10k-labels-idx1-ubyte.gz  file         4.44K 2025-04-13 13:29:16
    #> 4 C:/Users/margu/AppData/Local/torch/torch/Cache/mnist/train-images-idx3-ubyte.gz file         9.45M 2025-04-13 13:27:29
    #> 5 C:/Users/margu/AppData/Local/torch/torch/Cache/mnist/train-labels-idx1-ubyte.gz file         28.2K 2025-04-13 13:28:47
    
    # content of dataset in wd
    fs::dir_info(dir, recurse = T) |> _[,c(1:3,5)]
    #> # A tibble: 9 × 4
    #>   path                                                type             size modification_time  
    #>   <fs::path>                                          <fct>     <fs::bytes> <dttm>             
    #> 1 torch-datasets/mnist                                directory           0 2025-04-13 13:29:13
    #> 2 torch-datasets/mnist/processed                      directory           0 2025-04-13 13:29:27
    #> 3 torch-datasets/mnist/processed/test.rds             file            2.82M 2025-04-13 13:29:28
    #> 4 torch-datasets/mnist/processed/training.rds         file           16.91M 2025-04-13 13:29:27
    #> 5 torch-datasets/mnist/raw                            directory           0 2025-04-13 13:29:16
    #> 6 torch-datasets/mnist/raw/t10k-images-idx3-ubyte.gz  file            1.57M 2025-04-13 13:29:15
    #> 7 torch-datasets/mnist/raw/t10k-labels-idx1-ubyte.gz  file            4.44K 2025-04-13 13:29:16
    #> 8 torch-datasets/mnist/raw/train-images-idx3-ubyte.gz file            9.45M 2025-04-13 13:27:29
    #> 9 torch-datasets/mnist/raw/train-labels-idx1-ubyte.gz file            28.2K 2025-04-13 13:28:47
    
    # remove both 
    # ( commented out to avoid unintentional removal )
    # fs::dir_delete(c(torch_cache, dir))
    
    # check if indeed removed
    fs::file_exists(c(torch_cache, dir))
    #> C:\\Users\\margu\\AppData\\Local/torch/torch/Cache                                     torch-datasets 
    #>                                              FALSE                                              FALSE
    
    # re-create dataset
    valid_ds <- mnist_dataset(
      dir,
      download = TRUE,
      train = FALSE,
      transform = transform_to_tensor
    )
    #> Processing...
    #> Done!
    
    # check re-created cache and dataset
    torch_cache <- rappdirs::user_cache_dir("torch")
    fs::dir_info(torch_cache, recurse = T) |> _[,c(1:3,5)]
    #> # A tibble: 5 × 4
    #>   path                                                                            type          size modification_time  
    #>   <fs::path>                                                                      <fct>     <fs::by> <dttm>             
    #> 1 C:/Users/margu/AppData/Local/torch/torch/Cache/mnist                            directory        0 2025-04-13 14:06:37
    #> 2 C:/Users/margu/AppData/Local/torch/torch/Cache/mnist/t10k-images-idx3-ubyte.gz  file         1.57M 2025-04-13 14:06:37
    #> 3 C:/Users/margu/AppData/Local/torch/torch/Cache/mnist/t10k-labels-idx1-ubyte.gz  file         4.44K 2025-04-13 14:06:37
    #> 4 C:/Users/margu/AppData/Local/torch/torch/Cache/mnist/train-images-idx3-ubyte.gz file         9.45M 2025-04-13 14:06:36
    #> 5 C:/Users/margu/AppData/Local/torch/torch/Cache/mnist/train-labels-idx1-ubyte.gz file         28.2K 2025-04-13 14:06:36
    
    fs::dir_info(fs::path_abs(dir), recurse = T) |> _[,c(1:3,5)]
    #> # A tibble: 9 × 4
    #>   path                                                             type             size modification_time  
    #>   <fs::path>                                                       <fct>     <fs::bytes> <dttm>             
    #> 1 D:/r/r-torch/torch-datasets/mnist                                directory           0 2025-04-13 14:06:33
    #> 2 D:/r/r-torch/torch-datasets/mnist/processed                      directory           0 2025-04-13 14:06:44
    #> 3 D:/r/r-torch/torch-datasets/mnist/processed/test.rds             file            2.82M 2025-04-13 14:06:45
    #> 4 D:/r/r-torch/torch-datasets/mnist/processed/training.rds         file           16.91M 2025-04-13 14:06:44
    #> 5 D:/r/r-torch/torch-datasets/mnist/raw                            directory           0 2025-04-13 14:06:37
    #> 6 D:/r/r-torch/torch-datasets/mnist/raw/t10k-images-idx3-ubyte.gz  file            1.57M 2025-04-13 14:06:37
    #> 7 D:/r/r-torch/torch-datasets/mnist/raw/t10k-labels-idx1-ubyte.gz  file            4.44K 2025-04-13 14:06:37
    #> 8 D:/r/r-torch/torch-datasets/mnist/raw/train-images-idx3-ubyte.gz file            9.45M 2025-04-13 14:06:36
    #> 9 D:/r/r-torch/torch-datasets/mnist/raw/train-labels-idx1-ubyte.gz file            28.2K 2025-04-13 14:06:36
    
