I'd like to read a remote archive file with vroom and get an additional column with the file names instead of the archive name. Is this possible with vroom without the local archive_extract step shown in the example below?
Thank you
library(tidyverse)
library(archive)
library(vroom)
file <- "ftp://opendata.dwd.de/climate_environment/CDC/grids_germany/daily/regnie/ra2021m.tar"
test1 <- vroom_fwf(file, col_positions = fwf_widths(rep(4, 611)),
                   col_types = cols(.default = col_integer()),
                   na = "-999", id = "filename")
test1$filename %>% unique()
#> [1] "ftp://opendata.dwd.de/climate_environment/CDC/grids_germany/daily/regnie/ra2021m.tar"
my_dir <- fs::file_temp() %>% fs::dir_create()
archive_extract(file, dir = my_dir)
test2 <- fs::dir_ls(my_dir) %>%
  vroom_fwf(col_positions = fwf_widths(rep(4, 611)),
            col_types = cols(.default = col_integer()),
            na = "-999", id = "filename")
test2$filename %>% unique()
#> [1] ".../AppData/Local/Temp/Rtmp2TTpuI/filebfd82b6b1f6/ra210101.gz"
#> [2] ".../AppData/Local/Temp/Rtmp2TTpuI/filebfd82b6b1f6/ra210102.gz"
#> [3] ".../AppData/Local/Temp/Rtmp2TTpuI/filebfd82b6b1f6/ra210103.gz"
...
Created on 2022-07-25 by the reprex package (v2.0.1)
This is what the vroom vignette suggests:
Reading single files from multiple multi-file zip archives
If you are reading a zip file that contains multiple files with the same format, you can use a wrapper function like this:
read_all_zip <- function(file, ...) {
  filenames <- unzip(file, list = TRUE)$Name
  vroom(purrr::map(filenames, ~ unz(file, .x)), ...)
}
Adapted to your use case, this gives something like:
read_all_tar_remote_v1 <- function(file) {
  con <- file(file, open = "rb")
  filenames <- untar(con, list = TRUE)
  close(con)
  df <- purrr::map(filenames, ~ vroom_fwf(archive_read(file, file = .x, format = 'tar'),
                                          col_positions = fwf_widths(rep(4, 611)),
                                          col_types = cols(.default = col_integer()),
                                          na = "-999", id = "filename", guess_max = 2000))
  df
}
read_all_tar_remote_v1(file)
However, this is slow (and crashes more often than not with my poor internet connection) because, as mentioned here, untar
needs to read the whole archive in order to get the file names.
this does download the whole archive, since untar needs to read the whole file to see what's in it. There is no master directory in a tar file for untar to read; each file always has its own 512-byte header block. You don't need to save it to your hard disk to read the directory, but it may be just as easy to do so.
Hence, I suppose, your question.
One way to avoid this is to use archive_read
with an index position.
read_all_tar_remote_v2 <- function(file) {
  df <- purrr::map(1:365, ~ vroom_fwf(archive_read(file, file = .x, format = 'tar'),
                                      col_positions = fwf_widths(rep(4, 611)),
                                      col_types = cols(.default = col_integer()),
                                      na = "-999", id = "filename", guess_max = 2000))
  df
}
This does not give you the exact file names, however, only an index that lets you tell the entries apart. That is the only improvement over your current implementation.
mylist <- read_all_tar_remote_v2(file)
mylist[[1]]$filename %>% unique
[1] "archive_read(ftp://opendata.dwd.de/climate_environment/CDC/grids_germany/daily/regnie/ra2021m.tar)[1]"
Since you might not know the number of files prior to reading, you may want to include error management in your function.
read_all_tar_remote_v3 <- function(file, maxfiles = 10000) {
  mylist <- list()
  for (i in 1:maxfiles) {
    print(paste('reading file', i, '/', maxfiles))
    # error handling: stop once archive_read() runs past the last entry
    possibleError <- tryCatch({
      mydf <- vroom_fwf(archive_read(file, file = i, format = 'tar'),
                        col_positions = fwf_widths(rep(4, 611)),
                        col_types = cols(.default = col_integer()),
                        na = "-999", id = "filename", guess_max = 2000)
      mylist[[i]] <- mydf
    },
    error = function(e) e
    )
    if (inherits(possibleError, "error")) {
      break
    }
  }
  return(mylist)
}
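For completeness, a sketch of how you might call it and stack the results, assuming all entries share the same layout (the "entry" column name is my choice):
# Read every entry, then bind the pieces, keeping the list position as an id.
mylist <- read_all_tar_remote_v3(file)
combined <- dplyr::bind_rows(mylist, .id = "entry")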
Is this faster or better than your current approach? I'll let you decide, but I wouldn't say so.
I would keep extracting the individual files, as reading the file names without downloading the whole archive unfortunately seems to be a limitation of the tar format.
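If you do stay with extraction, a small wrapper keeps the temporary-directory handling in one place. This is just your test2 approach packaged up; the function name and the decision to delete the temp directory on exit are mine, not anything vroom or archive require.
# Sketch: extract the remote tar to a temp dir, read all members at once,
# and clean up afterwards. The filename column then holds the real file names
# (as temp paths).
read_all_tar_local <- function(file) {
  my_dir <- fs::dir_create(fs::file_temp())
  on.exit(fs::dir_delete(my_dir), add = TRUE)
  archive::archive_extract(file, dir = my_dir)
  vroom_fwf(fs::dir_ls(my_dir),
            col_positions = fwf_widths(rep(4, 611)),
            col_types = cols(.default = col_integer()),
            na = "-999", id = "filename")
}
You could pipe the result through dplyr::mutate(filename = basename(filename)) if you only want the bare names such as ra210101.gz rather than the full temp paths.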