r xbrl

Parse multiple XBRL files stored in a zip file


I have downloaded multiple zip files from a website. Each zip file contains many files with .html and .xml extensions (about 100K files per zip).

It is possible to extract the files manually and then parse them. However, I would like to do this from within R, if possible.

Example file (sorry, it is a bit big). Using code from a previous question, this downloads one zip file:

library(XML)

pth <- "http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html"
doc <- htmlParse(pth)

myfiles <- doc["//a[contains(text(),'Accounts_Monthly_Data')]", fun = xmlAttrs][[1]]
fileURLS <- file.path("http://download.companieshouse.gov.uk", myfiles) [[1]]

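# Create temp/hmrcCache; the second argument of dir.create() is showWarnings, not a subdirectory
dir.create(file.path("temp", "hmrcCache"), recursive = TRUE)
# mode = "wb" requests a binary transfer so the zip is not corrupted on Windows
download.file(fileURLS, destfile = file.path("temp", myfiles), mode = "wb")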

I can parse the files using the XBRL package if I extract them manually. This can be done as follows:

library(XBRL)     
inst <- file.path("temp", "Prod224_0004_00000121_20130630.html")
out <- xbrlDoAll(inst, cache.dir="temp/hmrcCache", prefix.out=NULL, verbose=TRUE)

I am struggling with how to extract these files from the zip file and parse each one (say, in a loop) using R, without extracting them manually. I made a start, but I don't know how to progress from here. Thanks for any advice.

# Get names of files
lst <- unzip(file.path("temp", myfiles), list=TRUE)
dim(lst) # 118626

# Create a connection to the first file inside the zip (nothing is extracted to disk yet)
nms <- lst$Name[1] # Prod224_0004_00000121_20130630.html
lst2 <- unz(file.path("temp", myfiles), filename=nms)

I am using Windows 8.1

R version 3.1.2 (2014-10-31)

Platform: x86_64-w64-mingw32/x64 (64-bit)


Solution

  • Using the suggestion from Karsten in the comments, I unzipped the files to a temporary directory, and then parsed each file. I used the snow package to speed things up.

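      library(snow)  # provides makeCluster(), clusterCall(), parLapply(), stopCluster()

      # Parse one zip file to start ("temp" is the download directory from the question)
      fls <- list.files("temp", pattern = "\\.zip$")[[1]]
    
      # Unzip into a temporary directory; unzip() returns the extracted file paths
      tmp <- tempdir()
      lst <- unzip(file.path("temp", fls), exdir=tmp)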
    
      # Only parse first 10 records
      inst <- lst[1:10]
          
      # Start to parse - in parallel
      cl <- makeCluster(parallel::detectCores())
      clusterCall(cl, function() library(XBRL))
      
      # Start
      st <- Sys.time()
      
      out <- parLapply(cl, inst, function(i) 
                                      xbrlDoAll(i, 
                                                cache.dir="temp/hmrcCache", 
                                                prefix.out=NULL, verbose=TRUE) )
      
      stopCluster(cl)
      
      Sys.time() - st
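
Extracting all ~100K files at once can use a lot of disk space. A possible variation, sketched here under assumptions rather than taken from the original answer, is to extract and parse the archive in batches using the `files` argument of `unzip()`, deleting each batch once it has been parsed. The names `make_batches`, `process_zip_in_batches`, `parse_fun` and `size` are hypothetical and chosen only for illustration.

```r
# Hypothetical helper: split member names into consecutive batches of at most `size`.
make_batches <- function(x, size) {
  unname(split(x, ceiling(seq_along(x) / size)))
}

# Hypothetical helper: extract one batch at a time, apply `parse_fun` to each
# extracted file, then delete the batch before extracting the next one.
process_zip_in_batches <- function(zipfile, parse_fun, size = 1000) {
  members <- unzip(zipfile, list = TRUE)$Name
  exdir <- tempfile("xbrl_")
  dir.create(exdir)
  on.exit(unlink(exdir, recursive = TRUE), add = TRUE)

  results <- list()
  for (batch in make_batches(members, size)) {
    # unzip() returns the paths of the files it extracted
    paths <- unzip(zipfile, files = batch, exdir = exdir)
    # Keep a failure as an error object instead of aborting the whole run
    res <- lapply(paths, function(p) tryCatch(parse_fun(p), error = function(e) e))
    names(res) <- basename(paths)
    results <- c(results, res)
    unlink(paths)
  }
  results
}

# Example call (assumes the XBRL package is loaded and `fls` is defined as above):
# out <- process_zip_in_batches(
#   file.path("temp", fls),
#   function(p) xbrlDoAll(p, cache.dir = "temp/hmrcCache", prefix.out = NULL, verbose = FALSE),
#   size = 500
# )
```

The `lapply()` inside the loop could be swapped for `parLapply(cl, paths, ...)` to combine batching with the cluster used above.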