
How to use R to identify the most up-to-date downloadable file on a web page?


I am trying to check periodically for the date of the latest downloadable file added to the page https://github.com/mrc-ide/global-lmic-reports/tree/master/data, where the file names look like 2021-05-22_v8.csv.zip.

There is a code snippet in "Using R to scrape the link address of a downloadable file from a web page?" that, with a small tweak, identifies the first (earliest) downloadable file on the page, as shown below.

library(rvest)
library(stringr)
library(xml2)

page <- read_html("https://github.com/mrc-ide/global-lmic-reports/tree/master/data")

page %>%
  html_nodes("a") %>%       # find all links
  html_attr("href") %>%     # get the url
  str_subset("\\.csv\\.zip$") %>% # find those that end in .csv.zip
  .[[1]]                    # look at the first one

Returns: [1] "/mrc-ide/global-lmic-reports/blob/master/data/2020-04-28_v1.csv.zip"

The question is: what code would identify the date of the latest .csv.zip file? For example, 2021-05-22_v8.csv.zip, as checked on 2021-06-01.

The purpose is this: if that date (i.e., 2021-05-22) is later than the latest update I have created in https://github.com/pourmalek/covir2 (e.g., IMPE 20210522 in https://github.com/pourmalek/covir2/tree/main/20210528), then a new update needs to be created.


Solution

  • You can convert the date prefix of each file name to a Date and use which.max to get the latest link.

    library(rvest)
    library(stringr)
    library(xml2)
    
    page <- read_html("https://github.com/mrc-ide/global-lmic-reports/tree/master/data")
    
    page %>%
      html_nodes("a") %>%       # find all links
      html_attr("href") %>%     # get the url
      str_subset("\\.csv\\.zip$") -> tmp # find those that end in .csv.zip
    
    tmp[tmp %>%
      basename() %>%
      substr(1, 10) %>%
      as.Date() %>% which.max()]
    
    #[1] "/mrc-ide/global-lmic-reports/blob/master/data/2021-05-22_v8.csv.zip"
    

    To get just the latest date, you can use:

    tmp %>%
      basename() %>%
      substr(1, 10) %>%
      as.Date() %>% max()
    
    #[1] "2021-05-22"
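
For the update check described in the question, a minimal sketch of the comparison step might look like the following. It runs offline on a hard-coded vector of the two links quoted in this thread, standing in for the scraped result. The helper name latest_remote_date and the variable last_local_update are hypothetical, and the local date is assumed to come from the "IMPE 20210522" label.

```r
# Stand-in for the scraped links (the two hrefs quoted in this thread).
hrefs <- c(
  "/mrc-ide/global-lmic-reports/blob/master/data/2020-04-28_v1.csv.zip",
  "/mrc-ide/global-lmic-reports/blob/master/data/2021-05-22_v8.csv.zip"
)

# Hypothetical helper: parse the leading YYYY-MM-DD of each file name
# and return the most recent date. Names that do not parse become NA
# and are ignored by max().
latest_remote_date <- function(hrefs) {
  dates <- as.Date(substr(basename(hrefs), 1, 10), format = "%Y-%m-%d")
  max(dates, na.rm = TRUE)
}

# Assumption: the date of the last local update, taken from "IMPE 20210522".
last_local_update <- as.Date("20210522", format = "%Y%m%d")

remote <- latest_remote_date(hrefs)

# TRUE only when the remote file is strictly newer than the local update.
needs_update <- remote > last_local_update
needs_update
```

With these inputs both dates are 2021-05-22, so needs_update is FALSE. In a periodic check, hrefs would be replaced by the tmp vector from the scraping step above.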