htmlcssrweb-scraping

Using R to get download URL by link name


I'm trying to use rvest to download a list of files from this site. The file names are regular, but the download URLs don't match a pattern (just dozens of digits), so I can't construct a list of download URLs based on any criteria. How can I use link names to download the actual files?

So far, I can get a list of the files of interest (based on CSS selector), and I can get a list of all the links on the site, but I'm not sure how to match them up. I'll need to be able to check the site for changes and download any files with changed names, so using the file name to access the file is important. I'm not very familiar with HTML/CSS, so that might be why I can't figure out this possibly-simple task.

library(rvest)

# url with list of download files

url <- "http://www-air.larc.nasa.gov/cgi-bin/ArcView/actamerica.2016?C130=1"
doc <- read_html(url)

# getting everything within the CSS selector "td a"

all <- html_text(html_nodes(doc, "td a"))

# getting list of certain file names

filetype <- "PICARRO"
files <- all[grep(filetype, all)]

# this returns a list of all links on the page, 
# but I'm not sure how to match the links up with their names

html_attr(html_nodes(doc, "a"), "href")

Thank you in advance for any help.


Solution

  • Slightly different approach.

    Grab all downloadable filenames and URLs:

    library(httr)
    library(rvest)
    library(tidyverse)
    
    pg <- read_html("http://www-air.larc.nasa.gov/cgi-bin/ArcView/actamerica.2016?C130=1")
    
    fils <- html_nodes(pg, xpath=".//a[contains(@href, 'cgi-bin/enzFile')]")
    
    data_frame(
      filename = html_text(fils),
      link = sprintf("http://www-air.larc.nasa.gov%s", html_attr(fils, "href"))
    ) -> xdf
    
    glimpse(xdf)
    ## Observations: 719
    ## Variables: 2
    ## $ filename <chr> "ACTAMERICA-Elevation_C130_20160711_R0.ict", "ACTAMERICA-Elevation_C130_20160715_R0.ict", "ACTAMERI...
    ## $ link     <chr> "http://www-air.larc.nasa.gov/cgi-bin/enzFile?f49DA0512C4E81E3C01FDB44A33CD88AAFE2f7075622d6169722f...
    
    xdf
    ## # A tibble: 719 x 2
    ## filename                                                                                                                                                                                                                                                                           link
    ## <chr>                                                                                                                                                                                                                                                                          <chr>
    ## 1 ACTAMERICA-Elevation_C130_20160711_R0.ict http://www-air.larc.nasa.gov/cgi-bin/enzFile?f49DA0512C4E81E3C01FDB44A33CD88AAFE2f7075622d6169722f414354414d45524943412f323031362f433133305f41495243524146542f444947414e47492e4a4f534855412f414354414d45524943412d456c65766174696f6e5f433133305f32303136303731315f52302e696374
    ## 2 ACTAMERICA-Elevation_C130_20160715_R0.ict http://www-air.larc.nasa.gov/cgi-bin/enzFile?f49DA0512C4E81E3C01FDB44A33CD88AAFE2f7075622d6169722f414354414d45524943412f323031362f433133305f41495243524146542f444947414e47492e4a4f534855412f414354414d45524943412d456c65766174696f6e5f433133305f32303136303731355f52302e696374
    ## 3 ACTAMERICA-Elevation_C130_20160718_R0.ict http://www-air.larc.nasa.gov/cgi-bin/enzFile?f49DA0512C4E81E3C01FDB44A33CD88AAFE2f7075622d6169722f414354414d45524943412f323031362f433133305f41495243524146542f444947414e47492e4a4f534855412f414354414d45524943412d456c65766174696f6e5f433133305f32303136303731385f52302e696374
    ## 4 ACTAMERICA-Elevation_C130_20160719_R0.ict http://www-air.larc.nasa.gov/cgi-bin/enzFile?f49DA0512C4E81E3C01FDB44A33CD88AAFE2f7075622d6169722f414354414d45524943412f323031362f433133305f41495243524146542f444947414e47492e4a4f534855412f414354414d45524943412d456c65766174696f6e5f433133305f32303136303731395f52302e696374
    ## 5 ACTAMERICA-Elevation_C130_20160721_R0.ict http://www-air.larc.nasa.gov/cgi-bin/enzFile?f49DA0512C4E81E3C01FDB44A33CD88AAFE2f7075622d6169722f414354414d45524943412f323031362f433133305f41495243524146542f444947414e47492e4a4f534855412f414354414d45524943412d456c65766174696f6e5f433133305f32303136303732315f52302e696374
    ## 6 ACTAMERICA-Elevation_C130_20160722_R0.ict http://www-air.larc.nasa.gov/cgi-bin/enzFile?f49DA0512C4E81E3C01FDB44A33CD88AAFE2f7075622d6169722f414354414d45524943412f323031362f433133305f41495243524146542f444947414e47492e4a4f534855412f414354414d45524943412d456c65766174696f6e5f433133305f32303136303732325f52302e696374
    ## 7 ACTAMERICA-Elevation_C130_20160725_R0.ict http://www-air.larc.nasa.gov/cgi-bin/enzFile?f49DA0512C4E81E3C01FDB44A33CD88AAFE2f7075622d6169722f414354414d45524943412f323031362f433133305f41495243524146542f444947414e47492e4a4f534855412f414354414d45524943412d456c65766174696f6e5f433133305f32303136303732355f52302e696374
    ## 8 ACTAMERICA-Elevation_C130_20160726_R0.ict http://www-air.larc.nasa.gov/cgi-bin/enzFile?f49DA0512C4E81E3C01FDB44A33CD88AAFE2f7075622d6169722f414354414d45524943412f323031362f433133305f41495243524146542f444947414e47492e4a4f534855412f414354414d45524943412d456c65766174696f6e5f433133305f32303136303732365f52302e696374
    ## 9 ACTAMERICA-Elevation_C130_20160727_R0.ict http://www-air.larc.nasa.gov/cgi-bin/enzFile?f49DA0512C4E81E3C01FDB44A33CD88AAFE2f7075622d6169722f414354414d45524943412f323031362f433133305f41495243524146542f444947414e47492e4a4f534855412f414354414d45524943412d456c65766174696f6e5f433133305f32303136303732375f52302e696374
    ## 10 ACTAMERICA-Elevation_C130_20160801_R0.ict http://www-air.larc.nasa.gov/cgi-bin/enzFile?f49DA0512C4E81E3C01FDB44A33CD88AAFE2f7075622d6169722f414354414d45524943412f323031362f433133305f41495243524146542f444947414e47492e4a4f534855412f414354414d45524943412d456c65766174696f6e5f433133305f32303136303830315f52302e696374
    ## # ... with 709 more rows
    

    Get the ones you care about:

    picarro <- filter(xdf, grepl("PICARRO", filename))
    

    Download them:

    walk2(picarro$link, picarro$filename, download.file)
    ## trying URL 'http://www-air.larc.nasa.gov/cgi-bin/enzFile?f49DA0512C4E81E3C01FDB44A33CD88AAFE2f7075622d6169722f414354414d45524943412f323031362f433133305f41495243524146542f444947414e47492e4a4f534855412f414354414d45524943412d5049434152524f5f433133305f32303136303532375f52422e696374'
    ## Content type 'text/plain' length 1023662 bytes (999 KB)
    ## ==================================================
    ##   downloaded 999 KB
    ## 
    ## trying URL 'http://www-air.larc.nasa.gov/cgi-bin/enzFile?f49DA0512C4E81E3C01FDB44A33CD88AAFE2f7075622d6169722f414354414d45524943412f323031362f433133305f41495243524146542f444947414e47492e4a4f534855412f414354414d45524943412d5049434152524f5f433133305f32303136303731315f52302e696374'
    ## Content type 'text/plain' length 886392 bytes (865 KB)
    ## ==================================================
    ##   downloaded 865 KB
    ## 
    ## trying URL 'http://www-air.larc.nasa.gov/cgi-bin/enzFile?f49DA0512C4E81E3C01FDB44A33CD88AAFE2f7075622d6169722f414354414d45524943412f323031362f433133305f41495243524146542f444947414e47492e4a4f534855412f414354414d45524943412d5049434152524f5f433133305f32303136303731355f52302e696374'
    ## Content type 'text/plain' length 530339 bytes (517 KB)
    ## ==================================================
    ##   downloaded 517 KB
    

    etc.