Tags: r, download, rcurl

Validate data from website before downloading in R


I have a bunch of weather data files I want to download, but there's a mix of URLs that have data and URLs that don't. I'm using the download.file function in R to download the text files, which works fine, except that I'm also downloading a lot of empty text files, because every URL is valid even when no data is present.

For example, this URL provides good data.

http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST&YEAR=2021&MONTH=12&FROM=3000&TO=3000&STNM=72645

But this one doesn't.

http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST&YEAR=1970&MONTH=12&FROM=3000&TO=3000&STNM=72645
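
The download step is roughly like this (a simplified sketch; the real script builds the URL and the output file name from my station and date lists):

    # the 1970 URL downloads without error even though the page contains no data
    url <- "http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST&YEAR=1970&MONTH=12&FROM=3000&TO=3000&STNM=72645"
    download.file(url, "sounding_72645_197012.txt")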

Is there a way to check whether there's valid data in the text file before I download it? I've looked for something in the RCurl package, but didn't see what I needed. Thank you.


Solution

  • You can use httr::HEAD to determine the size of the data before downloading it. Note that while this saves you the "pain" of downloading, any cost on the server side is still incurred: the server feels the query-pain even if you never download the result. (These two queries seem quick enough that it's probably not a problem.)

    # good data
    res1 <- httr::HEAD("http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST&YEAR=2021&MONTH=12&FROM=3000&TO=3000&STNM=72645")
    httr::headers(res1)$`content-length`
    # [1] "9435"
    
    # no data
    res2 <- httr::HEAD("http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST&YEAR=1970&MONTH=12&FROM=3000&TO=3000&STNM=72645")
    httr::headers(res2)$`content-length`
    # NULL
    

    If the API provides a function for estimating size (or at least the presence of data), it might be nicer to the remote end to use that instead of this technique. For example, suppose an API call requires a 20-second SQL query. A call to HEAD will take 20 seconds, just like a call to GET; the only difference is that you don't get the data. If you see that you will get data and subsequently call httr::GET(.), you'll wait another 20 seconds (unless the remote end is caching queries).

    Alternatively, the service may have a heuristic for the presence of data, perhaps just a simple yes/no, that takes only a few seconds. In that case, it would be much "nicer" of you to make a 3-second "is data present" call before issuing the 20-second full query.

    Bottom line: if the API has a "data size" estimator, use it; otherwise HEAD should work fine.
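
    Applied to the original problem, a minimal sketch of filtering with HEAD before downloading might look like this (the urls and dest vectors are placeholders for however you build your URL and file-name lists):

    urls <- c(
      "http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST&YEAR=2021&MONTH=12&FROM=3000&TO=3000&STNM=72645",
      "http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST&YEAR=1970&MONTH=12&FROM=3000&TO=3000&STNM=72645"
    )
    dest <- c("sounding_2021.txt", "sounding_1970.txt")   # placeholder file names
    for (i in seq_along(urls)) {
      len <- httr::headers(httr::HEAD(urls[i]))$`content-length`
      if (!is.null(len)) {
        download.file(urls[i], dest[i])   # only download pages that report a body
      } else {
        message("Skipping (no data): ", urls[i])
      }
    }

    Keep in mind this makes two requests per kept URL (the HEAD, then the download), which is the server-cost trade-off discussed above.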

    As an alternative to HEAD, just GET the data, check the content-length, and save to file only if found:

    # retrieve the page, then save it only if the response reports a body
    res1 <- httr::GET("http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST&YEAR=2021&MONTH=12&FROM=3000&TO=3000&STNM=72645")
    if (!is.null(httr::headers(res1)$`content-length`)) {
      stuff <- as.character(httr::content(res1))
      writeLines(stuff, "somefile.html")
    }
    # or do something else with the results, in-memory
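
    Wrapped into the same loop, the GET approach makes only one request per URL instead of two (no separate HEAD). This sketch reuses the placeholder urls and dest vectors from the HEAD example above:

    for (i in seq_along(urls)) {
      res <- httr::GET(urls[i])
      if (!is.null(httr::headers(res)$`content-length`)) {
        # save the content already retrieved rather than downloading it again
        writeLines(as.character(httr::content(res)), dest[i])
      }
    }

    The trade-off is that every page, empty or not, is pulled into memory before the check.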