I have a bunch of weather data files I want to download, but the URLs are a mix of pages that have data and pages that don't. I'm using the download.file function in R to download the text files, which is working fine, but I'm also downloading a lot of empty text files, because all the URLs are valid even if no data is present.
For example, this url provides good data.
http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST&YEAR=2021&MONTH=12&FROM=3000&TO=3000&STNM=72645
But this one doesn't.
http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST&YEAR=1970&MONTH=12&FROM=3000&TO=3000&STNM=72645
Is there a way to check whether there's valid data in the text file before I download it? I've looked for something in the RCurl package, but didn't see what I needed. Thank you.
You can use httr::HEAD to determine the content size before downloading it. Note that this only saves you the "pain" of downloading; if the query has any cost on the server side, the server still pays that cost even if you never download the result. (These two queries seem quick enough that it's probably not a problem here.)
# good data
res1 <- httr::HEAD("http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST&YEAR=2021&MONTH=12&FROM=3000&TO=3000&STNM=72645")
httr::headers(res1)$`content-length`
# [1] "9435"
# no data
res2 <- httr::HEAD("http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST&YEAR=1970&MONTH=12&FROM=3000&TO=3000&STNM=72645")
httr::headers(res2)$`content-length`
# NULL
If the API provides a way to estimate the size (or at least the presence) of the data, it might be nicer to the remote end to use that instead of this technique. For example, suppose an API call requires a 20-second SQL query. A call to HEAD will take 20 seconds, just like a call to GET; the only difference is that you don't receive the data. If you see that data is present and then subsequently call httr::GET(.), you'll wait another 20 seconds (unless the remote end caches queries).
Alternatively, the API may offer a cheap heuristic for the presence of data, perhaps just a simple yes/no that takes only a few seconds. In that case, it would be much "nicer" of you to make a 3-second "is data present" call before making the 20-second full query call.
Bottom line: if the API has a "data size" estimator, use it; otherwise HEAD should work fine.
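For your batch-download use case, the check can be wrapped in a small helper. This is a sketch, not a definitive implementation: download_if_nonempty is a made-up name, urls/destfiles are placeholders for your own vectors, and it assumes (as the examples above suggest) that this server omits the content-length header exactly when there is no data.

```r
library(httr)

# Sketch: HEAD the URL first, and only call download.file if the
# server reports a content-length (i.e., data appears to be present).
download_if_nonempty <- function(url, destfile) {
  res <- httr::HEAD(url)
  len <- httr::headers(res)$`content-length`
  if (!is.null(len)) {
    download.file(url, destfile, quiet = TRUE)
    TRUE   # downloaded
  } else {
    FALSE  # skipped, no data reported
  }
}

# usage (urls and destfiles are your own character vectors):
# ok <- mapply(download_if_nonempty, urls, destfiles)
```

Each call still makes two requests (HEAD, then GET inside download.file) for URLs that do have data, which is the trade-off discussed above.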
As an alternative to HEAD, just GET the data, check the content-length header, and save to file only if it is found:
res1 <- httr::GET("http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST&YEAR=2021&MONTH=12&FROM=3000&TO=3000&STNM=72645")
stuff <- as.character(httr::content(res1))
if (!is.null(httr::headers(res1)$`content-length`)) {
  writeLines(stuff, "somefile.html")
}
# or do something else with the results, in-memory