[SOLVED] R to change the values in html form and scrape web data

I would like to scrape the historical weather data from this page http://www.weather.gov.sg/climate-historical-daily.

I am using the code from the linked question, "Using r to navigate and scrape a webpage with drop down html forms".

However, I am not able to get the data, probably because the structure of the page has changed. The code from the above link used pgform <- html_form(pgsession)[[3]] to change the values of the form, but I was not able to find a similar form in my case.

library(rvest)

url <- "http://www.weather.gov.sg/climate-historical-daily"
pgsession <- html_session(url)   # Renamed to session() in rvest >= 1.0.0.
pgsource <- read_html(url)
pgform <- html_form(pgsession)   # Lists every form found on the page.

Result in my case:

> pgform
[[1]]
<form> 'searchform' (GET http://www.weather.gov.sg/)
<button submit> '<unnamed>
<input text> 's':

Solution

Since the page has a CSV download button and the links it provides follow a pattern, you can generate and download a set of URLs. You'll need a set of the station IDs, which you can scrape from the dropdown itself:

library(rvest)

page <- 'http://www.weather.gov.sg/climate-historical-daily' %>% read_html()

station_id <- page %>% html_nodes('button#cityname + ul a') %>% 
    html_attr('onclick') %>%    # If you need names, grab the `href` attribute, too.
    sub(".*'(.*)'.*", '\\1', .)
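To see what the sub() call is doing, here is a minimal sketch on a single string. The onclick value below is hypothetical (the page's actual attribute may look different); the point is only that the pattern keeps whatever sits between the last pair of single quotes.

```r
# Hypothetical onclick value; the real attribute on the page may differ.
onclick <- "setStation('ID1'); return false;"

# The leading `.*` is greedy, so the capture group ends up holding the
# contents of the last single-quoted string in the attribute.
station <- sub(".*'(.*)'.*", '\\1', onclick)
print(station)
```

If the attribute ever contained more than one quoted argument, this pattern would return only the last one, so it is worth printing a few scraped values before relying on them.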

which can then be put into expand.grid with the months and years to generate all the necessary combinations:

df <- expand.grid(station_id = station_id,   # Name the column so df$station_id exists below.
                  month = sprintf('%02d', 1:12),
                  year = 2014:2016)

(Note that if you want 2017 data, you'll need to construct those combinations separately and rbind them, so as not to generate months that haven't happened yet.)
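A rough sketch of that separate construction might look like the following. The station IDs are placeholders and last_month_2017 is an assumed cutoff, so substitute the scraped station_id vector and whichever month is actually the latest one published.

```r
# Placeholder station IDs; in practice use the `station_id` vector scraped above.
station_id <- c('ID1', 'ID2')

# Assumption: data for January through March of 2017 has been published.
last_month_2017 <- 3

# Complete years: every month for every station.
past <- expand.grid(station_id = station_id,
                    month = sprintf('%02d', 1:12),
                    year = 2014:2016,
                    stringsAsFactors = FALSE)

# Partial year: only the months that have already happened.
current <- expand.grid(station_id = station_id,
                       month = sprintf('%02d', seq_len(last_month_2017)),
                       year = 2017L,
                       stringsAsFactors = FALSE)

df_all <- rbind(past, current)
```

Both grids use the same column names and types, which is what lets rbind stack them without complaint.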

The combinations can then be paste0ed into URLs:

urls <- paste0('http://www.weather.gov.sg/files/dailydata/DAILYDATA_', 
               df$station_id, '_', df$year, df$month, '.csv')

which can be lapplyed across to download all the files:

# Warning! This will download a lot of files! Make sure you're in a clean directory.
# method = 'curl' requires the curl command-line tool to be installed.
lapply(urls, function(url){download.file(url, basename(url), method = 'curl')})
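Some station and month combinations may not have a file, in which case a single failed download could stop the whole loop. One possible way around that, sketched here rather than tested against the site, is to wrap each download in tryCatch. safe_download and its downloader argument are hypothetical names introduced for this example.

```r
# Download one URL into `dest_dir`; return TRUE on success, FALSE on failure.
# `downloader` defaults to download.file and is a parameter only so the
# function can be exercised without network access.
safe_download <- function(url, dest_dir = '.', downloader = download.file) {
    dest <- file.path(dest_dir, basename(url))
    tryCatch({
        status <- downloader(url, dest, quiet = TRUE)
        identical(as.integer(status), 0L)   # download.file returns 0 on success
    }, error = function(e) {
        unlink(dest)                        # remove any partial file
        message('Failed: ', url)
        FALSE
    })
}

# Possible usage with the `urls` vector built earlier:
# ok <- vapply(urls, safe_download, logical(1))
# urls[!ok]   # combinations that had no file or could not be fetched
```

Collecting the logical results makes it easy to retry or inspect only the URLs that failed.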