rcsvdownload

How to download and read .csv file (daily streamflow) from Water Survey of Canada website?


I'm trying to have R automatically download and read a .csv file (daily streamflow) from Water Survey of Canada website to a local directory. To do this manually, I have to click on the "Download?" button at this link. And then, I need to click "Comma Separated Values" under "Discharge (daily mean values)".

From the link above, it is clear that I can modify "stn" (station number), "startDate", and "endDate" to obtain data from different stations and periods.

However, when I click the "Download?" button, the link will change to https://wateroffice.ec.gc.ca/download/index_e.html?results_type=real_time, which does not include any station information. I wonder how to download and read the "Discharge (daily mean values)" .csv file for different stations and periods?

library(httr)
library(readr)

base_url <- "https://wateroffice.ec.gc.ca/report/real_time_e.html?stn=07BJ001&mode=Table&startDate=2024-06-20&endDate=2024-07-23&prm1=46&y1Max=&y1Min=&prm2=47&y2Max=&y2Min=#wb-auto-4"
response <- GET(base_url)

Solution

  • The station and dates you have selected in the first step are not passed via the URL as you correctly observed, but are "remembered" at the server. You got a session cookie when you first selected station and date and via this cookie the server can retrieve your selection (I am by no means a WebDev, so excuse my lack of proper technical terms).

    That is, all we have to do is to simulate this process in the script as well:

    1. We make the first request with station number and dates, but we take care to store the cookie.
    2. Then, we call the download link and make sure we pass this very cookie, such that the server recognizes our previous request.
    3. Eventually we get a zip back, which we can store and further process
    library(httr2)
    library(glue)
    library(readr)
    
    from <- as.Date("2024-06-20")
    to <- as.Date("2024-07-23")
    stn <- "07BJ001"
    
    select_url <- glue("https://wateroffice.ec.gc.ca/report/real_time_e.html?",
                       "stn={stn}&mode=Table&startDate={format(from, '%Y-%m%-%d')}&",
                       "endDate={format(to, '%Y-%m%-%d')}&",
                       "prm1=46&y1Max=&y1Min=&prm2=47&y2Max=&y2Min=#wb-auto-4")
    download_url <- paste0("https://wateroffice.ec.gc.ca/download/report_e.html?",
                           "dt=6&df=csv&ext=zip")
    
    cookie_path <- tempfile()
    result_path <- tempfile(fileext = ".zip")
    
    select_req <- request(select_url) |>
      req_cookie_preserve(cookie_path) |>
      req_perform() 
    
    download_req <- request(download_url) |>
      req_cookie_preserve(cookie_path) |>
      req_perform()
    
    download_req |>
      resp_body_raw() |>
      writeBin(result_path)
    
    td <- tempdir()
    fn <- unzip(result_path, list = TRUE)[, "Name"]
    unzip(result_path, exdir = td)
    
    (dat <- read_csv(file.path(td, fn),
             skip = 9))
                                                                                 
    # Rows: 34 Columns: 3
    # ── Column specification ────────────────────────────────────────────────────────
    # Delimiter: ","
    # dbl  (2): Parameter, Value (m³/s)
    # dttm (1): Date (MST)
    
    # ℹ Use `spec()` to retrieve the full column specification for this data.
    # ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    # # A tibble: 34 × 3
    #    `Date (MST)`        Parameter `Value (m³/s)`
    #    <dttm>                  <dbl>          <dbl>
    #  1 2024-06-20 00:00:00         6          19.7 
    #  2 2024-06-21 00:00:00         6          15.7 
    #  3 2024-06-22 00:00:00         6          13   
    #  4 2024-06-23 00:00:00         6          11.1 
    #  5 2024-06-24 00:00:00         6           9.93
    #  6 2024-06-25 00:00:00         6          15.5 
    #  7 2024-06-26 00:00:00         6          21.3 
    #  8 2024-06-27 00:00:00         6         103   
    #  9 2024-06-28 00:00:00         6         165   
    # 10 2024-06-29 00:00:00         6          80.6 
    # # ℹ 24 more rows
    # # ℹ Use `print(n = ...)` to see more rows