
RSelenium -> click checkbox files


I would like to automatically download multiple files at once from this webpage: http://alertario.rio.rj.gov.br/download/dados-pluviometricos/

I am currently following this tutorial: https://www.youtube.com/watch?v=BK_JBk_l5uQ and its accompanying script: https://github.com/ggSamoora/TutorialsBySamoora/blob/main/R_downloader_Tutorial.R

However, before I can download anything, I need to select a few fields on the page (the checkboxes in the download form).

Could someone help me automate that?

What I have so far:

#install.packages("RSelenium")
#install.packages("netstat")
#install.packages("binman")

# load the necessary packages
library(tidyverse)
library(RSelenium)
library(netstat)

binman::list_versions("geckodriver")
# "0.32.1" "0.32.2" "0.33.0"

# connecting to selenium server
rs_driver_object <- rsDriver(browser = 'firefox',
                             port = free_port())

# access the client object
remDr <- rs_driver_object$client

# open a web browser
remDr$open()

# navigate to the website containing the database
remDr$navigate("http://alertario.rio.rj.gov.br/download/dados-pluviometricos/")

The goal is to download all the data available on this page for a research project.


Solution

  • This particular problem doesn't actually require RSelenium, so if you're open to a more typical approach then this answer might work for you. The website uses POST requests to pull down the data as a zip file, so all we need to do is make our own POST request. I prefer the httr package for this but you could use whichever one you'd like.

    We get the information we need (body and headers) by submitting a single request manually and inspecting it in the Network tab of Chrome's devtools.

    In this case, the important information is the request URL

    http://websempre.rio.rj.gov.br/dados/pluviometricos/plv/

    the request headers

    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
    Accept-Encoding: gzip, deflate
    Accept-Language: en-US,en;q=0.9
    Cache-Control: max-age=0
    Connection: keep-alive
    Content-Length: 949
    Content-Type: application/x-www-form-urlencoded
    Cookie: _ga=GA1.4.856843378.1683144629; _gid=GA1.4.1868063308.1683144629; BIGipServer~interno~pool_websempre_http=rd1o00000000000000000000ffff0a02df72o80; _gat=1; TS01a4bab6=01a427213d9189188aaff0fbe3a73727c18f2fc4dc5b10c78d93f8867a481703e3840be508d0f33440460c6f7de39c5d1e4e830651541ff39ba0a1913d3ce11fc2a21fb05d; TS97dc297c027=087c8a1c25ab2000cfabcd20d6f9ccacc7398aab844381c3d40417e34a9d0935ff715cc2f2b63ac208608fab44113000be24740de9c4c96989c4111d3b3ee12ea1b6b438e414a7536af359832b14805819bed161727885be6f511f76c8015cd3
    DNT: 1
    Host: websempre.rio.rj.gov.br
    Origin: http://websempre.rio.rj.gov.br
    Referer: http://websempre.rio.rj.gov.br/dados/pluviometricos/plv/
    Upgrade-Insecure-Requests: 1
    User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36
    

    and the request body (from the "Payload" tab)

    csrfmiddlewaretoken=t0QnQFV4xRrh3eXmIHaxGRAXCDeDBM7F&1-check=on&1-choice=1997&2-check=on&2-choice=1997&3-check=on&3-choice=1997&4-check=on&4-choice=1997&5-check=on&5-choice=1997&6-check=on&6-choice=1997&7-check=on&7-choice=1997&8-check=on&8-choice=1997&9-check=on&9-choice=1997&10-check=on&10-choice=1997&11-check=on&11-choice=1997&12-check=on&12-choice=1997&13-check=on&13-choice=1997&14-check=on&14-choice=1997&15-check=on&15-choice=1997&16-check=on&16-choice=1997&17-check=on&17-choice=1997&18-check=on&18-choice=1997&19-check=on&19-choice=1997&20-check=on&20-choice=1997&21-check=on&21-choice=1997&22-check=on&22-choice=1997&23-check=on&23-choice=1997&24-check=on&24-choice=1997&25-check=on&25-choice=1997&26-check=on&26-choice=1997&27-check=on&27-choice=1997&28-check=on&28-choice=1997&29-check=on&29-choice=1997&30-check=on&30-choice=1997&31-check=on&31-choice=1997&32-check=on&32-choice=1997&33-check=on&33-choice=1997&all-chek=on&choice=1997
    

    All we need to do now is code those up in an R-friendly way. The request body has to be a named list, so we split the string with strsplit, reshape it with separate and pull (from the tidyr and dplyr packages, respectively), and finish with as.list:

    chromebody <- "csrfmiddlewaretoken=pZOjhFqzBVeajAXAWhuNOctqSJ1GU04t&1-check=on&1-choice=1997&2-check=on&2-choice=1997&3-check=on&3-choice=1997&4-check=on&4-choice=1997&5-check=on&5-choice=1997&6-check=on&6-choice=1997&7-check=on&7-choice=1997&8-check=on&8-choice=1997&9-check=on&9-choice=1997&10-check=on&10-choice=1997&11-check=on&11-choice=1997&12-check=on&12-choice=1997&13-check=on&13-choice=1997&14-check=on&14-choice=1997&15-check=on&15-choice=1997&16-check=on&16-choice=1997&17-check=on&17-choice=1997&18-check=on&18-choice=1997&19-check=on&19-choice=1997&20-check=on&20-choice=1997&21-check=on&21-choice=1997&22-check=on&22-choice=1997&23-check=on&23-choice=1997&24-check=on&24-choice=1997&25-check=on&25-choice=1997&26-check=on&26-choice=1997&27-check=on&27-choice=1997&28-check=on&28-choice=1997&29-check=on&29-choice=1997&30-check=on&30-choice=1997&31-check=on&31-choice=1997&32-check=on&32-choice=1997&33-check=on&33-choice=1997&all-chek=on&choice=1997"
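    # httr supplies POST(), add_headers() and write_disk(); tidyr and dplyr supply separate() and pull()
    library(httr)
    library(tidyr)
    library(dplyr)
    # split "name=value&name=value&..." into a named list of form fields
    body <- strsplit(chromebody, "&")[[1]] %>%
      data.frame(init=.) %>%
      separate(init, into = c("name", "value"), sep = "=") %>%
      pull(value, name) %>%
      as.list()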
    

    The middleware token here also seems to change with each request, so you'll probably have to sub in one of your own.
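Since the token has to be replaced for each session, it may be easier to read it from the form page than to copy it by hand. The following is only a sketch in base R. It assumes the token appears in a hidden input named csrfmiddlewaretoken with the name attribute before the value attribute; extract_csrf_token and the sample HTML are made up for illustration and have not been checked against the real page.

```r
# Pull the CSRF token out of a page's HTML.
# Assumption: the page contains <input ... name="csrfmiddlewaretoken" value="...">
extract_csrf_token <- function(html) {
  pattern <- "name=['\"]csrfmiddlewaretoken['\"][^>]*value=['\"]([^'\"]+)['\"]"
  hit <- regmatches(html, regexec(pattern, html))[[1]]
  if (length(hit) < 2) stop("no csrfmiddlewaretoken field found in the page")
  hit[2]  # first capture group = the token value
}

# Made-up HTML fragment standing in for the real form page
sample_html <- "<form method='post'><input type='hidden' name='csrfmiddlewaretoken' value='abc123XYZ' /></form>"
token <- extract_csrf_token(sample_html)
print(token)
```

With the real site, the HTML could come from something like content(GET(form_url), as = "text"), and the result would go into body$csrfmiddlewaretoken. Whether the server also expects the cookies from that same GET is something you would need to check.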

    Then add the necessary headers as a named character vector wrapped in add_headers():

    heads <- add_headers(c(
      Accept="text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
      `Accept-Encoding`="gzip, deflate",
      `Accept-Language`="en-US,en;q=0.9",
      `Cache-Control`="max-age=0",
      Connection="keep-alive",
      # `Content-Length`="561",
      # `Content-Type`="application/x-www-form-urlencoded",
      Cookie="_ga=GA1.4.856843378.1683144629; _gid=GA1.4.1868063308.1683144629; _gat=1; BIGipServer~interno~pool_websempre_http=rd1o00000000000000000000ffff0a02df72o80; TS01a4bab6=01a427213d59269d7c5c5786c4e31eb85e255c954942cbdc35eef8018262ac13b6852857ef0c654e412681b104aa44a4962091e7352e34338e28f74cd80e4856cf86705e54; TS97dc297c027=087c8a1c25ab2000d80e586f48bcd94dd8b861f67ab2f6791bcc83cfd93a01b01d36fddcc19b973a08ae4bf3f211300008ca4399aa2b8c0ad52ff748804cba793ea0af1daa5e9b0b374cba61997313ecab8ca3cd8268dd0a9172e0dc4788c29e",
      Host="websempre.rio.rj.gov.br",
      Origin="http://websempre.rio.rj.gov.br",
      Referer="http://websempre.rio.rj.gov.br/dados/pluviometricos/plv/",
      `User-Agent`="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36"
    ))
    

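    I commented out Content-Length and Content-Type because they were causing issues, most likely because httr sets both itself: with a list body, POST() defaults to multipart encoding, so a hard-coded form-urlencoded Content-Type and a fixed length no longer match what is actually sent. Everything else is basically verbatim from the devtools Network tab. The cookie here may also change, so you'll likely have to sub in your own values after submitting a request of your own.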

    Then all we need to do is make the POST request with these arguments. I use write_disk() because I couldn't figure out how to decode and unzip the response while staying in memory. I write the file to my Downloads folder, but you'll likely want to change the path to your working directory.

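    base_url <- "http://websempre.rio.rj.gov.br/dados/pluviometricos/plv/"  # the request URL from devtools
    post_response <- POST(base_url, body = body, config = heads, write_disk(path = "~/../Downloads/tempfile.zip", overwrite = TRUE))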
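Once the zip is on disk, it still has to be unpacked. A minimal sketch using utils::unzip from base R follows; unpack_download is a hypothetical helper name, and extracting into tempdir() by default is just a placeholder choice.

```r
# List and extract the contents of a downloaded zip archive.
# Returns the paths of the extracted files.
unpack_download <- function(zip_path, exdir = tempdir()) {
  if (!file.exists(zip_path)) stop("zip file not found: ", zip_path)
  listing <- utils::unzip(zip_path, list = TRUE)  # data frame with Name, Length, Date
  utils::unzip(zip_path, exdir = exdir)           # extract everything into exdir
  file.path(exdir, listing$Name)
}
```

Usage would be something like unpack_download("~/../Downloads/tempfile.zip", exdir = "data"). If the server answered with an error page instead of an archive, unzip will fail, so it is worth checking status_code(post_response) first.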
    

    However, note that this only pulls down files for a single year. You'll have to write a quick loop to pull down the files from every year by replacing the "1997" in the current request body with 1998, 1999, etc.
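The loop could look roughly like the sketch below. It assumes the year is carried by every field named choice or ending in -choice, as in the request body above. The helper names (set_year, download_years), the output file names, and the 5-second pause are made up for illustration. The download function takes the request URL and the headers object as arguments and does nothing until it is called.

```r
# Return a copy of the parsed form body with every year field set to `year`.
# Assumption: year fields are those named "choice" or ending in "-choice".
set_year <- function(body, year) {
  is_year_field <- grepl("(^|-)choice$", names(body))
  body[is_year_field] <- as.character(year)
  body
}

# Download one zip per year, pausing between requests.
# Needs httr installed; `heads` is the add_headers() object, `base_url` the request URL.
download_years <- function(body, years, base_url, heads, out_dir = ".", pause = 5) {
  for (year in years) {
    out_file <- file.path(out_dir, paste0("pluviometricos_", year, ".zip"))
    httr::POST(base_url, body = set_year(body, year), config = heads,
               httr::write_disk(out_file, overwrite = TRUE))
    Sys.sleep(pause)  # wait before the next request to keep the rate low
  }
  invisible(NULL)
}

# Offline check on a cut-down body (made-up values)
demo_body <- list(csrfmiddlewaretoken = "token", `1-check` = "on",
                  `1-choice` = "1997", choice = "1997")
str(set_year(demo_body, 1998))
```

A call such as download_years(body, 1997:1999, base_url, heads) would then fetch three archives. If the server rejects a reused token, the token and cookie would have to be refreshed inside the loop as well.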

    Finally, note that you're hitting their server with a lot of requests for this data, so please be mindful of the request rate.