css r web-scraping rvest buttonclick

How to simulate button click using rvest


I am trying to scrape a webpage that requires a button press to populate a table. I am able to achieve this for some buttons but not others. I am trying to create a reproducible pipeline using rvest, and I am not currently interested in solutions using RSelenium.

The table comes from the following website: https://osf.io/search?resourceType=Registration%2CRegistrationComponent

The table is populated in the left-hand margin upon clicking the "Date created" dropdown.

This is what I have so far:

library(rvest)

url <- "https://osf.io/search?resourceType=Registration%2CRegistrationComponent"
pr_sess <- read_html_live(url)
pr_sess$view()
pr_sess$click("#dateCreated-09883579765073658")

Which gives the following error: "Error in onRejected(reason) : code: -32000 message: Node does not have a layout object"

However, I am able to simulate a click for some elements. For example, I can simulate a click on the "Creator" dropdown directly above "Date created" with the following code:

pr_sess$click("#first-filter-08170256355625345")

I believe this is because I am using the defined ID in CSS for "Creator", but there is no such ID for "Date created". To be clear, the solution I am looking for should involve using pr_sess$click(), and the desired outcome would be toggling the dropdown menu.

EDIT: The code that did seem to work for me previously (pr_sess$click("#first-filter-08170256355625345")), no longer works upon restarting my R session. It seems that the first part of the ID is always the same (i.e., #first-filter-), but the numbers are always different. I'm starting to question whether there is actually a way to make this process reproducible.


Solution

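  • As you correctly noted, it's usually easiest to target an element by a unique HTML identifier. Here, however, the numeric suffix of the ID changes between sessions, so it is more reliable to use one of the element's other attributes. All HTML elements have specific attributes. Take the "Date created" button, which you want to click: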

    <button class="_Button_6kisxq _FakeLink_6kisxq _facet-expand-button_13c61c" data-analytics-name="Filter facet toggle Date created" data-test-filter-facet-toggle="Date created" aria-controls="dateCreated-08686056605854187" title="Date created" type="button">
    
      <span>Date created</span>
      <svg class="svg-inline--fa fa-caret-down" data-prefix="fas" data-icon="caret-down" aria-hidden="true" focusable="false" role="img" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 320 512">
      <path fill="currentColor" d="M31.3 192h257.3c17.8 0 26.7 21.5 14.1 34.1L174.1 354.8c-7.8 7.8-20.5 7.8-28.3 0L17.2 226.1C4.6 213.5 13.5 192 31.3 192z"></path>
      </svg>
    
    </button>
    

    The attributes are:

    class="_Button_6kisxq _FakeLink_6kisxq _facet-expand-button_13c61c" data-analytics-name="Filter facet toggle Date created" data-test-filter-facet-toggle="Date created" aria-controls="dateCreated-08686056605854187"
    

    Let's take data-test-filter-facet-toggle='Date created' for clicking on the "Date created" filter dropdown!
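Why the attribute selector is more robust than the ID can be checked offline. The sketch below is a minimal illustration, not the real page: the two HTML snippets are hand-written stand-ins for two page loads, reusing the two ID suffixes quoted in this thread, and the helper `n_matches` is a hypothetical name. It also hints at a possible reason for the "Node does not have a layout object" error: judging by `aria-controls`, the `dateCreated-…` ID seems to belong to the collapsed panel the button toggles rather than to the button itself, though that is an assumption about the page.

```r
library(rvest)

# Two hand-written stand-ins for the same button on two different page loads.
# Only the random ID suffix differs; the data-test attribute stays the same.
load_1 <- minimal_html('
  <button data-test-filter-facet-toggle="Date created"
          aria-controls="dateCreated-08686056605854187">Date created</button>
  <div id="dateCreated-08686056605854187" hidden></div>')
load_2 <- minimal_html('
  <button data-test-filter-facet-toggle="Date created"
          aria-controls="dateCreated-09883579765073658">Date created</button>
  <div id="dateCreated-09883579765073658" hidden></div>')

stable  <- "[data-test-filter-facet-toggle='Date created']"  # exact attribute match
prefix  <- "button[aria-controls^='dateCreated-']"           # "starts with" match
brittle <- "#dateCreated-08686056605854187"                  # ID copied from one load

# Hypothetical helper: number of elements a CSS selector matches on a page
n_matches <- function(page, css) length(html_elements(page, css))

result <- data.frame(
  selector = c("stable", "prefix", "brittle"),
  load_1 = c(n_matches(load_1, stable), n_matches(load_1, prefix), n_matches(load_1, brittle)),
  load_2 = c(n_matches(load_2, stable), n_matches(load_2, prefix), n_matches(load_2, brittle))
)
print(result)
```

The copied ID only matches the load it was copied from, whereas the attribute selector and the prefix selector match on both loads. The same selector strings can be passed to pr_sess$click().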

    By the way, you can find these attributes yourself in the browser's developer tools: click the little arrow (element picker) icon and then select the button element on the page.

    Finally, we can click the button, but only once it has appeared: the website does not load all at once, so we need to wait for the element. This is the first rule of web scraping: always wait for the elements to appear. I implemented something similar before on a website whose HTML element names kept changing, whose elements appeared in a different order, and whose loading times were not constant. I would therefore strongly advise always waiting for elements to appear first and making your code as robust as possible to changes in the web design. This page of yours seems to be extra slow sometimes!
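The "always wait" rule can be wrapped in a small reusable helper. This is only a sketch under assumptions: `wait_until` and `appears_on_third_check` are hypothetical names, not part of rvest, and the slow page is simulated by a counter so the example runs without a browser.

```r
# Poll `condition` (a zero-argument function) until it returns TRUE or
# `timeout` seconds have passed. Returns TRUE on success, FALSE on timeout.
# Errors raised by `condition` are treated as "not ready yet".
wait_until <- function(condition, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    ok <- tryCatch(isTRUE(condition()), error = function(e) FALSE)
    if (ok) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Stand-in for a slow page: the "element" only appears on the third check.
checks <- 0
appears_on_third_check <- function() {
  checks <<- checks + 1
  checks >= 3
}

appeared <- wait_until(appears_on_third_check, timeout = 5, interval = 0.1)
print(appeared)

# With a live session, the condition could instead be something like:
# function() length(html_elements(pr_sess, "[data-test-filter-facet-toggle='Date created']")) > 0
```

Returning FALSE on timeout instead of stopping leaves it to the caller to decide whether a missing element is fatal or just means "retry later".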

    Anyway, once the dropdown is open you can also extract the list of dates that appears, for example to select a filter value next.

    Code:

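    library(rvest)
    library(purrr)
    library(tibble)
    
    # Step 1: Open the page in a live session.
    # read_html_live() starts its own headless Chrome session (via the chromote
    # package, which must be installed), so no separate ChromoteSession is needed.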
    
    url <- "https://osf.io/search?resourceType=Registration%2CRegistrationComponent"
    pr_sess <- read_html_live(url)
    pr_sess$view()
    
    
    # Step 2: Wait for the "Date created" button to load, then click it
    selector <- "[data-test-filter-facet-toggle='Date created']"
    timeout <- 10  # Maximum wait time in seconds
    found <- FALSE
    start_time <- Sys.time()
    
    # A failed lookup is not guaranteed to return NULL, so poll on the
    # number of matching elements instead of testing for NULL.
    while (!found &&
           as.numeric(difftime(Sys.time(), start_time, units = "secs")) < timeout) {
      found <- tryCatch(
        length(html_elements(pr_sess, selector)) > 0,
        error = function(e) FALSE
      )
      if (!found) Sys.sleep(0.5)  # Check every 0.5 seconds
    }
    
    if (!found) {
      stop("Button did not appear within the timeout period.")
    }
    
    
    # Click the "Date created" dropdown
    pr_sess$click("[data-test-filter-facet-toggle='Date created']")
    
    # Step 3: Wait for the dropdown to load
    Sys.sleep(2)  # Adjust based on load time
    
    # Step 4: Extract the list items of the ul list below the filter
    facet_list <- pr_sess %>%
      html_elements("ul._facet-list_13c61c li._facet-value_13c61c")
    
    # Step 5: Parse the extracted items into a data frame
    facet_data <- facet_list %>%
      map_df(~ {
        year <- .x %>%
          html_element("button") %>%
          html_text2() %>%
          as.character()
        
        count <- .x %>%
          html_element("span._facet-count_13c61c") %>%
          html_text2() %>%
          as.integer()
        
        tibble(year = year, count = count)
      })
    
    # Print the extracted data
    print(facet_data)
    
    # To apply a filter next, click one of the list values, e.g.:
    # pr_sess$click("[data-test-filter-facet-value = '2024']")
    

    This prints the list of available creation years:

    > # Print the extracted data
    > print(facet_data)
    # A tibble: 14 × 2
       year  count
       <chr> <int>
     1 2024  31020
     2 2023  31001
     3 2022  28099
     4 2021  25604
     5 2020  24456
     6 2019  17142
     7 2018  13833
     8 2017   8751
     9 2016   5688
    10 2015   3314
    11 2014    954
    12 2013    717
    13 2012     91
    14 2011      2
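The parsing in Steps 4 and 5 can also be tried offline. The sketch below rests on an assumption: the HTML fragment is a hand-written imitation of the facet list that only reuses the class names from the selectors above, so the real markup may differ. It also swaps map_df() for a vectorised variant, which works because html_element() returns one result per list item.

```r
library(rvest)
library(tibble)

# Hand-written imitation of the facet list (assumed structure, not copied from the site)
facets <- minimal_html('
  <ul class="_facet-list_13c61c">
    <li class="_facet-value_13c61c">
      <button>2024</button><span class="_facet-count_13c61c">31020</span>
    </li>
    <li class="_facet-value_13c61c">
      <button>2023</button><span class="_facet-count_13c61c">31001</span>
    </li>
  </ul>')

# One node per list item
items <- html_elements(facets, "ul._facet-list_13c61c li._facet-value_13c61c")

# html_element() on a node set yields one match per item, so both columns
# can be built without an explicit loop.
facet_data <- tibble(
  year  = html_text2(html_element(items, "button")),
  count = as.integer(html_text2(html_element(items, "span._facet-count_13c61c")))
)
print(facet_data)
```

Against the live page, `facets` would be replaced by `pr_sess`; the selectors stay the same as long as the site keeps these class names.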
    

    This should give you a good head start! By the way, what is it that you want to do next?