r rvest rselenium

RSelenium not moving to third page, or crashing with "No active session with ID" or "unknown server-side error"


I am trying to get all links titled "Read More" from this page using RSelenium and rvest.

The code I'm using is the following:

igop_get_links <- function(url = "https://igop.uab.cat/category/publicacions/"){
  site <- rvest::read_html(url)
  taula <- rvest::html_elements(site, ".paginated_content")
  text <- rvest::html_text(rvest::html_elements(taula, "a"))
  links <- rvest::html_attr(rvest::html_elements(taula, "a"), "href")
  df <- data.frame(text = text,
                   url = links)
  df <- df[df$text== "Read More",]
  return(df)
}

igop_get_pages <- function(url = "https://igop.uab.cat/category/publicacions"){
  links <- igop_get_links(url)
  # get max number of pages
  site <- rvest::read_html(url)
  max <- rvest::html_text(rvest::html_elements(site, ".pagination"))
  max <- strsplit(max, "\n\t\t\t\t")
  max <- sapply(max, function(x) gsub("\n|\t|\\.{3}", "", x), USE.NAMES = FALSE)
  max <- max(as.numeric(max[max != ""]))
  remDr <- RSelenium::rsDriver(
    remoteServerAddr = "localhost",
    port = 4445L,
    browser = "firefox",
    chromever = NULL,
    iedrver = NULL,
    phantomver = NULL
  )
  remDr <- remDr[["client"]]
  remDr$navigate(url)
  for(i in 1:(max-1)){
    webElem <- remDr$findElement(using = 'css selector',"a.next")
    webElem$clickElement()
    remDr$setTimeout(type = "page load", milliseconds = 10000)
    linkspage <- igop_get_links(remDr$getCurrentUrl()[[1]])
    links <- rbind(links, linkspage)
    # linkspage <- s |>
    #   rvest::session_follow_link(css = "a.next") |>
    #   igop_get_links()
    # links <- rbind(links, linkspage)
  }
  remDr$close()
  return(links)

}

However, when I run t3 <- igop_get_pages(), one of three things happens, without me changing any of the code. Sometimes it crashes and returns the following error:

Selenium message:No active session with ID 87c316d8-ded8-41e7-94d7-4a119e4006c1

Error:   Summary: NoSuchDriver
     Detail: A session is either terminated or not started
     Further Details: run errorDetails method

Sometimes it crashes with the following message:

Could not open firefox browser.
Client error message:
     Summary: UnknownError
     Detail: An unknown server-side error occurred while processing the command.
     Further Details: run errorDetails method
Check server log for further details.
Error in checkError(res) : 
  Undefined error in httr call. httr output: length(url) == 1 is not TRUE

Or it throws no error but cannot navigate past the second page: it reads the first page, clicks the "next" button, reads the second page, then goes back to the first page and repeats the process. This should not happen, since the "previous" button has a different CSS selector (a.prev, predictably). I have tried rvest::session_follow_link, but it does not work because the URL itself does not change (it is always https://igop.uab.cat/category/publicacions/# rather than https://igop.uab.cat/category/publicacions/2, 3, and so on).

I am using firefox 118.0.2 on Windows.


Solution

  • The content is updated by Ajax calls: a POST request is sent to admin-ajax.php, which returns the articles for the requested page number. You can find that call among the requests in the network tab of your browser's dev tools, and you can mimic it yourself. Instead of RSelenium, though, I'd recommend handling this with just rvest and httr2: copy the actual request from the dev tools as cURL and pass it through httr2::curl_translate() to get equivalent httr2 code, which you can then tweak, for example to test whether all those headers are actually required and whether the request parameters can be adjusted. Apparently we can increase posts_per_page, and if we also set to_page to 1 we get all 60 articles with a single request. posts_per_page does not have to match the actual article count; a value like 100 works too.
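
    To make that kind of parameter tweaking less error-prone than editing one long string, the POST body can be assembled from named fields. This is only a sketch in base R: build_feed_body() is a hypothetical helper, not something httr2 or the site provides, and it lists just a few of the captured fields; the remaining ones (including the nonce) would go in through extra. Whether the server accepts a reduced set of fields has not been checked here.

    ```r
    # Hypothetical helper (an assumption for illustration): build a
    # url-encoded form body from named fields so that to_page and
    # posts_per_page are easy to change.
    build_feed_body <- function(to_page = 1, posts_per_page = 100, extra = list()) {
      params <- c(
        list(
          action         = "extra_blog_feed_get_content",
          to_page        = to_page,
          posts_per_page = posts_per_page,
          order          = "desc",
          orderby        = "date"
        ),
        extra  # any other fields copied from the captured request
      )
      # percent-encode keys and values (spaces become %20 rather than "+")
      keys   <- vapply(names(params), utils::URLencode, character(1), reserved = TRUE)
      values <- vapply(
        params,
        function(v) utils::URLencode(as.character(v), reserved = TRUE),
        character(1)
      )
      paste(keys, values, sep = "=", collapse = "&")
    }

    body <- build_feed_body(
      to_page = 1,
      posts_per_page = 100,
      extra = list("tax_query[0][taxonomy]" = "category")
    )
    print(body)

    # Not run here (needs network access and the remaining captured fields):
    # httr2::request("https://igop.uab.cat/wp-admin/admin-ajax.php") |>
    #   httr2::req_body_raw(body, "application/x-www-form-urlencoded; charset=UTF-8") |>
    #   httr2::req_perform()
    ```

    The resulting string can be passed to req_body_raw() in the same way as the hard-coded body in the example that follows.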

    The following example extracts 3 links from every article: title, author and comment count.

    library(httr2)
    library(rvest)
    library(dplyr)
    library(tidyr)
    library(purrr)
    
    # request the list of articles through WordPress admin-ajax.php;
    # it's a POST call, so we'll use httr2;
    # the call was copied from the browser's dev tools as cURL, translated with
    # httr2::curl_translate(), and a few parts were removed by trial and error;
    # "to_page=1&posts_per_page=100" was modified to control the returned article collection
    request("https://igop.uab.cat/wp-admin/admin-ajax.php") %>% 
      req_body_raw("action=extra_blog_feed_get_content&et_load_builder_modules=1&blog_feed_nonce=7e1f0a6567&to_page=1&posts_per_page=100&order=desc&orderby=date&categories=226&show_featured_image=1&blog_feed_module_type=masonry&et_column_type=&show_author=1&show_categories=1&show_date=1&show_rating=1&show_more=1&show_comments=1&date_format=M+j%2C+Y&content_length=excerpt&hover_overlay_icon=&use_tax_query=1&tax_query%5B0%5D%5Btaxonomy%5D=category&tax_query%5B0%5D%5Bterms%5D%5B%5D=publications-en&tax_query%5B0%5D%5Bfield%5D=slug&tax_query%5B0%5D%5Boperator%5D=IN&tax_query%5B0%5D%5Binclude_children%5D=true", "application/x-www-form-urlencoded; charset=UTF-8") %>% 
      req_perform() %>% 
      resp_body_html() %>% 
      # extract article elements; returns an xml_nodeset that we can process as a list
      html_elements("article") %>% 
      # extract title / author / comments elements from every article;
      # we'll have a list of named lists of html_nodes
      map(\(a) list(
        title = html_element(a, ".post-title.entry-title a"),
        author = html_element(a, ".vcard a[rel='author']"),
        comments = html_element(a, ".vcard a.comments-link")
        )) %>% 
      # apply a function to every html_node in our list (60 x 3) to extract href and text
      map_depth(2, \(a) list(url = html_attr(a, "href"),
                             text = html_text(a) %>% trimws())) %>% 
      # current item structure looks like this:
      # $ :List of 3
      #  ..$ title   :List of 2
      #  .. ..$ url : chr "https://igop.uab.cat/2023/03/04/el-arte-de-pactar/"
      #  .. ..$ text: chr "El arte de pactar"
      #  ..$ author  :List of 2
      #  .. ..$ url : chr "https://igop.uab.cat/author/igop/"
      #  .. ..$ text: chr "IGOP"
      #  ..$ comments:List of 2
      #  .. ..$ url : chr "https://igop.uab.cat/2023/03/04/el-arte-de-pactar/#comments"
      #  .. ..$ text: chr "0"
      
      # rbind the list and convert to a tibble of 3 nested columns (title, author, comments);
      # each column includes url & text
      do.call(rbind, args = .) %>% as.data.frame() %>%
      as_tibble() %>% 
      # unnest to get 6 columns
      unnest_wider(everything(), names_sep = ".")
    

    Result:

    #> # A tibble: 60 × 6
    #>    title.url        title.text author.url author.text comments.url comments.text
    #>    <chr>            <chr>      <chr>      <chr>       <chr>        <chr>        
    #>  1 https://igop.ua… El arte d… https://i… IGOP        https://igo… 0            
    #>  2 https://igop.ua… Intersect… https://i… IGOP        https://igo… 0            
    #>  3 https://igop.ua… EU agenci… https://i… IGOP        https://igo… 0            
    #>  4 https://igop.ua… El apoyo … https://i… IGOP        https://igo… 0            
    #>  5 https://igop.ua… The doubl… https://i… IGOP        https://igo… 0            
    #>  6 https://igop.ua… Evaluatin… https://i… IGOP        https://igo… 0            
    #>  7 https://igop.ua… Residenci… https://i… IGOP        https://igo… 0            
    #>  8 https://igop.ua… Governmen… https://i… IGOP        https://igo… 0            
    #>  9 https://igop.ua… Beyond re… https://i… IGOP        https://igo… 0            
    #> 10 https://igop.ua… The emerg… https://i… IGOP        https://igo… 0            
    #> # ℹ 50 more rows
    

    Created on 2023-10-24 with reprex v2.0.2
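
    Since the question asked specifically for the "Read More" links, the same text filter used in igop_get_links() could be applied to the HTML returned by the request. The sketch below runs on stand-in HTML built with rvest::minimal_html(); the markup, the read-more-button class name and the example.org URLs are assumptions made for illustration, not copied from the site. It also assumes that the Ajax response contains such links at all, which show_more=1 in the request body suggests but does not guarantee.

    ```r
    library(rvest)

    # Stand-in for the parsed Ajax response; in real use this would be the
    # result of resp_body_html(). Markup and URLs here are hypothetical.
    page <- minimal_html('
      <article>
        <h2 class="post-title entry-title"><a href="https://example.org/post-1/">First post</a></h2>
        <a class="read-more-button" href="https://example.org/post-1/">Read More</a>
      </article>
      <article>
        <h2 class="post-title entry-title"><a href="https://example.org/post-2/">Second post</a></h2>
        <a class="read-more-button" href="https://example.org/post-2/">Read More</a>
      </article>
    ')

    # collect every link inside an article, then keep those labelled "Read More"
    anchors <- html_elements(page, "article a")
    links <- data.frame(
      text = trimws(html_text(anchors)),
      url  = html_attr(anchors, "href"),
      stringsAsFactors = FALSE
    )
    read_more <- links[links$text == "Read More", ]
    print(read_more)
    ```

    Filtering on the link text rather than on a class name keeps the sketch independent of the theme's markup, at the cost of breaking if the label is translated or changed.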