I am trying to get all links titled "Read More" from this page using RSelenium and rvest.
The code I'm using is the following:
igop_get_links <- function(url = "https://igop.uab.cat/category/publicacions/"){
  site <- rvest::read_html(url)
  taula <- rvest::html_elements(site, ".paginated_content")
  text <- rvest::html_text(rvest::html_elements(taula, "a"))
  links <- rvest::html_attr(rvest::html_elements(taula, "a"), "href")
  df <- data.frame(text = text,
                   url = links)
  df <- df[df$text == "Read More",]
  return(df)
}
igop_get_pages <- function(url = "https://igop.uab.cat/category/publicacions"){
  links <- igop_get_links(url)
  # get max number of pages
  site <- rvest::read_html(url)
  max <- rvest::html_text(rvest::html_elements(site, ".pagination"))
  max <- strsplit(max, "\n\t\t\t\t")
  max <- sapply(max, function(x) gsub("\n|\t|\\.{3}", "", x), USE.NAMES = FALSE)
  max <- max(as.numeric(max[max != ""]))
  remDr <- RSelenium::rsDriver(
    remoteServerAddr = "localhost",
    port = 4445L,
    browser = "firefox",
    chromever = NULL,
    iedrver = NULL,
    phantomver = NULL
  )
  remDr <- remDr[["client"]]
  remDr$navigate(url)
  for(i in 1:(max-1)){
    webElem <- remDr$findElement(using = 'css selector', "a.next")
    webElem$clickElement()
    remDr$setTimeout(type = "page load", milliseconds = 10000)
    linkspage <- igop_get_links(remDr$getCurrentUrl()[[1]])
    links <- rbind(links, linkspage)
    # linkspage <- s |>
    #   rvest::session_follow_link(css = "a.next") |>
    #   igop_get_links()
    # links <- rbind(links, linkspage)
  }
  remDr$close()
  return(links)
}
However, when I run t3 <- igop_get_pages(), one of the following three things happens, without me changing any of the code.
It crashes and returns the following error:
Selenium message:No active session with ID 87c316d8-ded8-41e7-94d7-4a119e4006c1
Error: Summary: NoSuchDriver
Detail: A session is either terminated or not started
Further Details: run errorDetails method
It crashes with the following message:
Could not open firefox browser.
Client error message:
Summary: UnknownError
Detail: An unknown server-side error occurred while processing the command.
Further Details: run errorDetails method
Check server log for further details.
Error in checkError(res) :
Undefined error in httr call. httr output: length(url) == 1 is not TRUE
Or it doesn't throw any error but cannot navigate past the second page: it reads the first page, clicks the "next" button, reads the second page, and then goes back to the first page and repeats the process. This should not be happening, since the "previous" button has a different CSS selector (a.prev, predictably). I have also tried rvest::session_follow_link, but it does not work, because the URL itself does not change (it's always https://igop.uab.cat/category/publicacions/# instead of https://igop.uab.cat/category/publicacions/2-3-whatever).
I am using firefox 118.0.2 on Windows.
The content is updated by Ajax calls: a POST request is sent to admin-ajax.php, which returns the articles for the requested page number. You can find that call among the requests in the network tab of your browser's dev tools, and you can mimic it yourself. Instead of RSelenium, though, I'd recommend handling this with just rvest and httr2. You can copy the actual request from your browser's dev tools as cURL and pass it through httr2::curl_translate() to get translated httr2 code, which you can then tweak further, for example to test whether all those headers are actually required and whether the request parameters can be changed. Apparently we can increase posts_per_page, and if we also set to_page to 1, we can get all 60 articles with just one request. posts_per_page does not have to match the actual article count; something like 100 works too.
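To make parameters such as to_page and posts_per_page easier to change than in one long string, the body could be assembled from a named list. This is only a minimal base R sketch: build_form_body() is a hypothetical helper, not part of httr2, and the list below is just a subset of the parameters from the captured request (the full set, including the nonce, is in the request string further down).

```r
# Hypothetical helper (not part of httr2): URL-encode the names and values of
# a named list and join them as "key=value" pairs separated by "&".
build_form_body <- function(params) {
  encode <- function(x) {
    vapply(x, utils::URLencode, character(1), reserved = TRUE, USE.NAMES = FALSE)
  }
  values <- vapply(params, as.character, character(1))
  paste0(encode(names(params)), "=", encode(values), collapse = "&")
}

# Subset of the parameters seen in the captured request, for illustration only.
body <- build_form_body(list(
  action = "extra_blog_feed_get_content",
  to_page = 1,
  posts_per_page = 100,
  `tax_query[0][taxonomy]` = "category"
))
print(body)
```

The resulting string can be passed to req_body_raw() together with the "application/x-www-form-urlencoded; charset=UTF-8" content type, in the same way as the hand-written body.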
The following example extracts 3 links from every article: title, author and comment count.
library(httr2)
library(rvest)
library(dplyr)
library(tidyr)
library(purrr)
# request list of articles through WordPress admin-ajax.php,
# a POST call, so we'll use httr2;
# call is extracted from browser's dev tools as cURL, translated with
# httr2::curl_translate(), a few parts removed by trial and error;
# modified "to_page=1&posts_per_page=100" to control returned article collection
request("https://igop.uab.cat/wp-admin/admin-ajax.php") %>%
req_body_raw("action=extra_blog_feed_get_content&et_load_builder_modules=1&blog_feed_nonce=7e1f0a6567&to_page=1&posts_per_page=100&order=desc&orderby=date&categories=226&show_featured_image=1&blog_feed_module_type=masonry&et_column_type=&show_author=1&show_categories=1&show_date=1&show_rating=1&show_more=1&show_comments=1&date_format=M+j%2C+Y&content_length=excerpt&hover_overlay_icon=&use_tax_query=1&tax_query%5B0%5D%5Btaxonomy%5D=category&tax_query%5B0%5D%5Bterms%5D%5B%5D=publications-en&tax_query%5B0%5D%5Bfield%5D=slug&tax_query%5B0%5D%5Boperator%5D=IN&tax_query%5B0%5D%5Binclude_children%5D=true", "application/x-www-form-urlencoded; charset=UTF-8") %>%
req_perform() %>%
resp_body_html() %>%
# extract article elements, returns xml_nodeset that we can process as a list
html_elements("article") %>%
# extract title / author / comments elements from every article,
# we'll have a list of named lists of html_nodes
map(\(a) list(
  title = html_element(a, ".post-title.entry-title a"),
  author = html_element(a, ".vcard a[rel='author']"),
  comments = html_element(a, ".vcard a.comments-link")
)) %>%
# apply a function to every html_node in our list (60 x 3) to extract href and text
map_depth(2, \(a) list(url = html_attr(a, "href"),
                       text = html_text(a) %>% trimws())) %>%
# current item structure looks like this:
# $ :List of 3
# ..$ title :List of 2
# .. ..$ url : chr "https://igop.uab.cat/2023/03/04/el-arte-de-pactar/"
# .. ..$ text: chr "El arte de pactar"
# ..$ author :List of 2
# .. ..$ url : chr "https://igop.uab.cat/author/igop/"
# .. ..$ text: chr "IGOP"
# ..$ comments:List of 2
# .. ..$ url : chr "https://igop.uab.cat/2023/03/04/el-arte-de-pactar/#comments"
# .. ..$ text: chr "0"
# rbind list and convert to tibble of 3 nested columns (title, author, comments;
# each column includes url & text)
do.call(rbind, args = .) %>% as.data.frame() %>%
as_tibble() %>%
# unnest to get 6 columns
unnest_wider(everything(), names_sep = ".")
Result:
#> # A tibble: 60 × 6
#> title.url title.text author.url author.text comments.url comments.text
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 https://igop.ua… El arte d… https://i… IGOP https://igo… 0
#> 2 https://igop.ua… Intersect… https://i… IGOP https://igo… 0
#> 3 https://igop.ua… EU agenci… https://i… IGOP https://igo… 0
#> 4 https://igop.ua… El apoyo … https://i… IGOP https://igo… 0
#> 5 https://igop.ua… The doubl… https://i… IGOP https://igo… 0
#> 6 https://igop.ua… Evaluatin… https://i… IGOP https://igo… 0
#> 7 https://igop.ua… Residenci… https://i… IGOP https://igo… 0
#> 8 https://igop.ua… Governmen… https://i… IGOP https://igo… 0
#> 9 https://igop.ua… Beyond re… https://i… IGOP https://igo… 0
#> 10 https://igop.ua… The emerg… https://i… IGOP https://igo… 0
#> # ℹ 50 more rows
Created on 2023-10-24 with reprex v2.0.2
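If only one link per article is needed, a flatter variant may be enough. html_element() applied to a set of nodes returns exactly one result per input node, so the columns stay aligned with the articles without map() or unnesting. The HTML below is a made-up stand-in for the admin-ajax.php response (the example.org URLs and post titles are placeholders); only the selector is taken from the code above.

```r
library(rvest)

# Made-up fragment standing in for the real response; placeholder URLs and titles.
page <- minimal_html('
  <article>
    <h2 class="post-title entry-title"><a href="https://example.org/first/">First post</a></h2>
  </article>
  <article>
    <h2 class="post-title entry-title"><a href="https://example.org/second/"> Second post </a></h2>
  </article>
')

articles <- html_elements(page, "article")

# One title link per article, in the same order as `articles`.
title_links <- html_element(articles, ".post-title.entry-title a")

links <- data.frame(
  text = trimws(html_text(title_links)),
  url = html_attr(title_links, "href")
)
print(links)
```

With the real response, `page` would be the result of resp_body_html() from the request above.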