I have been working on a step-by-step solution to find the corresponding author in each page of collection_html_sub_pages.
I inspected the website and saw that the author is marked up as <a id="corresp-c1" href="mailto:FName@email.com"> FName LName</a>
I built the following code. It works as follows: it reads the initial page and collects the href of each individual article. It is then supposed to use html_node to find that tag in each individual article. Using lapply and html_text, I should then be able to extract all the corresponding authors (usually just one per article). However, I am stuck at just getting the tag, and I do not know where the mistake in my code is.
Both correspondence_authors and t1 come back empty. Any advice on how I could improve my code to get the desired result would be appreciated.
library(httr) # will be used to make HTTP GET and POST requests
library(rvest) # will be used to parse HTML
library(xml2)
library(tidyr) # will be used to remove NA
library(tidyverse)
article_year <- function(year){
}
str_1 <- "https://molecularbrain.biomedcentral.com/articles"
prefix_str_1 <- "https://molecularbrain.biomedcentral.com/"
doc <- httr::GET(str_1)
html <- read_html(content(doc, "text"))
#################### Title ####################
c_listing_title <- html_elements(html,"h3.c-listing__title")
a_element <- html_node(c_listing_title,"a")
a_href <- as.list(html_attr(a_element,"href"))
a_text <- lapply(a_element,html_text)
##################### 2 Page Depth #######################
merge_strings <- function(x){
paste0(prefix_str_1,x)
}
sub_pages <- lapply(a_href,merge_strings)
########################Function_Read_Sub_Pages#####################
read_page_1 <- function(x){
webpages <- httr::GET(x)
html <- rvest::read_html(httr::content(webpages, "text"))
return(html)
}
collection_html_sub_pages <- lapply(sub_pages,read_page_1)
##########################Correspondence_Author###################
correspondence_search <- function(x){
rvest::html_node(x,"a#corresp-c1")
}
collection_html_sub_pages[[1]]
t1 <- rvest::html_element(collection_html_sub_pages[[1]],paste0('#corresp-c1'))
t2 <- rvest::html_elements(t1,"p")
correspondence_authors <- lapply(collection_html_sub_pages, correspondence_search)
I have used helper functions to structure my code and would like to keep using them, so that the code stays well organized and easy to troubleshoot. Everything in the code above works except the part that extracts the corresponding author.
The article URLs you create are not valid paths on that web server. When you paste0() prefix_str_1 and a_href, the first ends with a / and the latter starts with a /, so the resulting URLs look like this: https://molecularbrain.biomedcentral.com//articles/10.1186/s13041-023-01014-0; the correct URL would be https://molecularbrain.biomedcentral.com/articles/10.1186/s13041-023-01014-0 (no double / before articles).
The easiest fix is to define prefix_str_1 without a trailing /.
prefix_str_1 <- "https://molecularbrain.biomedcentral.com"
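If you would rather not depend on how the prefix happens to be written, one option is to normalise the slashes when joining. This is only a minimal sketch: join_url is a hypothetical helper name, and the a_href value is assumed to have the form of the links on the index page (a path with a leading /).

```r
# Hypothetical helper (not part of the original code): joins a base URL and a
# path with exactly one "/" between them, whatever slashes each side carries.
join_url <- function(base, path) {
  base <- sub("/+$", "", base)  # drop trailing slashes from the base
  path <- sub("^/+", "", path)  # drop leading slashes from the path
  paste0(base, "/", path)
}

prefix_str_1 <- "https://molecularbrain.biomedcentral.com/"
a_href <- "/articles/10.1186/s13041-023-01014-0"  # assumed href format

paste0(prefix_str_1, a_href)    # naive join: a double "/" ends up in the URL
join_url(prefix_str_1, a_href)  # normalised join: a single "/"
```

Because sub() and paste0() are vectorised, the same helper should also work on the whole vector of hrefs at once.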
You can also significantly simplify your code.
library(rvest)
base_url <- "https://molecularbrain.biomedcentral.com"
index_html <- read_html(file.path(base_url, "articles"))
# Title and Links ---------------------------------------------------------
a_elements <- html_elements(index_html, "h3.c-listing__title a")
a_href <- html_attr(a_elements, "href")
a_text <- html_text(a_elements)
# subpages ----------------------------------------------------------------
html_sub_pages <-
lapply(paste0(base_url, a_href),
read_html)
# Correspondence Author ---------------------------------------------------
lapply(html_sub_pages,
html_elements,
"#corresp-c1") |>
lapply(html_text)
#> [[1]]
#> [1] "Chao Qin"
#>
#> [[2]]
#> [1] "Won Do Heo"
#>
#> [[3]]
#> [1] "Seung-Jae Lee"
#> ...
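Since you mentioned wanting to keep helper functions, the extraction step could be wrapped roughly like this. It is a sketch under assumptions: extract_corresponding_author is a hypothetical name, and the two pages are inline placeholder HTML built with minimal_html() so the behaviour can be checked without hitting the website.

```r
library(rvest)

# Hypothetical helper: returns the text of the a#corresp-c1 link,
# or NA_character_ when the page has no such element.
extract_corresponding_author <- function(page) {
  node <- html_element(page, "a#corresp-c1")
  html_text(node, trim = TRUE)
}

# Stand-in pages built from inline HTML (placeholder data, not real articles).
pages <- list(
  minimal_html('<p><a id="corresp-c1" href="mailto:FName@email.com"> FName LName</a></p>'),
  minimal_html("<p>No corresponding author link here</p>")
)

# One author (or NA) per page, returned as a character vector.
authors <- vapply(pages, extract_corresponding_author, character(1))
authors
```

With the real pages you would call it as vapply(html_sub_pages, extract_corresponding_author, character(1)). An NA in the result then marks a page where the element was not found, which makes this kind of problem easier to spot than an empty list.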