
R: Find corresponding authors from subpages


I have been working on a step-by-step solution to find the corresponding author on each of the article subpages stored in collection_html_sub_pages.

I inspected the website and saw that the corresponding author sits in an anchor tag of the form <a id="corresp-c1" href="mailto:FName@email.com">FName LName</a>.

I built the following code. It starts from the initial listing page and collects the href of each individual article. It is then supposed to use html_node to find that tag in each individual article. With lapply and html_text I should then be able to extract all the corresponding authors (usually just one per article). However, I am stuck at the first step of getting the tag, and I do not know where the mistake in the code is.

Both correspondence_authors and t1 come back empty. Any advice on how I could improve my code to get the desired result would be appreciated.

library(httr)  # used to make HTTP GET and POST requests
library(rvest) # used to parse HTML
library(xml2)
library(tidyr) # used to remove NA
library(tidyverse)
article_year <- function(year){
  
}
str_1 <- "https://molecularbrain.biomedcentral.com/articles"
prefix_str_1 <- "https://molecularbrain.biomedcentral.com/"
doc <- httr::GET(str_1)
html <- read_html(content(doc, "text"))
#################### Title ####################
c_listing_title <- html_elements(html,"h3.c-listing__title")
a_element <- html_node(c_listing_title,"a")
a_href <- as.list(html_attr(a_element,"href"))
a_text <- lapply(a_element,html_text)

##################### 2 Page Depth #######################
merge_strings <- function(x){
  paste0(prefix_str_1,x)
}
sub_pages <- lapply(a_href,merge_strings)

########################Function_Read_Sub_Pages#####################

read_page_1 <- function(x){
  webpages <- httr::GET(x)
  html <- rvest::read_html(httr::content(webpages, "text"))
  return(html)
}

collection_html_sub_pages <- lapply(sub_pages,read_page_1)

##########################Correspondence_Author###################

correspondence_search <- function(x){
  rvest::html_node(x,"a#corresp-c1")
}
collection_html_sub_pages[[1]]
t1 <- rvest::html_element(collection_html_sub_pages[[1]],paste0('#corresp-c1'))
t2 <- rvest::html_elements(t1,"p")
correspondence_authors <- lapply(collection_html_sub_pages, correspondence_search)

I have used helper functions to structure my code and would like to keep using them, since they keep the code organized and make troubleshooting easier. Everything in the code above works except the part that extracts the corresponding author.


Solution

  • The article URLs you create are not valid paths on that web server. When you paste0() prefix_str_1 and a_href, the first ends with a / and the second starts with a /, so the resulting URLs look like this: https://molecularbrain.biomedcentral.com//articles/10.1186/s13041-023-01014-0; the correct URL would be https://molecularbrain.biomedcentral.com/articles/10.1186/s13041-023-01014-0 (no double / after the host name).

    The easiest fix is to define prefix_str_1 without a trailing /.

    prefix_str_1 <- "https://molecularbrain.biomedcentral.com"
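
    If you would rather not depend on how the prefix happens to be written, a small helper can normalise the slashes before joining. The following is only a sketch in base R: join_url is a hypothetical helper name (not part of httr or rvest), and the href value is an assumed example of what html_attr() returns for this site.

    ```r
    # Hypothetical helper (not from any package): join a base URL and a path
    # with exactly one "/" between them, whatever slashes each side carries.
    join_url <- function(base, path) {
      paste0(sub("/+$", "", base), "/", sub("^/+", "", path))
    }

    # Assumed example of an href scraped from the listing page.
    href <- "/articles/10.1186/s13041-023-01014-0"

    # Plain paste0() keeps both slashes.
    broken <- paste0("https://molecularbrain.biomedcentral.com/", href)
    broken
    #> [1] "https://molecularbrain.biomedcentral.com//articles/10.1186/s13041-023-01014-0"

    # The helper gives the same result with or without a trailing slash.
    fixed <- join_url("https://molecularbrain.biomedcentral.com/", href)
    fixed
    #> [1] "https://molecularbrain.biomedcentral.com/articles/10.1186/s13041-023-01014-0"
    ```

    Because sub() and paste0() are vectorised, the same helper can be applied to the whole vector of hrefs at once.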
    

    You can also significantly simplify your code.

    library(rvest) 
    
    base_url <- "https://molecularbrain.biomedcentral.com"
    
    index_html <- read_html(file.path(base_url, "articles"))
    
    # Title and Links ---------------------------------------------------------
    
    a_elements <- html_elements(index_html, "h3.c-listing__title a")
    a_href <- html_attr(a_elements, "href")
    a_text <- html_text(a_elements)
    
    # subpages ----------------------------------------------------------------
    
    html_sub_pages <- 
      lapply(paste0(base_url, a_href),
           read_html)
    
    # Correspondence Author ---------------------------------------------------
    
    lapply(html_sub_pages,
           html_elements,
           "#corresp-c1") |> 
      lapply(html_text)
    #> [[1]]
    #> [1] "Chao Qin"
    #> 
    #> [[2]]
    #> [1] "Won Do Heo"
    #> 
    #> [[3]]
    #> [1] "Seung-Jae Lee"
    #> ...
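
    Since you mentioned wanting to keep helper functions, the extraction step can also be wrapped in one. This is a minimal sketch, not the only way to do it: extract_corresponding_author is a hypothetical name, and the two pages are built in memory with rvest::minimal_html() as stand-ins for the real article pages, so the helper can be tried without any network request.

    ```r
    library(rvest)

    # Hypothetical helper: return the text of every element with
    # id "corresp-c1" on a parsed page (character(0) if there is none).
    extract_corresponding_author <- function(page) {
      page |>
        html_elements("#corresp-c1") |>
        html_text(trim = TRUE)
    }

    # Stand-in pages (assumptions for illustration, not real article HTML).
    demo_page <- minimal_html(
      '<p><a id="corresp-c1" href="mailto:FName@email.com"> FName LName</a></p>'
    )
    empty_page <- minimal_html("<p>No corresponding author here</p>")

    authors <- lapply(list(demo_page, empty_page), extract_corresponding_author)
    authors
    #> [[1]]
    #> [1] "FName LName"
    #> 
    #> [[2]]
    #> character(0)
    ```

    An empty character vector for a page is a useful troubleshooting signal: either the page really has no such element, or the request did not fetch the article you expected.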