Tags: r, web-scraping, tidyverse, rvest

How to scrape hierarchical web data into tabular format using rvest?


I am generally familiar with rvest. I know the difference between html_elements() and html_element(). But I can't get my head around this problem:

Suppose we have data like that on the webpage used in the code below. The data is hierarchical, and each header has a different number of subheadings.

When I scrape the page, I get 177 headers, but there are actually 270 subheadings. I want to extract the data into a tidy format, but because the vectors have different lengths, I can't easily combine them into a tibble.

Here is my code with some comments about the results:

library(rvest)

page <- read_html("https://postdocs.stanford.edu/about/department-postdoc-admins")

person_departments <- page %>% 
    html_elements(".item-list") %>% 
    html_element("h3") %>% 
    html_text2()
# The above code returns 177 headers (one per department)

person_names <- page %>% 
  html_elements(".item-list li") %>% 
  html_element("h4") %>% 
  html_text2()
# This one returns 270 names (some departments have more than 1 admin)

# Using the code above, I can't build a table with two columns: one for the person's name and one for their department.


Solution

  • Here's a somewhat different approach: start with the .item-list > ul > li elements; from there you can directly extract the name, phone and email. If any of those elements is missing from the current li element, you conveniently get an NA value in the resulting frame (e.g. for Vacant positions and for cases where only an email is listed).

    For h3, switch to XPath, as it allows searching for ancestors; this way you can get the matching h3 for every li in the element list.

    library(dplyr, warn.conflicts = FALSE)
    library(rvest)
    library(purrr)
    
    page <- read_html("https://postdocs.stanford.edu/about/department-postdoc-admins")
    
    page |> 
      html_elements(".item-list > ul > li") |> 
      map(\(li) 
          list(
            h3    = html_element(li, xpath = "./ancestor::div[@class='item-list']/h3"),
            h4    = html_element(li, "h4"),
            phone = html_element(li, ".views-field-field-s-person-phone-display"),
            email = html_element(li, ".views-field-field-s-person-email > a")
          ) |> map(html_text, trim = TRUE)
      ) |> 
      bind_rows() |> 
      # truncate values just to keep the complete contact details out of an SO post
      mutate(across(h4:email, \(x) stringr::str_trunc(x, 13, side = "center")))
    #> # A tibble: 270 × 4
    #>    h3                                      h4            phone         email    
    #>    <chr>                                   <chr>         <chr>         <chr>    
    #>  1 Advanced Residency Training at Stanford Sofia...zales Phone...-9139 sofia...…
    #>  2 Aeronautics and Astronautics            Jenny Scholes Phone...-5967 jscho...…
    #>  3 African & African-Amer Studies          Ashan...hnson Phone...-3969 ashan...…
    #>  4 Anesthes, Periop & Pain Med             Natal...brera Phone...-0648 ndarl...…
    #>  5 Anesthes, Periop & Pain Med             Ashle...hnson Phone...-7212 ashle...…
    #>  6 Anesthes, Periop & Pain Med             Jessi...tinez Phone...-8189 jimen...…
    #>  7 Anthropology                            Julia...itler Phone...-0800 js259...…
    #>  8 Archaeology                             Bilge...dogan Phone...-5731 bilge...…
    #>  9 Bill Lane Center for the American West  Vacant        <NA>          <NA>     
    #> 10 Biochemistry                            Dan Carino    Phone...-6161 dpcar...…
    #> # ℹ 260 more rows
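
    To try the ancestor lookup without hitting the live site, here is a minimal sketch on an inline fragment. The HTML, the department and person names, and the toy / toy_tbl objects are made up for illustration and only imitate the assumed layout (one h3 per .item-list, one li per person); the real page has more fields.

    ```r
    library(rvest)
    library(purrr)

    # Hypothetical fragment imitating the page layout (not the real page)
    toy <- minimal_html('
      <div class="item-list">
        <h3>Dept A</h3>
        <ul>
          <li><h4>Ann</h4></li>
          <li><h4>Bob</h4></li>
        </ul>
      </div>
      <div class="item-list">
        <h3>Dept B</h3>
        <ul>
          <li><h4>Vacant</h4></li>
        </ul>
      </div>
    ')

    toy_tbl <- toy |>
      html_elements(".item-list > ul > li") |>
      map(\(li) {
        list(
          # climb from the li to its enclosing div.item-list, then take its h3
          h3 = html_element(li, xpath = "./ancestor::div[@class='item-list']/h3"),
          h4 = html_element(li, "h4")
        ) |> map(html_text, trim = TRUE)
      }) |>
      dplyr::bind_rows()

    toy_tbl
    ```

    Each li yields one row, so "Dept A" is repeated for both of its people, which is the two-column shape asked for in the question.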
    

    In a way it's a matter of style preference (well, performance too), but we can also use the same building blocks without map() / lapply() iteration. Using the singular html_element() in the following context might not seem intuitive, but there's this bit in the documentation:

    html_element() returns a nodeset the same length as the input

    I must say I personally managed to overlook this until today.

    Anyway, we can get the same result as in the previous example (minus the truncation step) with:

    li <- html_elements(page, ".item-list > ul > li")
    tibble(
      h3 = html_element(li, xpath = "./ancestor::div[@class='item-list']/h3") |> html_text(trim = TRUE),
      h4 = html_element(li, "h4") |> html_text(trim = TRUE),
      phone = html_element(li, ".views-field-field-s-person-phone-display") |> html_text(trim = TRUE),
      email = html_element(li, ".views-field-field-s-person-email > a") |> html_text(trim = TRUE)
    )
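
    The length-preserving behaviour quoted above is what keeps the columns aligned. A small sketch on a made-up fragment (the names and the example.org address are placeholders, not data from the page) contrasts the two functions when one li has no link:

    ```r
    library(rvest)

    # Hypothetical fragment: the second li has no <a> element
    toy <- minimal_html('
      <ul>
        <li><h4>Ann</h4><a href="mailto:ann@example.org">ann@example.org</a></li>
        <li><h4>Vacant</h4></li>
      </ul>
    ')
    li <- html_elements(toy, "li")

    one_per_li   <- html_element(li, "a")   # one slot per li, missing ones included
    matches_only <- html_elements(li, "a")  # only the links that actually exist

    length(one_per_li)
    length(matches_only)

    # Missing nodes become NA, so this vector lines up with the names
    email <- html_text(one_per_li, trim = TRUE)
    name  <- html_text(html_element(li, "h4"), trim = TRUE)
    data.frame(name, email)
    ```

    With html_elements() the email vector would be shorter than the name vector, which is the same kind of length mismatch described in the question.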
    

    Benchmark (note seconds vs. milliseconds):

    library(dplyr)
    library(rvest)
    library(purrr)
    
    page <- read_html("https://postdocs.stanford.edu/about/department-postdoc-admins")
    bm <- bench::mark(
      map_li = page |> 
        html_elements(".item-list > ul > li") |> 
        map(\(li) 
            list(
              h3    = html_element(li, xpath = "./ancestor::div[@class='item-list']/h3"),
              h4    = html_element(li, "h4"),
              phone = html_element(li, ".views-field-field-s-person-phone-display"),
              email = html_element(li, ".views-field-field-s-person-email > a")
            ) |> map(html_text, trim = TRUE)
        ) |> 
        bind_rows(),
      vec_li = {
        li <- html_elements(page, ".item-list > ul > li")
        tibble(
          h3 = html_element(li, xpath = "./ancestor::div[@class='item-list']/h3") |> html_text(trim = TRUE),
          h4 = html_element(li, "h4") |> html_text(trim = TRUE),
          phone = html_element(li, ".views-field-field-s-person-phone-display") |> html_text(trim = TRUE),
          email = html_element(li, ".views-field-field-s-person-email > a") |> html_text(trim = TRUE)
        )
      }, iterations = 10
    )
    bm  
    #> # A tibble: 2 × 6
    #>   expression      min   median `itr/sec` mem_alloc `gc/sec`
    #>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
    #> 1 map_li        1.16s    1.19s     0.842    16.5MB     8.50
    #> 2 vec_li      33.63ms  34.95ms    27.5     227.2KB     5.50
    ggplot2::autoplot(bm)