I am generally familiar with rvest. I know the difference between html_elements() and html_element(). But I can't get my head around this problem:
Suppose that we have data like the one that is on this webpage. The data is in a hierarchical format and each header has a different number of subheadings.
When I scrape the page, I get 177 headers, but there are actually 270 subheadings. I want to extract the data into a tidy format, but because the vectors have different lengths, I can't easily combine them into a tibble.
Here is my code with some comments about the results:
library(rvest)
page <- read_html("https://postdocs.stanford.edu/about/department-postdoc-admins")
person_departments <- page %>%
  html_elements(".item-list") %>%
  html_element("h3") %>%
  html_text2()
# The above code returns 177 headers (one per department)
person_names <- page %>%
  html_elements(".item-list li") %>%
  html_element("h4") %>%
  html_text2()
# This one returns 270 names (some departments have more than 1 admin)
# Using the code above, I can't get a nice table with two columns, one for the name and one for the person's department.
Here's a somewhat different approach: start with .item-list > ul > li elements; from there you can directly extract name, phone and email. If any of those elements is not available in the current li element, you conveniently get an NA value in the resulting frame (e.g. for Vacant positions and for cases where only Email is listed).
For h3, switch to xpath, as it allows searching for ancestors; this way you can get the matching h3 for every li in the element list.
library(dplyr, warn.conflicts = FALSE)
library(rvest)
library(purrr)
page <- read_html("https://postdocs.stanford.edu/about/department-postdoc-admins")
page |>
  html_elements(".item-list > ul > li") |>
  map(\(li)
    list(
      h3 = html_element(li, xpath = "./ancestor::div[@class='item-list']/h3"),
      h4 = html_element(li, "h4"),
      phone = html_element(li, ".views-field-field-s-person-phone-display"),
      email = html_element(li, ".views-field-field-s-person-email > a")
    ) |> map(html_text, trim = TRUE)
  ) |>
  bind_rows() |>
  # just to keep complete set of contact details from appearing in a SO post
  mutate(across(h4:email, \(x) stringr::str_trunc(x, 13, side = "center")))
#> # A tibble: 270 × 4
#> h3 h4 phone email
#> <chr> <chr> <chr> <chr>
#> 1 Advanced Residency Training at Stanford Sofia...zales Phone...-9139 sofia...…
#> 2 Aeronautics and Astronautics Jenny Scholes Phone...-5967 jscho...…
#> 3 African & African-Amer Studies Ashan...hnson Phone...-3969 ashan...…
#> 4 Anesthes, Periop & Pain Med Natal...brera Phone...-0648 ndarl...…
#> 5 Anesthes, Periop & Pain Med Ashle...hnson Phone...-7212 ashle...…
#> 6 Anesthes, Periop & Pain Med Jessi...tinez Phone...-8189 jimen...…
#> 7 Anthropology Julia...itler Phone...-0800 js259...…
#> 8 Archaeology Bilge...dogan Phone...-5731 bilge...…
#> 9 Bill Lane Center for the American West Vacant <NA> <NA>
#> 10 Biochemistry Dan Carino Phone...-6161 dpcar...…
#> # ℹ 260 more rows
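To see the two key behaviours of this approach in isolation (the ancestor lookup and the NA for a missing child), here is a minimal offline sketch. The markup passed to minimal_html() and the .phone class are made up for illustration; they only mimic the structure assumed above and are not the real page.

```r
library(rvest)

# Hypothetical markup: one div.item-list per department, one li per admin.
toy <- minimal_html('
  <div class="item-list">
    <h3>Dept A</h3>
    <ul>
      <li><h4>Ann</h4><div class="phone">111</div></li>
      <li><h4>Bob</h4></li>
    </ul>
  </div>
  <div class="item-list">
    <h3>Dept B</h3>
    <ul><li><h4>Vacant</h4></li></ul>
  </div>
')

# Pick the second li (Bob), which has no phone element.
bob <- html_elements(toy, ".item-list > ul > li")[[2]]

# Walk up from the li to its enclosing div.item-list, then down to its h3.
dept <- bob |>
  html_element(xpath = "./ancestor::div[@class='item-list']/h3") |>
  html_text(trim = TRUE)

# No match inside this li, so html_element() returns a missing node
# and html_text() turns it into NA instead of dropping it.
phone <- bob |>
  html_element(".phone") |>
  html_text(trim = TRUE)

dept   # "Dept A"
phone  # NA
```

Because a missing match yields NA rather than a shorter result, every li contributes exactly one value per column, which is what keeps the columns aligned.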
In a way it's a matter of style preference (well.. performance too), but we can also use those same building blocks without map() / lapply() iteration. Using the singular html_element() in the following context might not sound too intuitive, but there's this bit in the docs:
html_element() returns a nodeset the same length as the input
Must say I personally managed to ignore this until today..
Anyway, we can get a result identical to the previous example with:
li <- html_elements(page, ".item-list > ul > li")
tibble(
  h3 = html_element(li, xpath = "./ancestor::div[@class='item-list']/h3") |> html_text(trim = TRUE),
  h4 = html_element(li, "h4") |> html_text(trim = TRUE),
  phone = html_element(li, ".views-field-field-s-person-phone-display") |> html_text(trim = TRUE),
  email = html_element(li, ".views-field-field-s-person-email > a") |> html_text(trim = TRUE)
)
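The "same length as the input" behaviour can be checked on a small made-up document. This is only a sketch: the markup and the .phone class are hypothetical stand-ins for the real page, and a base data.frame is used instead of a tibble to keep it dependency-free.

```r
library(rvest)

# Hypothetical markup: two departments, three admins, one phone number.
toy_page <- minimal_html('
  <div class="item-list">
    <h3>Dept A</h3>
    <ul>
      <li><h4>Ann</h4><div class="phone">111</div></li>
      <li><h4>Bob</h4></li>
    </ul>
  </div>
  <div class="item-list">
    <h3>Dept B</h3>
    <ul><li><h4>Vacant</h4></li></ul>
  </div>
')

toy_li <- html_elements(toy_page, ".item-list > ul > li")

# html_element() on a nodeset returns one node (possibly missing) per input
# node, so each column below has exactly length(toy_li) values.
toy_phone <- html_element(toy_li, ".phone")

toy_df <- data.frame(
  h3 = html_element(toy_li, xpath = "./ancestor::div[@class='item-list']/h3") |>
    html_text(trim = TRUE),
  h4 = html_element(toy_li, "h4") |> html_text(trim = TRUE),
  phone = html_text(toy_phone, trim = TRUE)
)

length(toy_li)     # 3
length(toy_phone)  # 3, even though only one li has a phone
toy_df
```

In contrast, html_elements(toy_li, ".phone") would return only the nodes that exist, so it would have length 1 here, which is the mismatch described in the question.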
library(dplyr)
library(rvest)
library(purrr)
page <- read_html("https://postdocs.stanford.edu/about/department-postdoc-admins")
# Benchmark: per-li iteration with map() (map_li) vs. vectorised html_element() calls (vec_li)
bm <- bench::mark(
  map_li = page |>
    html_elements(".item-list > ul > li") |>
    map(\(li)
      list(
        h3 = html_element(li, xpath = "./ancestor::div[@class='item-list']/h3"),
        h4 = html_element(li, "h4"),
        phone = html_element(li, ".views-field-field-s-person-phone-display"),
        email = html_element(li, ".views-field-field-s-person-email > a")
      ) |> map(html_text, trim = TRUE)
    ) |>
    bind_rows(),
  vec_li = {
    li <- html_elements(page, ".item-list > ul > li")
    tibble(
      h3 = html_element(li, xpath = "./ancestor::div[@class='item-list']/h3") |> html_text(trim = TRUE),
      h4 = html_element(li, "h4") |> html_text(trim = TRUE),
      phone = html_element(li, ".views-field-field-s-person-phone-display") |> html_text(trim = TRUE),
      email = html_element(li, ".views-field-field-s-person-email > a") |> html_text(trim = TRUE)
    )
  }, iterations = 10
)
bm
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 map_li 1.16s 1.19s 0.842 16.5MB 8.50
#> 2 vec_li 33.63ms 34.95ms 27.5 227.2KB 5.50
ggplot2::autoplot(bm)