This is a follow-up question to the one I asked earlier (How to create data frame from rvest scraped website, preserving nested structure of data) and the answer by @stefan. That answer works perfectly for that question.
But what if there are extra layers of nesting?
library(rvest)
library(dplyr, warn = FALSE)
books <- minimal_html('
<div class="entry">
<div class="collection">Collection 1</div>
<div class="book">
<div class="booktitle">Book 1</div>
<div class="year">1999</div>
<div class="author">
<div class="name">Author 1</div>
<div class="city">Austin</div>
</div>
<div class="author">
<div class="name">Author 2</div>
<div class="city">Dallas</div>
</div>
<div class="author">
<div class="name">Author 3</div>
<div class="city">Memphis</div>
</div>
</div>
<div class="book">
<div class="booktitle">Book 2</div>
<div class="year">2022</div>
<div class="author">
<div class="name">Author 4</div>
<div class="city">Houston</div>
</div>
</div>
</div>
<div class="entry">
<div class="collection">Collection 2</div>
<div class="book">
<div class="booktitle">Book 3</div>
<div class="year">1845</div>
<div class="author">
<div class="name">Author 5</div>
</div>
<div class="author">
<div class="name">Author 6</div>
<div class="city">Dayton</div>
</div>
<div class="author">
<div class="name">Author 7</div>
<div class="city">Philadelphia</div>
</div>
</div>
</div>')
As before, I would like the result to be at the author level, with each author's name and city on the same row. There is also an extra outer layer, `collection`: every author of a book in an `entry` should carry that `entry`'s collection. So there should be seven rows, and Author 7 should have these values: Collection 2, Book 3, 1845, Author 7, and Philadelphia.
Also note that each author has at most one name and at most one city (Author 5 has no city), and an `entry` always has exactly one `collection`.
How can I extend this code from the prior answer to get my desired solution?
data0 <- books |>
  html_elements(".book") |>
  lapply(\(x) {
    tibble(
      title = x |> html_element(".booktitle") |> html_text2(),
      year = x |> html_element(".year") |> html_text2(),
      authors = x |> html_elements(".author") |> html_text2()
    )
  }) |>
  bind_rows()
To build on the linked answer, you can mix in some XPath to reach the siblings of each book; here that is the `collection` element under the same parent.
`separate()` is indeed a nice shortcut in this case. As an alternative for handling author details, you could add another iterator, either inside the current `lapply()` or in the next processing step. I personally prefer the latter: here I first add the author node sets to the tibble as a list column while iterating over all books, then turn them into sub-tibbles in `mutate()`, and finally unnest.
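To see the XPath step in isolation, here is a minimal sketch on a cut-down document. The `doc`, `book_nodes`, `collections`, and `titles` names and the two-book HTML are made up for illustration and are not part of the question's data.

```r
library(rvest)

# Hypothetical, cut-down input: one entry holding one collection and two books
doc <- minimal_html('
  <div class="entry">
    <div class="collection">Collection 1</div>
    <div class="book"><div class="booktitle">Book 1</div></div>
    <div class="book"><div class="booktitle">Book 2</div></div>
  </div>')

book_nodes <- html_elements(doc, ".book")

# ".." steps from each book up to its parent entry;
# "div[@class='collection']" then selects the collection child of that parent
collections <- book_nodes |>
  html_element(xpath = "../div[@class='collection']") |>
  html_text2()

titles <- book_nodes |>
  html_element(".booktitle") |>
  html_text2()

data.frame(collection = collections, title = titles)
```

Both books pick up the collection of their shared parent. Note that `@class='collection'` compares the whole attribute value, so it would not match an element carrying several classes; that is an assumption that holds for the example markup only.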
library(rvest)
library(dplyr, warn = FALSE)

books |>
  html_elements(".book") |>
  lapply(\(x) {
    tibble(
      # the collection is a sibling of the book, so step up to the shared parent
      collection = x |> html_element(xpath = "../div[@class='collection']") |> html_text2(),
      title = x |> html_element(".booktitle") |> html_text2(),
      year = x |> html_element(".year") |> html_text2(),
      # keep the author nodes as a list column for now
      authors = x |> html_elements(".author") |> list()
    )
  }) |>
  bind_rows() |>
  # turn each author node set into a name/city tibble; a missing .city becomes NA
  mutate(authors = lapply(authors, \(x) tibble(
    name = html_element(x, ".name") |> html_text2(),
    city = html_element(x, ".city") |> html_text2()
  ))) |>
  tidyr::unnest(authors, names_sep = ".")
#> # A tibble: 7 × 5
#>   collection   title  year  authors.name authors.city
#>   <chr>        <chr>  <chr> <chr>        <chr>
#> 1 Collection 1 Book 1 1999  Author 1     Austin
#> 2 Collection 1 Book 1 1999  Author 2     Dallas
#> 3 Collection 1 Book 1 1999  Author 3     Memphis
#> 4 Collection 1 Book 2 2022  Author 4     Houston
#> 5 Collection 2 Book 3 1845  Author 5     <NA>
#> 6 Collection 2 Book 3 1845  Author 6     Dayton
#> 7 Collection 2 Book 3 1845  Author 7     Philadelphia
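For comparison, here is a rough sketch of the other option, resolving the author fields inside the `lapply()` itself. It is a variation rather than the approach above: instead of an explicit inner iterator it relies on `html_element()` returning one result per author node. The cut-down `doc` and the `result` name are made up for illustration.

```r
library(rvest)
library(dplyr, warn.conflicts = FALSE)

# Hypothetical, cut-down input; Author 2 deliberately has no city
doc <- minimal_html('
  <div class="entry">
    <div class="collection">Collection 1</div>
    <div class="book">
      <div class="booktitle">Book 1</div>
      <div class="year">1999</div>
      <div class="author"><div class="name">Author 1</div><div class="city">Austin</div></div>
      <div class="author"><div class="name">Author 2</div></div>
    </div>
  </div>')

result <- doc |>
  html_elements(".book") |>
  lapply(\(book) {
    authors <- html_elements(book, ".author")
    tibble(
      # book-level values have length 1 and are recycled across the author rows
      collection = book |> html_element(xpath = "../div[@class='collection']") |> html_text2(),
      title = book |> html_element(".booktitle") |> html_text2(),
      year = book |> html_element(".year") |> html_text2(),
      # one value per author; NA where the child element is missing
      name = authors |> html_element(".name") |> html_text2(),
      city = authors |> html_element(".city") |> html_text2()
    )
  }) |>
  bind_rows()

result
```

This skips the list column and the `unnest()` step and gives plain `name` and `city` columns. One trade-off to be aware of: a book without any `.author` children would contribute zero rows here, so this variant fits best when every book has at least one author.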
Example data:
books <- minimal_html('
<div class="entry">
<div class="collection">Collection 1</div>
<div class="book">
<div class="booktitle">Book 1</div>
<div class="year">1999</div>
<div class="author">
<div class="name">Author 1</div>
<div class="city">Austin</div>
</div>
<div class="author">
<div class="name">Author 2</div>
<div class="city">Dallas</div>
</div>
<div class="author">
<div class="name">Author 3</div>
<div class="city">Memphis</div>
</div>
</div>
<div class="book">
<div class="booktitle">Book 2</div>
<div class="year">2022</div>
<div class="author">
<div class="name">Author 4</div>
<div class="city">Houston</div>
</div>
</div>
</div>
<div class="entry">
<div class="collection">Collection 2</div>
<div class="book">
<div class="booktitle">Book 3</div>
<div class="year">1845</div>
<div class="author">
<div class="name">Author 5</div>
</div>
<div class="author">
<div class="name">Author 6</div>
<div class="city">Dayton</div>
</div>
<div class="author">
<div class="name">Author 7</div>
<div class="city">Philadelphia</div>
</div>
</div>
</div>')
Created on 2024-08-08 with reprex v2.1.1