Say that I use read_html_live() from the rvest package to pull some HTML that looks like this:
books <- minimal_html('
<div>
<div class="book">
<div class="booktitle">Book 1</div>
<div class="year">1999</div>
<div class="author">Author 1</div>
<div class="author">Author 2</div>
<div class="author">Author 3</div>
</div>
<div class="book">
<div class="booktitle">Book 2</div>
<div class="year">2022</div>
<div class="author">Author 4</div>
</div>
<div class="book">
<div class="booktitle">Book 3</div>
<div class="year">1845</div>
<div class="author">Author 5</div>
<div class="author">Author 6</div>
<div class="author">Author 7</div>
<div class="author">Author 8</div>
</div>
</div>')
I would like to use the rvest package to create a data frame (a tibble would also be fine) with the information contained above. I would like it to be organized at the author level, so each row will contain an author, the booktitle, and the year.
If I only cared about the first author, it would be easy. Something like:
data0 <- books %>% html_elements(".book")
title <- data0 %>% html_element(".booktitle") %>% html_text2()
year <- data0 %>% html_element(".year") %>% html_text2()
author1 <- data0 %>% html_element(".author") %>% html_text2()
data <- data.frame(title, year, author1)
However, I would actually like to extract all authors, the authors being "children" within each book. The data frame would then have eight rows, one for each author. For instance, row 8 would have Book 3, 1845, and Author 8. How can I do this?
Here is a rough idea, but I am looking for easier solutions:
data0 <- books %>% html_elements(".book")
title <- data0 %>% html_element(".booktitle") %>% html_text2()
year <- data0 %>% html_element(".year") %>% html_text2()
authors <- data0 %>% html_element(".author")
Then I would loop over the three elements of authors, save each of them to a data frame, associate each of these author data frames with the relevant title and year, and somehow reshape the result into a long data frame.
Here is one approach which uses lapply to loop over the book nodes:
library(rvest)
library(dplyr, warn.conflicts = FALSE)
books <- minimal_html('
<div>
<div class="book">
<div class="booktitle">Book 1</div>
<div class="year">1999</div>
<div class="author">Author 1</div>
<div class="author">Author 2</div>
<div class="author">Author 3</div>
</div>
<div class="book">
<div class="booktitle">Book 2</div>
<div class="year">2022</div>
<div class="author">Author 4</div>
</div>
<div class="book">
<div class="booktitle">Book 3</div>
<div class="year">1845</div>
<div class="author">Author 5</div>
<div class="author">Author 6</div>
<div class="author">Author 7</div>
<div class="author">Author 8</div>
</div>
</div>')
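data0 <- books |>
  html_elements(".book") |>
  lapply(\(x) {
    # html_element() returns one node per book, html_elements() returns every
    # matching author, so tibble() recycles title and year across the authors
    tibble(
      title = x |> html_element(".booktitle") |> html_text2(),
      year = x |> html_element(".year") |> html_text2(),
      authors = x |> html_elements(".author") |> html_text2()
    )
  }) |>
  bind_rows()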
data0
#> # A tibble: 8 × 3
#>   title  year  authors
#>   <chr>  <chr> <chr>
#> 1 Book 1 1999  Author 1
#> 2 Book 1 1999  Author 2
#> 3 Book 1 1999  Author 3
#> 4 Book 2 2022  Author 4
#> 5 Book 3 1845  Author 5
#> 6 Book 3 1845  Author 6
#> 7 Book 3 1845  Author 7
#> 8 Book 3 1845  Author 8
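If you would rather stay close to the rough idea in the question and avoid dplyr, a possible variation is to extract the per-book vectors once and then repeat each title and year by the number of authors in that book. This is only a sketch under the assumption that every book has exactly one .booktitle and one .year; the names book_nodes, n_authors and data_long are made up for the illustration, and the year is converted to integer here as an optional extra.

```r
library(rvest)

# Same sample document as in the question
books <- minimal_html('
<div>
  <div class="book">
    <div class="booktitle">Book 1</div>
    <div class="year">1999</div>
    <div class="author">Author 1</div>
    <div class="author">Author 2</div>
    <div class="author">Author 3</div>
  </div>
  <div class="book">
    <div class="booktitle">Book 2</div>
    <div class="year">2022</div>
    <div class="author">Author 4</div>
  </div>
  <div class="book">
    <div class="booktitle">Book 3</div>
    <div class="year">1845</div>
    <div class="author">Author 5</div>
    <div class="author">Author 6</div>
    <div class="author">Author 7</div>
    <div class="author">Author 8</div>
  </div>
</div>')

book_nodes <- html_elements(books, ".book")

# One value per book (html_element() keeps the one-to-one alignment)
title <- book_nodes |> html_element(".booktitle") |> html_text2()
year <- book_nodes |> html_element(".year") |> html_text2() |> as.integer()

# One character vector of authors per book
authors <- lapply(book_nodes, \(b) html_text2(html_elements(b, ".author")))
n_authors <- lengths(authors)

# Repeat each title and year once per author, then flatten the author list
data_long <- data.frame(
  title = rep(title, times = n_authors),
  year = rep(year, times = n_authors),
  author = unlist(authors)
)
```

For the sample above this gives eight rows, one per author. Note that a book without any .author element would be repeated zero times and so would drop out of the result.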