Tags: r, web-scraping, tidyverse, rvest

Create object derived from scraped data with multiple layers of nesting


This is a follow-up to a question I asked earlier ("How to create data frame from rvest scraped website, preserving nested structure of data") and to the answer by @stefan, which works perfectly for that question.

But what if there are extra layers of nesting?

library(rvest)
library(dplyr, warn.conflicts = FALSE)
books <- minimal_html('
  <div class="entry">
        <div class="collection">Collection 1</div>
        <div class="book">
          <div class="booktitle">Book 1</div>
          <div class="year">1999</div>
          <div class="author">
            <div class="name">Author 1</div>
            <div class="city">Austin</div>
          </div>  
          <div class="author">
            <div class="name">Author 2</div>
            <div class="city">Dallas</div> 
          </div>  
          <div class="author">
            <div class="name">Author 3</div>
            <div class="city">Memphis</div>  
          </div>  
        </div>
        <div class="book">
          <div class="booktitle">Book 2</div>
          <div class="year">2022</div>
          <div class="author">
            <div class="name">Author 4</div>
            <div class="city">Houston</div>
          </div>  
        </div>
  </div>
  <div class="entry">  
        <div class="collection">Collection 2</div>
        <div class="book">
          <div class="booktitle">Book 3</div>
          <div class="year">1845</div>
          <div class="author">
            <div class="name">Author 5</div> 
          </div>  
          <div class="author">
            <div class="name">Author 6</div>
            <div class="city">Dayton</div>
          </div>  
          <div class="author">
            <div class="name">Author 7</div>
            <div class="city">Philadelphia</div>  
          </div>  
        </div>
  </div>')

As before, I would like one row per author, but with each author's name and city on the same row. There is also an extra outer layer, collection: every author of a book in an entry should carry that entry's collection. So there should be seven rows, and Author 7 should have these values: Collection 2, Book 3, 1845, Author 7, and Philadelphia.

Also note that each author has at most one name and at most one city (Author 5 has no city), and an entry always has exactly one collection.

How can I extend this code from the prior answer to get my desired solution?

data0 <- books |>
    html_elements(".book") |>
    lapply(\(x) {
        tibble(
            title = x |> html_element(".booktitle") |> html_text2(),
            year = x |> html_element(".year") |> html_text2(),
            authors = x |> html_elements(".author") |> html_text2(),
        )
    }) |>
    bind_rows()

Solution

  • To build on the linked answer, you can mix in some XPath to reach siblings of each book; here that is the collection element under the same parent.

    separate() is indeed a nice shortcut in this case. As an alternative for handling author details, you could add another iterator, either inside the current lapply() or in the next processing step. I personally prefer the latter: I first add each book's author node list to the tibble while iterating over all books, then turn those into sub-tibbles in mutate(), and finally unnest.

    library(rvest)
    library(dplyr, warn.conflicts = FALSE)
    
    books |> 
      html_elements(".book") |>
      lapply(\(x) {
        tibble(
          # step up to the parent entry, then down to its collection div
          collection = x |> html_element(xpath = "../div[@class='collection']") |> html_text2(),
          title = x |> html_element(".booktitle") |> html_text2(),
          year = x |> html_element(".year") |> html_text2(),
          # keep this book's author nodes as a list-column for now
          authors = x |> html_elements(".author") |> list()
        )
      }) |>
      bind_rows() |> 
      # turn each author node set into a tibble with one row per author
      mutate(authors = lapply(authors, \(x) tibble(name = html_element(x, ".name") |> html_text2(),
                                                   city = html_element(x, ".city") |> html_text2())))  |> 
      # expand to one row per author; columns become authors.name and authors.city
      tidyr::unnest(authors, names_sep = ".")
    #> # A tibble: 7 × 5
    #>   collection   title  year  authors.name authors.city
    #>   <chr>        <chr>  <chr> <chr>        <chr>       
    #> 1 Collection 1 Book 1 1999  Author 1     Austin      
    #> 2 Collection 1 Book 1 1999  Author 2     Dallas      
    #> 3 Collection 1 Book 1 1999  Author 3     Memphis     
    #> 4 Collection 1 Book 2 2022  Author 4     Houston     
    #> 5 Collection 2 Book 3 1845  Author 5     <NA>        
    #> 6 Collection 2 Book 3 1845  Author 6     Dayton      
    #> 7 Collection 2 Book 3 1845  Author 7     Philadelphia
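
    For comparison, here is a minimal sketch of the other option mentioned above: handling the author details inside the same lapply() instead of in a later mutate() step. It relies on html_element() returning one result per input node when given a node set, so a missing city comes back as NA. The object names doc and result, and the shortened HTML, are stand-ins for illustration and are not part of the original reprex.

    ```r
    library(rvest)
    library(dplyr, warn.conflicts = FALSE)

    # Shortened stand-in for the question's data: one entry, one book,
    # two authors, the second of which has no city.
    doc <- minimal_html('
      <div class="entry">
        <div class="collection">Collection 1</div>
        <div class="book">
          <div class="booktitle">Book 1</div>
          <div class="year">1999</div>
          <div class="author">
            <div class="name">Author 1</div>
            <div class="city">Austin</div>
          </div>
          <div class="author">
            <div class="name">Author 2</div>
          </div>
        </div>
      </div>')

    result <- doc |>
      html_elements(".book") |>
      lapply(\(book) {
        # all author nodes of the current book
        authors <- html_elements(book, ".author")
        tibble(
          # book-level values have length one and are recycled per author
          collection = book |> html_element(xpath = "../div[@class='collection']") |> html_text2(),
          title = book |> html_element(".booktitle") |> html_text2(),
          year = book |> html_element(".year") |> html_text2(),
          # one value per author node; NA where the child div is absent
          name = authors |> html_element(".name") |> html_text2(),
          city = authors |> html_element(".city") |> html_text2()
        )
      }) |>
      bind_rows()

    print(result)
    ```

    This variant skips the list-column and unnest() step, so the columns are plainly named name and city. The trade-off is that the book-level and author-level extraction are mixed in a single function, which may be harder to extend if more nesting is added.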
    

    Example data:

    books <- minimal_html('
      <div class="entry">
            <div class="collection">Collection 1</div>
            <div class="book">
              <div class="booktitle">Book 1</div>
              <div class="year">1999</div>
              <div class="author">
                <div class="name">Author 1</div>
                <div class="city">Austin</div>
              </div>  
              <div class="author">
                <div class="name">Author 2</div>
                <div class="city">Dallas</div> 
              </div>  
              <div class="author">
                <div class="name">Author 3</div>
                <div class="city">Memphis</div>  
              </div>  
            </div>
            <div class="book">
              <div class="booktitle">Book 2</div>
              <div class="year">2022</div>
              <div class="author">
                <div class="name">Author 4</div>
                <div class="city">Houston</div>
              </div>  
            </div>
      </div>
      <div class="entry">  
            <div class="collection">Collection 2</div>
            <div class="book">
              <div class="booktitle">Book 3</div>
              <div class="year">1845</div>
              <div class="author">
                <div class="name">Author 5</div> 
              </div>  
              <div class="author">
                <div class="name">Author 6</div>
                <div class="city">Dayton</div>
              </div>  
              <div class="author">
                <div class="name">Author 7</div>
                <div class="city">Philadelphia</div>  
              </div>  
            </div>
      </div>')
    

    Created on 2024-08-08 with reprex v2.1.1