With rvest, how to select only divs whose class exactly matches a given value


Say that I have scraped code like the following:

library(rvest)
library(dplyr)

test <- minimal_html('
  <div class="entry">
        <div class="book">
          <div class="booktitle">Book 1</div>
          <div class="year">1991</div>
        </div>
        <div class="book dont-use">
          <div class="booktitle">Book 2</div>
          <div class="year">1979</div>
        </div>
        <div class="book">
          <div class="booktitle">Book 3</div>
          <div class="year">1399</div>
        </div>
        <div class="book dont-use">
          <div class="booktitle">Book 4</div>
          <div class="year">1949</div>
        </div>        
  </div>')

To select everything that contains book in its class, I can use:

test %>% html_elements(".book")

This returns all four objects.

However, I do not want to select the second and fourth entries, whose class is book dont-use. How can I select only the first and third entries instead? In other words, how can I modify the code to match only elements whose class is exactly book?


Solution

  • You can use an attribute value selector, which matches only elements whose class attribute is exactly the string book:

    library(rvest)
    
    test |> 
      html_elements("[class='book']")
    #> {xml_nodeset (2)}
    #> [1] <div class="book">\n          <div class="booktitle">Book 1</div>\n       ...
    #> [2] <div class="book">\n          <div class="booktitle">Book 3</div>\n       ...
    

    Created on 2024-08-08 with reprex v2.1.1
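
One caveat with the exact-match selector: it compares the whole class attribute string, so an element with class="book featured" or with stray whitespace in the attribute would be skipped as well. If the real goal is "has book but not dont-use", a negation selector may be more robust. The following is only a sketch, assuming dont-use is the one class you want to exclude; the HTML is a shortened copy of the question's example, and kept and titles are names introduced here for illustration.

```r
library(rvest)

# Shortened version of the question's example (titles only)
test <- minimal_html('
  <div class="entry">
    <div class="book"><div class="booktitle">Book 1</div></div>
    <div class="book dont-use"><div class="booktitle">Book 2</div></div>
    <div class="book"><div class="booktitle">Book 3</div></div>
    <div class="book dont-use"><div class="booktitle">Book 4</div></div>
  </div>')

# Keep elements that have the class "book" but not the class "dont-use"
kept <- test |>
  html_elements(".book:not(.dont-use)")

# Pull the titles out of the elements that were kept
titles <- kept |>
  html_elements(".booktitle") |>
  html_text2()

titles
```

On this example both selectors pick the same two divs. They would differ only if a wanted element carried an additional class, in which case [class='book'] would drop it and .book:not(.dont-use) would keep it.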