Say that I have scraped HTML like the following:
library(rvest)
library(dplyr)
test <- minimal_html('
<div class="entry">
<div class="book">
<div class="booktitle">Book 1</div>
<div class="year">1991</div>
</div>
<div class="book dont-use">
<div class="booktitle">Book 2</div>
<div class="year">1979</div>
</div>
<div class="book">
<div class="booktitle">Book 3</div>
<div class="year">1399</div>
</div>
<div class="book dont-use">
<div class="booktitle">Book 4</div>
<div class="year">1949</div>
</div>
</div>')
To select everything that contains book in its class, I can use:
test %>% html_elements(".book")
This returns all four objects.
However, I do not want to select the second and fourth entries, whose class is book dont-use. How can I instead select only the first and third entries? In other words, how can I modify the code to select only the elements whose class is exactly book?
You can use an attribute value selector, which matches only elements whose class attribute is exactly book:
library(rvest)
test |>
html_elements("[class='book']")
#> {xml_nodeset (2)}
#> [1] <div class="book">\n <div class="booktitle">Book 1</div>\n ...
#> [2] <div class="book">\n <div class="booktitle">Book 3</div>\n ...
Created on 2024-08-08 with reprex v2.1.1
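One caveat with the exact match is that it compares the whole class attribute string, so an entry written as class="book featured" would be skipped too. If the real goal is "has book but not dont-use", a negation selector may be a closer fit. The sketch below is only an illustration: it rebuilds the question's HTML in compact form, and the names exact and negated are made up for the comparison.

```r
library(rvest)

# The question's four entries, compacted (years omitted for brevity).
test <- minimal_html('
<div class="entry">
  <div class="book"><div class="booktitle">Book 1</div></div>
  <div class="book dont-use"><div class="booktitle">Book 2</div></div>
  <div class="book"><div class="booktitle">Book 3</div></div>
  <div class="book dont-use"><div class="booktitle">Book 4</div></div>
</div>')

# Exact match: the class attribute must be the literal string "book".
exact <- test |> html_elements("[class='book']")

# Negation: the element has class "book" and lacks class "dont-use".
negated <- test |> html_elements(".book:not(.dont-use)")

# Pull the titles to compare the two selections.
exact |> html_element(".booktitle") |> html_text2()
#> [1] "Book 1" "Book 3"
negated |> html_element(".booktitle") |> html_text2()
#> [1] "Book 1" "Book 3"
```

On this input both selectors return the same two entries; they would only differ if some entries carried extra classes other than dont-use.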