
How do I properly perform an XPath text search with the R package rvest? (It doesn't seem to work currently.)


I'm learning rvest and would like to use it to query informational websites to determine whether they contain certain information and, if so, extract it. For instance, take the U.S. CDC main website:

https://www.cdc.gov

I'd like to find out what further information it has on "Outbreaks". The main page has an "Outbreaks" button, and I would like to extract information from that button. I inspected the page in Chrome DevTools and saw that the button contains text with the value "Outbreaks".

In Chrome DevTools, I can press Ctrl-F and enter an XPath query using the text I'm looking for in a button on the CDC website (e.g. "Outbreaks"), like so:

//a[contains(text(),"Outbreaks")]

Chrome DevTools then locates the node corresponding to the "Outbreaks" button link that I am looking for.

I expected the following R script with rvest to accomplish the same thing:

library(rvest)
library(dplyr)
test <- read_html_live("https://www.cdc.gov")
test %>% html_elements(xpath= "//a[contains(text(),'Outbreaks')]")

However, instead it yields no results:

{xml_nodeset (0)}

I'm therefore not sure whether something is wrong with my rvest syntax or whether rvest cannot do such text searches. My understanding is that the read_html_live function should be able to scrape even dynamically generated websites using XPath. Thank you.


Solution

  • As weird as it sounds, try swapping the quotes so that your XPath expression reads:

    '//a[contains(text(),"Outbreaks")]'
    

    It's probably chromote-related, but I'm afraid I don't have a good explanation for this. I only know that it is reproducible and that you are not the first one hit by it.

    library(rvest)
    library(dplyr)
    
    test <- read_html_live("https://www.cdc.gov")
    
    # single-quotes in xpath, no match:
    html_elements(test, xpath = "//a[contains(text(),'Outbreaks')]")
    #> {xml_nodeset (0)}
    
    # swap quotes, " <-> '
    html_elements(test, xpath = '//a[contains(text(), "Outbreaks")]')
    #> {xml_nodeset (3)}
    #> [1] <a href="https://www.cdc.gov/outbreaks/index.html">Outbreaks</a>
    #> [2] <a href="https://www.cdc.gov/outbreaks/index.html" class="btn btn-outline ...
    #> [3] <a href="https://www.cdc.gov/outbreaks/index.html" class="btn btn-outline ...
    
    test$session$Browser$getVersion() |> str()
    #> List of 5
    #>  $ protocolVersion: chr "1.3"
    #>  $ product        : chr "HeadlessChrome/131.0.6778.109"
    #>  $ revision       : chr "@0cd6f1c484c0cf07827aa6482a5e3cdf50395669"
    #>  $ userAgent      : chr "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/131.0.6778.109 Safari/537.36"
    #>  $ jsVersion      : chr "13.1.201.15"
    sessioninfo::package_info() |>
      filter(package %in% c("rvest", "chromote"))
    #>  ! package  * version date (UTC) lib source
    #>  P chromote   0.3.1   2024-08-30 [?] RSPM
    #>  P rvest    * 1.0.4   2024-02-12 [?] RSPM
    

    Created on 2024-12-10 with reprex v2.1.1
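
As a sanity check that the XPath expression itself is fine either way, here is a minimal sketch that runs both quote styles against a small static document parsed with rvest's regular (non-live) parser. The HTML below is a hypothetical stand-in, not the real CDC markup, and the last step is only an assumed fallback: it matches on the extracted link text in R, so no string literal has to be embedded in the query at all.

```r
library(rvest)

# Hypothetical stand-in markup for illustration; not the real CDC page.
page <- minimal_html('
  <a href="/outbreaks/index.html">Outbreaks</a>
  <a href="/about/index.html">About CDC</a>
')

# Same XPath with both quoting styles, evaluated by the static parser.
single_quoted <- html_elements(page, xpath = "//a[contains(text(),'Outbreaks')]")
double_quoted <- html_elements(page, xpath = '//a[contains(text(),"Outbreaks")]')

length(single_quoted)
length(double_quoted)

# Quote-free fallback: select all links, then filter on their text in R.
links <- html_elements(page, "a")
outbreak_links <- links[grepl("Outbreaks", html_text2(links), fixed = TRUE)]
html_attr(outbreak_links, "href")
```

On a static document both queries should select the same single link, which suggests the quote sensitivity comes from the live session path rather than from XPath. The text-filter fallback will probably carry over to a read_html_live() page, since html_elements() returns an ordinary node set there as well, but I have not verified that against the CDC site.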