r rvest httr2

Trying to download pdfs in R


I am trying to get the links to PDFs from a site in R, but the rvest read_html() function just sits there, seemingly making no progress.

Here is my code:

# Load required libraries
library(tidyverse)
library(rvest)

# Define the URL
url <- "https://providers.anthem.com/new-york-provider/claims/reimbursement-policies/"

# Read and process the HTML
links <- try({
  read_html(url) %>%
    html_node(xpath = "/html/body/main/div/div/div/section[3]/div/section/div[1]/section/div[1]/div/div[2]/div/p/a") %>%
    html_attr("href") %>%
    as_tibble() %>%
    rename(url = value)
})

# Display the results with error handling
if(!inherits(links, "try-error")) {
  print(links)
} else {
  message("Unable to scrape the URL. This might be due to:")
  message("- Website requires authentication")
  message("- Website blocks automated scraping")
  message("- The XPath structure has changed")
  message("- Network connectivity issues")
}

Maybe I should do this via httr2?

Here is an error message from xml2::read_html():

>   read_html(url)
Error in `open.connection()`:
! cannot open the connection
    ▆
 1. ├─xml2::read_html(url)
 2. └─xml2:::read_html.default(url)
 3.   ├─base::suppressWarnings(...)
 4.   │ └─base::withCallingHandlers(...)
 5.   ├─xml2::read_xml(x, encoding = encoding, ..., as_html = TRUE, options = options)
 6.   └─xml2:::read_xml.character(...)
 7.     └─xml2:::read_xml.connection(...)
 8.       ├─base::open(x, "rb")
 9.       └─base::open.connection(x, "rb")

Solution

  • An alternative approach for this URL:

    Loading the page in Firefox with the dev console open, the network traffic shows a JSON download and its response content. [Screenshot: Firefox dev console, Network tab, with the JSON request and its response selected]

    The "Response" content looks promising (named "AllDocs"), and we can see one of the docs has a .pdf extension (there are other file types).

    We can extract that link into R and iterate over the docs. I'll copy the URL by right-clicking the JSON GET row and choosing "Copy URL", then switch to R:

    jsonurl <- "https://providers.anthem.com/sites/Satellite?d=Universal&pagename=getdocuments&brand=BCCNYE&state=&formslibrary=gpp_formslib"
    alldocs <- httr::GET(jsonurl) |> httr::content()
    alldocs2 <- data.frame(URI = unlist(lapply(alldocs[[1]], `[[`, "URI"))) |>
      transform(filename = sub(".*/(.*)\\?.*", "\\1", URI)) |>
      subset(grepl("pdf$", filename))
    head(alldocs2)
    #                                                                             URI                                             filename
    # 2    /docs/gpp/NY_ABC_CAID_RP_ProfessionalAnesthesiaServices.pdf?v=202401262130    NY_ABC_CAID_RP_ProfessionalAnesthesiaServices.pdf
    # 3                /docs/gpp/NY_ABC_CAID_RP_AssistantatSurgery.pdf?v=202406111406                NY_ABC_CAID_RP_AssistantatSurgery.pdf
    # 4              /docs/gpp/NY_ABC_CAID_BH_ConcurrentReviewForm.pdf?v=202312090644              NY_ABC_CAID_BH_ConcurrentReviewForm.pdf
    # 5       /docs/gpp/NY_ABC_CAID_BehavioralHealth_QuickRefGuide.pdf?v=202504091758       NY_ABC_CAID_BehavioralHealth_QuickRefGuide.pdf
    # 6 /docs/gpp/NY_ABC_CAID_BH_ICRPortalTrainingProviderBrochure.pdf?v=202312312000 NY_ABC_CAID_BH_ICRPortalTrainingProviderBrochure.pdf
    # 7               /docs/gpp/NY_ABC_CAID_NotificationofDelivery.pdf?v=202312151000               NY_ABC_CAID_NotificationofDelivery.pdf
    

    (There were .xls files as well, I'm assuming you don't want/need them. If you want to see everything, remove the subset() above.)
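
    Since the question asks about httr2, here is a rough httr2 equivalent of the request and filtering above. Treat it as a sketch: it assumes httr2 is installed and that the endpoint returns the same JSON shape the httr call produced (a list whose first element holds the documents, each with a "URI" field). The helper names fetch_docs() and docs_to_frame() are hypothetical, check_type = FALSE is only a guess in case the server does not label the response as JSON, and the filter is slightly stricter than the one above (literal ".pdf", case-insensitive).

    ```r
    # Hypothetical helper: fetch and parse the JSON document list with httr2.
    # Assumes the endpoint answers a plain GET, as it did with httr::GET().
    fetch_docs <- function(jsonurl) {
      httr2::request(jsonurl) |>
        httr2::req_perform() |>
        httr2::resp_body_json(check_type = FALSE)
    }

    # Hypothetical helper: turn the parsed list into a data frame of PDF links.
    # Assumes the documents sit in the first element, each with a "URI" field.
    docs_to_frame <- function(alldocs) {
      uri <- unlist(lapply(alldocs[[1]], `[[`, "URI"))
      out <- data.frame(URI = uri, stringsAsFactors = FALSE)
      # Keep the part between the last "/" and the "?" query string.
      out$filename <- sub(".*/(.*)\\?.*", "\\1", out$URI)
      # Keep only rows whose filename ends in ".pdf" (any case).
      out[grepl("\\.pdf$", out$filename, ignore.case = TRUE), , drop = FALSE]
    }
    ```

    With the jsonurl defined above, alldocs2 <- docs_to_frame(fetch_docs(jsonurl)) should give a frame like the one shown, provided those assumptions hold.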

    You can iterate over this frame however you want. One way (downloading just the top 3 here):

    ign <- Map(
      download.file,
      paste0("https://providers.anthem.com", alldocs2$URI[1:3]), 
      alldocs2$filename[1:3]
    )
    # trying URL 'https://providers.anthem.com/docs/gpp/NY_ABC_CAID_RP_ProfessionalAnesthesiaServices.pdf?v=202401262130'
    # Content type 'application/pdf' length 187139 bytes (182 KB)
    # ==================================================
    # downloaded 182 KB
    # trying URL 'https://providers.anthem.com/docs/gpp/NY_ABC_CAID_RP_AssistantatSurgery.pdf?v=202406111406'
    # downloaded 131 KB
    # trying URL 'https://providers.anthem.com/docs/gpp/NY_ABC_CAID_BH_ConcurrentReviewForm.pdf?v=202312090644'
    # downloaded 144 KB
    
    list.files(pattern = ".*pdf$")
    # [1] "NY_ABC_CAID_BH_ConcurrentReviewForm.pdf"           "NY_ABC_CAID_RP_AssistantatSurgery.pdf"             "NY_ABC_CAID_RP_ProfessionalAnesthesiaServices.pdf"
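
    If you do go on to fetch the whole list, a slightly more defensive loop may be worth having. The following is a sketch only and has not been run against the live site: download_docs() is a hypothetical helper, the "pdfs" folder and the one-second pause are arbitrary choices, and mode = "wb" is there because these URLs end in a query string rather than ".pdf", so on Windows download.file() would otherwise treat the transfer as text.

    ```r
    # Hypothetical helper: download each file into dest_dir, skipping files
    # that already exist and recording which downloads succeeded.
    download_docs <- function(uri, filename,
                              base = "https://providers.anthem.com",
                              dest_dir = "pdfs", pause = 1) {
      dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
      dest <- file.path(dest_dir, filename)
      ok <- logical(length(uri))
      for (i in seq_along(uri)) {
        if (file.exists(dest[i])) {
          # Already on disk: count it as done and do not hit the server again.
          ok[i] <- TRUE
          next
        }
        ok[i] <- tryCatch(
          # download.file() returns 0 on success; "wb" keeps binary files intact.
          download.file(paste0(base, uri[i]), dest[i],
                        mode = "wb", quiet = TRUE) == 0,
          error = function(e) FALSE
        )
        Sys.sleep(pause)  # be polite to the server between requests
      }
      data.frame(file = dest, ok = ok, stringsAsFactors = FALSE)
    }
    ```

    Usage would be along the lines of res <- download_docs(alldocs2$URI, alldocs2$filename), after which res$ok shows which files need a retry.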
    

    I did not attempt to download any more than that, so a few disclaimers: