Trying to download pdfs in R

I am trying to get the links to PDFs from a site in R, but the rvest read_html() function just sits there, seemingly making no progress.

Here is my code:

# Load required libraries
library(tidyverse)
library(rvest)

# Define the URL
url <- "https://providers.anthem.com/new-york-provider/claims/reimbursement-policies/"

# Read and process the HTML
links <- try({
  read_html(url) %>%
    html_node(xpath = "/html/body/main/div/div/div/section[3]/div/section/div[1]/section/div[1]/div/div[2]/div/p/a") %>%
    html_attr("href") %>%
    as_tibble() %>%
    rename(url = value)
})

# Display the results with error handling
if(!inherits(links, "try-error")) {
  print(links)
} else {
  message("Unable to scrape the URL. This might be due to:")
  message("- Website requires authentication")
  message("- Website blocks automated scraping")
  message("- The XPath structure has changed")
  message("- Network connectivity issues")
}

Maybe I should do this via httr2?

Here is an error message from xml2::read_html():

>   read_html(url)
Error in `open.connection()`:
! cannot open the connection
    ▆
 1. ├─xml2::read_html(url)
 2. └─xml2:::read_html.default(url)
 3.   ├─base::suppressWarnings(...)
 4.   │ └─base::withCallingHandlers(...)
 5.   ├─xml2::read_xml(x, encoding = encoding, ..., as_html = TRUE, options = options)
 6.   └─xml2:::read_xml.character(...)
 7.     └─xml2:::read_xml.connection(...)
 8.       ├─base::open(x, "rb")
 9.       └─base::open.connection(x, "rb")

Solution

An alternative approach for this URL:

Open a web browser (I'm using Firefox, but others should work), open the developer console (F12 in many browsers, though the method is browser-specific), and go to its "Network" tab.
Browse to the URL, https://providers.anthem.com/new-york-provider/claims/reimbursement-policies (without a trailing /).
Look for requests whose Type is json.

From this, I see a json GET row.

Its "Response" content looks promising (named "AllDocs"), and one of the docs has a .pdf extension (there are other file types).

We can extract that link into R and iterate over the docs. I'll copy the URL by right-clicking on the json GET row and choosing "Copy URL", then switch to R:

jsonurl <- "https://providers.anthem.com/sites/Satellite?d=Universal&pagename=getdocuments&brand=BCCNYE&state=&formslibrary=gpp_formslib"
alldocs <- httr::GET(jsonurl) |> httr::content()
alldocs2 <- data.frame(URI = unlist(lapply(alldocs[[1]], `[[`, "URI"))) |>
  transform(filename = sub(".*/(.*)\\?.*", "\\1", URI)) |>
  subset(grepl("pdf$", filename))
head(alldocs2)
#                                                                             URI                                             filename
# 2    /docs/gpp/NY_ABC_CAID_RP_ProfessionalAnesthesiaServices.pdf?v=202401262130    NY_ABC_CAID_RP_ProfessionalAnesthesiaServices.pdf
# 3                /docs/gpp/NY_ABC_CAID_RP_AssistantatSurgery.pdf?v=202406111406                NY_ABC_CAID_RP_AssistantatSurgery.pdf
# 4              /docs/gpp/NY_ABC_CAID_BH_ConcurrentReviewForm.pdf?v=202312090644              NY_ABC_CAID_BH_ConcurrentReviewForm.pdf
# 5       /docs/gpp/NY_ABC_CAID_BehavioralHealth_QuickRefGuide.pdf?v=202504091758       NY_ABC_CAID_BehavioralHealth_QuickRefGuide.pdf
# 6 /docs/gpp/NY_ABC_CAID_BH_ICRPortalTrainingProviderBrochure.pdf?v=202312312000 NY_ABC_CAID_BH_ICRPortalTrainingProviderBrochure.pdf
# 7               /docs/gpp/NY_ABC_CAID_NotificationofDelivery.pdf?v=202312151000               NY_ABC_CAID_NotificationofDelivery.pdf

(There were .xls files as well; I'm assuming you don't want or need them. If you want to see everything, remove the subset() above.)
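To make the filename step easier to follow, here is a small offline sketch of what the sub() pattern does. The first URI comes from the output above; the other two are made up for illustration. One caveat it shows: if an entry has no "?v=..." suffix, the pattern does not match and sub() returns the whole URI unchanged, so a variant that strips the query string first and then takes basename() may be safer. I have not checked whether the real response contains such entries.

```r
uris <- c(
  "/docs/gpp/NY_ABC_CAID_RP_AssistantatSurgery.pdf?v=202406111406",  # from the output above
  "/docs/gpp/hypothetical_sheet.xls?v=1",                            # made-up non-PDF entry
  "/docs/gpp/hypothetical_no_query.pdf"                              # made-up entry without "?v="
)

# Pattern used above: keep the text between the last "/" and the "?".
# The third entry has no "?", so it comes back unchanged (slashes included).
original <- sub(".*/(.*)\\?.*", "\\1", uris)

# Variant: drop any query string first, then take the last path component.
filenames <- basename(sub("\\?.*$", "", uris))

# Anchor on the literal ".pdf" extension, ignoring case.
is_pdf <- grepl("\\.pdf$", filenames, ignore.case = TRUE)

print(data.frame(original, filenames, is_pdf))
```

With the variant, the third entry yields "hypothetical_no_query.pdf", which is usable as a local file name.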

You can iterate over this frame however you want. One way (downloading just the top 3 here):

ign <- Map(
  download.file,
  paste0("https://providers.anthem.com", alldocs2$URI[1:3]), 
  alldocs2$filename[1:3]
)
# trying URL 'https://providers.anthem.com/docs/gpp/NY_ABC_CAID_RP_ProfessionalAnesthesiaServices.pdf?v=202401262130'
# Content type 'application/pdf' length 187139 bytes (182 KB)
# ==================================================
# downloaded 182 KB
# trying URL 'https://providers.anthem.com/docs/gpp/NY_ABC_CAID_RP_AssistantatSurgery.pdf?v=202406111406'
# downloaded 131 KB
# trying URL 'https://providers.anthem.com/docs/gpp/NY_ABC_CAID_BH_ConcurrentReviewForm.pdf?v=202312090644'
# downloaded 144 KB

list.files(pattern = ".*pdf$")
# [1] "NY_ABC_CAID_BH_ConcurrentReviewForm.pdf"           "NY_ABC_CAID_RP_AssistantatSurgery.pdf"             "NY_ABC_CAID_RP_ProfessionalAnesthesiaServices.pdf"

I did not attempt to download any more than that, so a few disclaimers:

The "terms of use" clearly include

Reproduction, distribution, republication and retransmission of material contained within the Anthem Web Site is prohibited

I'm assuming you are downloading this solely for personal reference.
If they detect and do not like the rapid download of all files, their response might include:
- throttling you (slower bandwidth)
- returning a status code of HTTP 429 (too many requests), which is a gentle hint that you should slow your crawl (i.e., wait and try later, and/or add a sleep between downloads)
- outright banning your IP for some amount of time
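If you do download more than a handful, a loop that pauses between requests and backs off on HTTP 429 is one way to be gentler. This is only a sketch: download_politely(), pause, max_tries, and fetch are names I made up, and the pause lengths are guesses rather than anything the site documents. The default fetch uses httr::GET() with httr::write_disk(); it is passed in as an argument so the loop can be tried without touching the network.

```r
# Hypothetical helper: download files one at a time, sleeping between requests
# and waiting longer each time the server answers 429.
download_politely <- function(urls, dests, pause = 2, max_tries = 3,
                              fetch = function(url, dest) {
                                # Note: write_disk() saves whatever body comes back,
                                # including an error page, so check the status.
                                resp <- httr::GET(url, httr::write_disk(dest, overwrite = TRUE))
                                httr::status_code(resp)
                              }) {
  stopifnot(length(urls) == length(dests))
  status <- integer(length(urls))
  for (i in seq_along(urls)) {
    for (attempt in seq_len(max_tries)) {
      status[i] <- fetch(urls[i], dests[i])
      if (status[i] != 429L) break
      Sys.sleep(pause * 2^attempt)  # back off: wait longer after each 429
    }
    Sys.sleep(pause)  # fixed pause between files
  }
  data.frame(url = urls, dest = dests, status = status)
}

# Usage with the frame built earlier (not run here):
# res <- download_politely(
#   paste0("https://providers.anthem.com", alldocs2$URI[1:3]),
#   alldocs2$filename[1:3]
# )
# subset(res, status != 200)
```

The returned frame records the last status code seen per file, so anything that is not 200 can be retried later or inspected by hand.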
I haven't QA'd all of the URIs in this response; I blindly prepended the "https://providers..." string, and there's been no validation that this is valid across all of them (though it seems reasonable to me that it should be okay).
There are fields other than "URI" in each entry that might be interesting, including states, title (which is more human-readable), topic, etc.
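If you want those extra fields alongside the URI, be aware that unlist(lapply(..., `[[`, "field")) silently drops entries where the field is NULL, which would misalign the columns. A minimal sketch that keeps one value per entry is below. The docs list is mock data standing in for alldocs[[1]], get_field() is a hypothetical helper, and the exact field name "title" is an assumption; check names(alldocs[[1]][[1]]) on the real response.

```r
# Mock entries standing in for alldocs[[1]]; only "URI" is a confirmed field name.
docs <- list(
  list(URI = "/docs/gpp/example_one.pdf?v=1", title = "Example one"),
  list(URI = "/docs/gpp/example_two.xls?v=2")  # this entry has no title
)

# Pull one field from every entry, returning NA where it is missing
# or not a single value, so the result always has one element per entry.
get_field <- function(entries, name) {
  vapply(entries, function(e) {
    val <- e[[name]]
    if (is.null(val) || length(val) != 1L) NA_character_ else as.character(val)
  }, character(1))
}

tab <- data.frame(
  URI   = get_field(docs, "URI"),
  title = get_field(docs, "title")
)
print(tab)
```

On the real data, replacing docs with alldocs[[1]] should give a frame with one row per document, which can then be filtered the same way as alldocs2.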