I am trying to get the links to PDFs from a site in R, but the rvest read_html() function just sits there, seemingly making no progress.
Here is my code:
# Load required libraries
library(tidyverse)
library(rvest)
# Define the URL
url <- "https://providers.anthem.com/new-york-provider/claims/reimbursement-policies/"
# Read and process the HTML
links <- try({
  read_html(url) %>%
    html_node(xpath = "/html/body/main/div/div/div/section[3]/div/section/div[1]/section/div[1]/div/div[2]/div/p/a") %>%
    html_attr("href") %>%
    as_tibble() %>%
    rename(url = value)
})
# Display the results with error handling
if (!inherits(links, "try-error")) {
  print(links)
} else {
  message("Unable to scrape the URL. This might be due to:")
  message("- Website requires authentication")
  message("- Website blocks automated scraping")
  message("- The XPath structure has changed")
  message("- Network connectivity issues")
}
Maybe I should do this via httr2?
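For context, the httr2 version I have in mind is roughly the sketch below: set an explicit, browser-style User-Agent before requesting the page. The User-Agent string is just an example I made up, and I haven't confirmed it gets past whatever is stalling read_html().
library(httr2)
# Sketch only: request the page with an explicit User-Agent, then parse the body.
resp <- request(url) |>
  req_user_agent("Mozilla/5.0 (compatible; R httr2)") |>  # arbitrary example UA
  req_perform()
page <- resp_body_html(resp)  # an xml2 document, the same type read_html() returns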
Here is an error message from xml2::read_html():
> read_html(url)
Error in `open.connection()`:
! cannot open the connection
▆
1. ├─xml2::read_html(url)
2. └─xml2:::read_html.default(url)
3. ├─base::suppressWarnings(...)
4. │ └─base::withCallingHandlers(...)
5. ├─xml2::read_xml(x, encoding = encoding, ..., as_html = TRUE, options = options)
6. └─xml2:::read_xml.character(...)
7. └─xml2:::read_xml.connection(...)
8. ├─base::open(x, "rb")
9. └─base::open.connection(x, "rb")
An alternative approach for this URL:
Open your browser's developer tools (F12, or some other method, browser-specific), go to its "Network" tab, and load https://providers.anthem.com/new-york-provider/claims/reimbursement-policies (without a trailing /). Look at the Type == json downloads. The "Response" content of one of them looks promising (named "AllDocs"), and we can see one of the docs has a .pdf extension (there are other file types).
We can extract that link into R and iterate over the docs. I'll copy the URL by right-clicking on the json GET row and choosing "Copy URL", then go to R:
jsonurl <- "https://providers.anthem.com/sites/Satellite?d=Universal&pagename=getdocuments&brand=BCCNYE&state=&formslibrary=gpp_formslib"
alldocs <- httr::GET(jsonurl) |> httr::content()
alldocs2 <- data.frame(URI = unlist(lapply(alldocs[[1]], `[[`, "URI"))) |>
  transform(filename = sub(".*/(.*)\\?.*", "\\1", URI)) |>
  subset(grepl("pdf$", filename))
head(alldocs2)
# URI filename
# 2 /docs/gpp/NY_ABC_CAID_RP_ProfessionalAnesthesiaServices.pdf?v=202401262130 NY_ABC_CAID_RP_ProfessionalAnesthesiaServices.pdf
# 3 /docs/gpp/NY_ABC_CAID_RP_AssistantatSurgery.pdf?v=202406111406 NY_ABC_CAID_RP_AssistantatSurgery.pdf
# 4 /docs/gpp/NY_ABC_CAID_BH_ConcurrentReviewForm.pdf?v=202312090644 NY_ABC_CAID_BH_ConcurrentReviewForm.pdf
# 5 /docs/gpp/NY_ABC_CAID_BehavioralHealth_QuickRefGuide.pdf?v=202504091758 NY_ABC_CAID_BehavioralHealth_QuickRefGuide.pdf
# 6 /docs/gpp/NY_ABC_CAID_BH_ICRPortalTrainingProviderBrochure.pdf?v=202312312000 NY_ABC_CAID_BH_ICRPortalTrainingProviderBrochure.pdf
# 7 /docs/gpp/NY_ABC_CAID_NotificationofDelivery.pdf?v=202312151000 NY_ABC_CAID_NotificationofDelivery.pdf
(There were .xls files as well; I'm assuming you don't want/need them. If you want to see everything, remove the subset() above.)
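Since you asked about httr2: the same JSON fetch should work there as well. A minimal sketch, using the same jsonurl as above (httr2::resp_body_json() returns the same kind of nested list that httr::content() produced here):
alldocs <- httr2::request(jsonurl) |>
  httr2::req_perform() |>
  httr2::resp_body_json()
# alldocs[[1]] should again be the "AllDocs" list, so the data.frame code above is unchanged.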
You can iterate over this frame however you want. One way (downloading just the top 3 here):
ign <- Map(
  download.file,
  paste0("https://providers.anthem.com", alldocs2$URI[1:3]),
  alldocs2$filename[1:3]
)
# trying URL 'https://providers.anthem.com/docs/gpp/NY_ABC_CAID_RP_ProfessionalAnesthesiaServices.pdf?v=202401262130'
# Content type 'application/pdf' length 187139 bytes (182 KB)
# ==================================================
# downloaded 182 KB
# trying URL 'https://providers.anthem.com/docs/gpp/NY_ABC_CAID_RP_AssistantatSurgery.pdf?v=202406111406'
# downloaded 131 KB
# trying URL 'https://providers.anthem.com/docs/gpp/NY_ABC_CAID_BH_ConcurrentReviewForm.pdf?v=202312090644'
# downloaded 144 KB
list.files(pattern = ".*pdf$")
# [1] "NY_ABC_CAID_BH_ConcurrentReviewForm.pdf" "NY_ABC_CAID_RP_AssistantatSurgery.pdf" "NY_ABC_CAID_RP_ProfessionalAnesthesiaServices.pdf"
I did not attempt to download any more than that, so a few disclaimers:
The "terms of use" clearly include
Reproduction, distribution, republication and retransmission of material contained within the Anthem Web Site is prohibited
I'm assuming you are downloading this solely for personal reference.
If they detect and do not like the rapid download of all files, their response might include:
HTTP 429
(too many requests), which is a gently hint that you should likely slow your crawl (i.e., wait and try later, and/or add a sleep between downloads)I haven't QA'd all of the URIs in this response; I blindiy prepended the https://providers..."
string, there's been no validation that this is valid across all of them (though it seems reasonable to me that it should be okay)
There are fields other than "URI"
in each entry, they might be interesting, including states
, title
(which is more human-readable), topic
, etc.
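To expand on the throttling point, here is a minimal sketch of a slower, politer download loop; the 1-second pause is an arbitrary choice, and mode = "wb" keeps the PDFs from being mangled on Windows:
# Sketch only: download one file at a time, pausing between requests.
for (i in seq_len(nrow(alldocs2))) {
  download.file(
    url      = paste0("https://providers.anthem.com", alldocs2$URI[i]),
    destfile = alldocs2$filename[i],
    mode     = "wb"   # binary mode so the PDFs survive on Windows
  )
  Sys.sleep(1)        # arbitrary pause; lengthen it (or back off) if you see HTTP 429
}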