Using R, I am trying to get the text (ideally, with some formatting) of a PDF embedded in HTML. The URL, as an example, is "https://www.nycourts.gov/courts/ad2/Handdowns/2024/10-October/10-02-2024_FINAL_HANDDOWN_LIST.pdf"
Using pdf_text() from the pdftools package doesn't work:
> pdf_text <- pdf_text("https://www.nycourts.gov/courts/ad2/Handdowns/2024/10-October/10-02-2024_FINAL_HANDDOWN_LIST.pdf")
Error in open.connection(con, "rb") :
cannot open the connection to 'https://www.nycourts.gov/courts/ad2/Handdowns/2024/10-October/10-02-2024_FINAL_HANDDOWN_LIST.pdf'
In addition: Warning message:
In open.connection(con, "rb") :
cannot open URL 'https://www.nycourts.gov/courts/ad2/Handdowns/2024/10-October/10-02-2024_FINAL_HANDDOWN_LIST.pdf': HTTP status was '403 Forbidden'
I've also tried using RSelenium to navigate to the page and glean anything from the HTML, with no luck:
> remDr$navigate("https://www.nycourts.gov/courts/ad2/Handdowns/2024/10-October/10-02-2024_FINAL_HANDDOWN_LIST.pdf")
> pageHTML <- remDr$getPageSource()[[1]]
> pageHTML
[1] "<html><head></head><body style=\"height: 100%; width: 100%; overflow: hidden; margin:0px; background-color: rgb(82, 86, 89);\"><embed name=\"843DE9299AC47C3596F8B8E1296AD1FC\" style=\"position:absolute; left: 0; top: 0;\" width=\"100%\" height=\"100%\" src=\"about:blank\" type=\"application/pdf\" internalid=\"843DE9299AC47C3596F8B8E1296AD1FC\"></body></html>"
If it's not possible to just get the text, I'd be happy to download the PDF automatically and then run pdf_text() on the file, but I have not been able to get RSelenium to do that.
To open a remote PDF in any viewer, it first has to be fetched from the server and decoded into local screen pixels. A browser does the same thing: the PDF is downloaded first, then rendered in the browser window. Your 403 suggests the server rejects requests that don't present a browser-like User-Agent, which is what the -A option below supplies. The equivalent download from the command line is:
curl -A "Mozilla ()/20100101 Firefox/81.0" -O https://www.nycourts.gov/courts/ad2/Handdowns/2024/10-October/10-02-2024_FINAL_HANDDOWN_LIST.pdf & 10-02-2024_FINAL_HANDDOWN_LIST.pdf
(The trailing & 10-02-2024_FINAL_HANDDOWN_LIST.pdf is Windows cmd chaining that opens the downloaded file in the default viewer; drop it if you only want the download.)
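In R, the same trick works without RSelenium: send a browser-like User-Agent header, save the file, then run pdf_text() on the local copy. A minimal sketch (the User-Agent string is arbitrary; passing headers= to download.file() needs R >= 4.0 with the libcurl or wininet method):

library(pdftools)

url <- "https://www.nycourts.gov/courts/ad2/Handdowns/2024/10-October/10-02-2024_FINAL_HANDDOWN_LIST.pdf"
destfile <- "10-02-2024_FINAL_HANDDOWN_LIST.pdf"
ua <- "Mozilla/5.0 (Windows NT 10.0; rv:81.0) Gecko/20100101 Firefox/81.0"

# Download with a browser-like User-Agent so the server doesn't answer 403
download.file(url, destfile, mode = "wb",
              headers = c(`User-Agent` = ua))

# pdf_text() now reads the local file without hitting the server
txt <- pdf_text(destfile)
cat(txt[1])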
Once you have that binary file you can extract its hyperlinks with any suitable shell tool; the Coherent PDF command-line tool (cpdf) can dump the annotations as JSON, here filtered through the Windows find command:
cpdf -list-annotations-json 10-02-2024_FINAL_HANDDOWN_LIST.pdf |find "https"
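If you would rather stay in R, you can run cpdf through system2() and filter the JSON dump for links. A sketch, assuming cpdf is installed and on your PATH, with a regular expression standing in for the find "https" filter:

# Dump the annotations as JSON and capture the output in R
ann_json <- system2("cpdf", c("-list-annotations-json", destfile), stdout = TRUE)

# Keep only the https URLs, mirroring `find "https"`
links <- unique(unlist(regmatches(ann_json, gregexpr("https://[^\"]+", ann_json))))
links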
You can then loop back so that curl downloads each of those linked PDFs the same way as the first one.
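Continuing the R sketch, loop over the extracted links with the same User-Agent (basename() assumes each link ends in a plain file name):

# Download each linked PDF the same way as the first one (`ua` from above)
for (u in links) {
  download.file(u, basename(u), mode = "wb",
                headers = c(`User-Agent` = ua))
}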
Alternatively, you can try something similar with a PDF.js-based scraper: search the decoded PDF's internals for the embedded references and extract them as a list. PDF.js is Mozilla's viewer, though, so that is a Firefox ability rather than a Chrome one.