I am using the rvest R package to scrape a PDF file from this webpage, but the final link (a "bitstream" URL, whatever that is) is only exposed after I click on the link named AC1-96-21-01-2011.pdf. The final PDF file is tucked away behind that link. This blocks all my attempts with the rvest function read_html(), because the PDF opens only after clicking on the previous link (its href). Here is the XML node that does not let me reach the PDF file:
<a href="/judgments/handle/123456789/701">Arbitration Case - AC</a>
The final file is at this URL, which is not exposed in the href attribute of that node:
http://judgmenthck.kar.nic.in/judgments/bitstream/123456789/563560/2/AC1-96-21-01-2011.pdf
So, in summary: how do I access the PDF file link using rvest when it is not found in the href attribute, as explained above? I tried searching for "bitstream", but it takes me to something else.
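I think you're looking at the wrong node. The PDF link sits on the item page itself (handle/123456789/563560), in an anchor with target="_blank":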
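library(rvest)

# Read the item page and collect the href of every <a target="_blank"> inside a table cell
pdf_url <- "http://judgmenthck.kar.nic.in/judgments/handle/123456789/563560" %>%
  read_html() %>%
  html_nodes(xpath = "//td/a[@target='_blank']") %>%
  html_attr("href") %>%
  unique() %>%
  # keep only the links that point at a PDF
  {grep("[.]pdf", ., value = TRUE)} %>%
  # the hrefs are site-relative, so prepend the host
  paste0("http://judgmenthck.kar.nic.in", .)

print(pdf_url)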
# [1] "http://judgmenthck.kar.nic.in/judgments/bitstream/123456789/563560/2/AC1-96-21-01-2011.pdf"
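If it helps, here is a minimal offline sketch of the same idea that you can run without hitting the site. The HTML snippet is a hypothetical stand-in for the item page (the real markup may differ), and using xml2::url_absolute() instead of paste0() is my own substitution, so that relative hrefs are resolved against the page URL rather than glued onto a hard-coded host.

```r
library(rvest)

# Hypothetical snippet standing in for the item page; the real markup may differ.
page <- read_html('
<table>
  <tr><td><a target="_blank" href="/judgments/bitstream/123456789/563560/2/AC1-96-21-01-2011.pdf">AC1-96-21-01-2011.pdf</a></td></tr>
  <tr><td><a target="_blank" href="/judgments/bitstream/123456789/563560/2/AC1-96-21-01-2011.pdf">View/Open</a></td></tr>
  <tr><td><a href="/judgments/handle/123456789/701">Arbitration Case - AC</a></td></tr>
</table>')

# URL of the page the links were found on, used as the base for relative hrefs
base_url <- "http://judgmenthck.kar.nic.in/judgments/handle/123456789/563560"

# Collect every href on the page
hrefs <- page %>%
  html_nodes("a") %>%
  html_attr("href")

# Keep unique links ending in .pdf, then resolve them against the page URL
pdf_paths <- unique(grep("[.]pdf$", hrefs, value = TRUE))
pdf_urls <- xml2::url_absolute(pdf_paths, base_url)
print(pdf_urls)

# To save the file (needs network access, so left commented out here):
# download.file(pdf_urls[1], destfile = basename(pdf_urls[1]), mode = "wb")
```

On Windows, mode = "wb" matters for download.file(), otherwise the PDF may come out corrupted.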