r rvest bitstream pdftools

How to get the bitstream URL of a PDF from the href links of an HTML page


I am using the rvest R package to scrape a PDF file from this webpage, but the final link (a "bitstream" URL, whatever that is) only shows up after I click the link named AC1-96-21-01-2011.pdf. The PDF itself seems to be hidden from direct access. This defeats every attempt with rvest's read_html(), because the PDF opens only after clicking the preceding link (the href). Here is the XML node that does not let me reach the PDF file:

<a href="/judgments/handle/123456789/701">Arbitration Case - AC</a>

The final file is at this URL, which does not appear in the href of that node: http://judgmenthck.kar.nic.in/judgments/bitstream/123456789/563560/2/AC1-96-21-01-2011.pdf

In summary: how do I use rvest to get the link to the PDF file when it is not in the href attribute, as explained above?

I tried searching for "bitstream", but it takes me to something else.


Solution

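  • I think you're looking at the wrong node. The <a> you quoted points to /judgments/handle/123456789/701, not to the file. On the item page (handle 123456789/563560) the PDF link sits in an <a target="_blank"> inside a table cell, and its href is relative to the site root, so the host has to be prepended: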

    library(rvest)
    
    "http://judgmenthck.kar.nic.in/judgments/handle/123456789/563560" %>%
    read_html()                                                       %>%
    html_nodes(xpath = "//td/a[@target='_blank']")                    %>%
    html_attr("href")                                                 %>% 
    unique()                                                          %>% 
    {grep("[.]pdf", ., value = TRUE)}                                 %>%
    paste0("http://judgmenthck.kar.nic.in", .)                         ->
    pdf_url
    
    print(pdf_url)
    # [1] "http://judgmenthck.kar.nic.in/judgments/bitstream/123456789/563560/2/AC1-96-21-01-2011.pdf"
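    
    The same idea can be wrapped in a small helper. This is only a sketch: `extract_pdf_links()` is a hypothetical name, and the inline HTML snippet is an assumption modeled on the XPath above, not a copy of the real page. It resolves relative hrefs with `xml2::url_absolute()` instead of `paste0()`, which is a substitution on my part so that links already given as absolute URLs are left alone.

```r
library(rvest)

# Hypothetical helper: return every link ending in ".pdf" found in a
# parsed page, resolved against the URL the page was loaded from.
extract_pdf_links <- function(page, base_url) {
  hrefs <- html_attr(html_nodes(page, "a"), "href")
  hrefs <- unique(hrefs[!is.na(hrefs)])
  pdfs  <- grep("[.]pdf$", hrefs, value = TRUE, ignore.case = TRUE)
  if (length(pdfs) == 0) return(character(0))
  xml2::url_absolute(pdfs, base_url)
}

item_url <- "http://judgmenthck.kar.nic.in/judgments/handle/123456789/563560"

# Offline stand-in for the item page (assumed markup, for illustration only).
snippet <- read_html(paste0(
  "<html><body><table><tr><td>",
  "<a target='_blank' ",
  "href='/judgments/bitstream/123456789/563560/2/AC1-96-21-01-2011.pdf'>",
  "AC1-96-21-01-2011.pdf</a>",
  "</td></tr></table></body></html>"
))

links <- extract_pdf_links(snippet, item_url)
print(links)

# Against the live site (not run here, the server may be unavailable):
# links <- extract_pdf_links(read_html(item_url), item_url)
# download.file(links[1], destfile = basename(links[1]), mode = "wb")
# text <- pdftools::pdf_text(basename(links[1]))
```

    Note `mode = "wb"` in the commented download step: without it, a PDF fetched on Windows may be written in text mode and end up corrupted.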