Tags: html, r, web-scraping, text-mining, sentiment-analysis

Web scrape hyperlinked text in R?


https://www.nber.org/papers?page=1&perPage=50&sortBy=public_date

The page above lists a series of academic papers. Each title (e.g., Sparse Modeling Under Grouped Heterogeneity with an Application to Asset Pricing) is a hyperlink to a page with more detail on that paper.

Is there any way to scrape all these links in R? I want only the links attached to the paper titles, not hyperlinks attached to other things such as people's names. I do not want the titles themselves, just the links they point to.


Solution

  • The abstracts and links are loaded onto the page dynamically by an XHR call that fetches a JSON file to populate the HTML. To get the links quickly and efficiently, you can request that JSON directly and parse it. You can find the JSON URL in the Network tab of your browser's developer tools.

    Here's a full reprex:

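    # Query the JSON endpoint behind the listing page and pull out each
    # paper's relative URL (needs the httr package and R >= 4.1 for |>).
    urls <- "https://www.nber.org/api/v1/working_page_listing/contentType/" |>
      paste0("working_paper/_/_/search?page=1&perPage=50&sortBy=public_date") |>
      httr::GET() |>                 # send the request
      httr::content("parsed") |>     # parse the JSON body into an R list
      getElement("results") |>       # one list entry per paper
      sapply(function(x) x$url)      # keep only each entry's url field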
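
To see what the last two steps of the pipeline above do without touching the network, here is a minimal sketch on a hand-made list shaped like the parsed response. The `results` and `url` names come from the code above; the `title` field and all values in the mock are made up for illustration. It uses `vapply()` in place of `sapply()`, which is a stricter variant that errors if any entry does not yield exactly one string.

```r
# Hypothetical stand-in for httr::content(resp, "parsed").
# The title field and the paper paths below are invented for this sketch.
parsed <- list(
  results = list(
    list(title = "Example paper A", url = "/papers/w00001"),
    list(title = "Example paper B", url = "/papers/w00002")
  )
)

# Pull the url field out of every entry; character(1) tells vapply()
# that each entry must produce exactly one string.
rel_urls <- vapply(parsed$results, function(x) x$url, character(1))

# Prepend the domain to turn the relative paths into complete links.
full_urls <- paste0("https://www.nber.org", rel_urls)
```

Because `vapply()` fails on a missing or malformed `url` field rather than silently returning a list, it can make changes in the response format easier to notice.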
    

    If you want complete URLs rather than relative ones, paste the domain in front:

    paste0("https://www.nber.org", urls)
    #>  [1] "https://www.nber.org/papers/w31388" "https://www.nber.org/papers/w31424"
    #>  [3] "https://www.nber.org/papers/w31482" "https://www.nber.org/papers/w31477"
    #>  [5] "https://www.nber.org/papers/w31478" "https://www.nber.org/papers/w31479"
    #>  [7] "https://www.nber.org/papers/w31480" "https://www.nber.org/papers/w31481"
    #>  [9] "https://www.nber.org/papers/w31490" "https://www.nber.org/papers/w31502"
    #> [11] "https://www.nber.org/papers/w31486" "https://www.nber.org/papers/w31483"
    #> [13] "https://www.nber.org/papers/w31484" "https://www.nber.org/papers/w31485"
    #> [15] "https://www.nber.org/papers/w31494" "https://www.nber.org/papers/w31489"
    #> [17] "https://www.nber.org/papers/w31496" "https://www.nber.org/papers/w31491"
    #> [19] "https://www.nber.org/papers/w31493" "https://www.nber.org/papers/w31488"
    #> [21] "https://www.nber.org/papers/w31495" "https://www.nber.org/papers/w31497"
    #> [23] "https://www.nber.org/papers/w31498" "https://www.nber.org/papers/w31499"
    #> [25] "https://www.nber.org/papers/w31500" "https://www.nber.org/papers/w31501"
    #> [27] "https://www.nber.org/papers/w31487" "https://www.nber.org/papers/w31503"
    #> [29] "https://www.nber.org/papers/w31476" "https://www.nber.org/papers/w31492"
    #> [31] "https://www.nber.org/papers/w31450" "https://www.nber.org/papers/w31449"
    #> [33] "https://www.nber.org/papers/w31448" "https://www.nber.org/papers/w31453"
    #> [35] "https://www.nber.org/papers/w31451" "https://www.nber.org/papers/w31452"
    #> [37] "https://www.nber.org/papers/w31454" "https://www.nber.org/papers/w31455"
    #> [39] "https://www.nber.org/papers/w31465" "https://www.nber.org/papers/w31458"
    #> [41] "https://www.nber.org/papers/w31459" "https://www.nber.org/papers/w31460"
    #> [43] "https://www.nber.org/papers/w31461" "https://www.nber.org/papers/w31472"
    #> [45] "https://www.nber.org/papers/w31473" "https://www.nber.org/papers/w31475"
    #> [47] "https://www.nber.org/papers/w31474" "https://www.nber.org/papers/w31470"
    #> [49] "https://www.nber.org/papers/w31462" "https://www.nber.org/papers/w31471"
    

    These are the complete links to all the articles on the first page. They are not in the order in which they appear on the page; I'm not sure whether the order is simply randomized.
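
If you need more than the first page, the same request can be wrapped in a function. This is only a sketch: the endpoint and the `page`, `perPage` and `sortBy` parameters are copied from the URL above, but the helper names `listing_url()` and `paper_links()` are my own, and I am assuming (not confirming) that incrementing `page` returns the next batch of papers.

```r
# Build the listing API URL for a given page. Endpoint and query
# parameters are copied from the request used earlier in this answer.
listing_url <- function(page, per_page = 50) {
  paste0(
    "https://www.nber.org/api/v1/working_page_listing/contentType/",
    "working_paper/_/_/search?page=", page,
    "&perPage=", per_page, "&sortBy=public_date"
  )
}

# Fetch one page and return complete paper links.
# Needs the httr package and network access.
paper_links <- function(page, per_page = 50) {
  resp <- httr::GET(listing_url(page, per_page))
  httr::stop_for_status(resp)  # stop with an error on HTTP failures
  results <- httr::content(resp, "parsed")$results
  rel <- vapply(results, function(x) x$url, character(1))
  # paste0() on an empty vector would return the bare domain, so guard it.
  if (length(rel) == 0) return(character(0))
  paste0("https://www.nber.org", rel)
}

# Assumption: page = 2, 3, ... return later batches of papers.
# links <- unlist(lapply(1:3, paper_links))
```

The final call is left commented out so that defining the helpers does not hit the site; if you loop over many pages, consider adding a short `Sys.sleep()` between requests.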

    Created on 2023-07-24 with reprex v2.0.2