
Link redirection problem - Web Scraping in R using Rvest


While scraping links from a news site with rvest, I often came across links that redirect to other links. In those cases, I could only scrape the first link, while the second link was the one that actually contained the data. For example:

library(dplyr)
library(rvest)
scraped.link <- "http://www1.folha.uol.com.br/folha/dinheiro/ult91u301428.shtml"

article.title <- read_html(scraped.link) %>%
      html_nodes('body') %>%
      html_nodes('.span12.page-content') %>%
      html_nodes('article') %>%
      html_nodes('header') %>%
      html_nodes('h1') %>%
      html_text()
article.title
#> character(0)

redirected.link <- "https://www1.folha.uol.com.br/mercado/2007/06/301428-banco-central-volta-a-intervir-no-mercado-para-deter-queda-do-cambio.shtml"

article.title <- read_html(redirected.link) %>%
      html_nodes('body') %>%
      html_nodes('.span12.page-content') %>%
      html_nodes('article') %>%
      html_nodes('header') %>%
      html_nodes('h1') %>%
      html_text()
article.title
#> "Banco Central volta a intervir no mercado para deter queda do câmbio"

Is there any way to obtain the second link using the first one? The website only retains the first one.


Solution

  • Yes. The page redirects via a JavaScript `location.replace` call, so you can use a regular expression to extract the first quoted string after the first occurrence of "location.replace" in the text of the script tags:

    library(dplyr)
    library(rvest)
    scraped.link <- "http://www1.folha.uol.com.br/folha/dinheiro/ult91u301428.shtml"
    link.regex   <- "(.*?location[.]replace.*?\")(.*?)(\".*)"
    
    read_html(scraped.link)      %>%
      html_nodes('script')       %>%
      html_text()                %>%
      gsub(link.regex, "\\2", .)  
    #> [1] "http://www1.folha.uol.com.br/mercado/2007/06/301428-banco-central-volta-a-intervir-no-mercado-para-deter-queda-do-cambio.shtml"
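
If a page has several script tags, `gsub` returns the unmatched ones unchanged, so it may help to keep only the scripts that actually contain a redirect. The following is a minimal sketch of that idea in base R, not a definitive implementation. The helper name `extract_redirect`, the sample script strings, and the `example.com` URL are all made up for illustration, and the pattern assumes the URL is passed to `location.replace` as a quoted string literal.

```r
# Hypothetical helper (not part of rvest): pull the target of a JavaScript
# location.replace("...") call out of a character vector of <script> texts.
extract_redirect <- function(script_text) {
  # Match location.replace( followed by a single- or double-quoted URL.
  pattern <- "location\\.replace\\(\\s*[\"']([^\"']+)[\"']"
  hits <- regmatches(script_text, regexec(pattern, script_text, perl = TRUE))
  # A match yields c(full_match, captured_url); no match yields character(0).
  urls <- vapply(
    hits,
    function(m) if (length(m) == 2) m[2] else NA_character_,
    character(1)
  )
  urls <- urls[!is.na(urls)]
  # Return the first redirect target found, or NA if there is none.
  if (length(urls) == 0) NA_character_ else urls[1]
}

# Made-up script contents standing in for the output of html_text().
scripts <- c(
  "var tracker = 'abc';",
  "location.replace(\"http://example.com/new/article.shtml\");"
)

extract_redirect(scripts)
#> [1] "http://example.com/new/article.shtml"

extract_redirect("var x = 1;")
#> [1] NA
```

With rvest, the `script_text` argument would be the result of `read_html(scraped.link) %>% html_nodes('script') %>% html_text()`. An `NA` result then signals that the page has no `location.replace` redirect and can be scraped directly.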