While scraping links from a news site with rvest, I often ran into links that redirect to other links. In those cases I could only scrape the first link, while the second link was the one that actually contained the data. For example:
library(dplyr)
library(rvest)
scraped.link <- "http://www1.folha.uol.com.br/folha/dinheiro/ult91u301428.shtml"
article.title <- read_html(scraped.link) %>%
  html_nodes('body') %>%
  html_nodes('.span12.page-content') %>%
  html_nodes('article') %>%
  html_nodes('header') %>%
  html_nodes('h1') %>%
  html_text()
article.title
#> character(0)
redirected.link <- "https://www1.folha.uol.com.br/mercado/2007/06/301428-banco-central-volta-a-intervir-no-mercado-para-deter-queda-do-cambio.shtml"
article.title <- read_html(redirected.link) %>%
  html_nodes('body') %>%
  html_nodes('.span12.page-content') %>%
  html_nodes('article') %>%
  html_nodes('header') %>%
  html_nodes('h1') %>%
  html_text()
article.title
#> "Banco Central volta a intervir no mercado para deter queda do câmbio"
Is there any way to obtain the second link using the first one? The website only retains the first one.
Yes. The page redirects via a JavaScript `location.replace` call, so you can use a regular expression to extract the first quoted item after the first instance of "location.replace" in the text of the page's script tags:
library(dplyr)
library(rvest)
scraped.link <- "http://www1.folha.uol.com.br/folha/dinheiro/ult91u301428.shtml"
link.regex <- "(.*?location[.]replace.*?\")(.*?)(\".*)"
read_html(scraped.link) %>%
  html_nodes('script') %>%
  html_text() %>%
  gsub(link.regex, "\\2", .)
#> [1] "http://www1.folha.uol.com.br/mercado/2007/06/301428-banco-central-volta-a-intervir-no-mercado-para-deter-queda-do-cambio.shtml"
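One caveat with `gsub` is that it returns its input unchanged when the pattern does not match, so a script tag without a `location.replace` call comes back as its full text. Below is a minimal sketch of a stricter variant, using only base R, that returns `NA` for scripts with no redirect. The helper name `extract_js_redirect`, the `scripts` vector, and the example.com URL are made up for illustration. In practice the input would be the `html_text()` output from the script nodes, and the pattern assumes the target is a plain quoted string.

```r
# Hypothetical helper: pull the target URL out of a JavaScript
# location.replace("...") call in each script's text.
# Returns NA for scripts that contain no such call.
extract_js_redirect <- function(script_text) {
  # Capture the first single- or double-quoted argument of location.replace(
  pattern <- "location[.]replace[(][[:space:]]*[\"']([^\"']+)[\"']"
  matches <- regmatches(script_text, regexec(pattern, script_text))
  vapply(
    matches,
    function(m) if (length(m) >= 2) m[2] else NA_character_,
    character(1)
  )
}

# Made-up script contents standing in for html_text() output
scripts <- c(
  "var a = 1;",
  "location.replace(\"http://example.com/new/page.shtml\");"
)

targets <- extract_js_redirect(scripts)

# Keep the first redirect target found, if any
redirect <- targets[!is.na(targets)][1]
print(redirect)
```

If `redirect` is not `NA`, it can be passed to `read_html()` in place of the original link and scraped with the same selectors as before.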