A few weeks ago, people on this site helped me with code to get links, titles, and text from a Google search using rvest. Now I am trying to use the same code as provided in:
How to retrieve hyperlinks in google search using rvest
How to retrieve text below titles from google search using rvest
But it is no longer working; it gives the following results:
library(rvest)
library(tidyverse)
#Part 1
url <- 'https://www.google.com/search?q=Mario+Torres+Mexico'
title <- "//div/div/div/a/h3"
text <- paste0(title, "/parent::a/parent::div/following-sibling::div")
first_page <- read_html(url)
tibble(title = first_page %>% html_nodes(xpath = title) %>% html_text(),
       text  = first_page %>% html_nodes(xpath = text) %>% html_text())
Result:
# A tibble: 0 x 2
# ... with 2 variables: title <chr>, text <chr>
And second part:
#Part 2
titles <- html_nodes(first_page, xpath = "//div/div/a/h3")
titles %>%
html_elements(xpath = "./parent::a") %>%
html_attr("href") %>%
str_extract("https.*?(?=&)")
Result:
character(0)
But this worked until a few weeks ago. Is it possible to fix this issue?
It looks like Google decided to change their HTML layout; perhaps there were too many of us scrapers.
Here you go:
library(rvest)
library(tidyverse)
#Part 1
url <- 'https://www.google.com/search?q=Mario+Torres+Mexico'
title <- "//div/div/div/a/div/div/h3/div"
text <- paste0(title, "/parent::h3/parent::div/parent::div/parent::a/parent::div/following-sibling::div/div[1]/div[1]/div[1]/div[1]/div[1]")
first_page <- read_html(url)
tibble(title = first_page %>% html_elements(xpath = title) %>% html_text(),
       text  = first_page %>% html_elements(xpath = text) %>% html_text())
And part 2:
titles <- html_elements(first_page, xpath = "//div/div/div/a/div/div/h3/div")
titles %>%
html_elements(xpath = "./parent::h3/parent::div/parent::div/parent::a") %>%
html_attr("href") %>%
str_extract("https.*?(?=&)")
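For reference, the str_extract() pattern works because Google wraps result links in a redirect of the form /url?q=<target>&sa=...; the lazy .*? plus the (?=&) lookahead stops the match at the first &. A quick sketch on a made-up href (the URL is illustrative, not a real result):

```r
library(stringr)

href <- "/url?q=https://example.com/page&sa=U&ved=abc123"
str_extract(href, "https.*?(?=&)")
#> [1] "https://example.com/page"
```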