r web-scraping rvest

Scraping Google with rvest (2022 layout update)


A few weeks ago, people on this site helped me with code to get links, titles, and text from a Google search using rvest. Now I am trying to use the same code, as provided in:

How to retrieve hyperlinks in google search using rvest

How to retrieve text below titles from google search using rvest

But it no longer works and gives the following results:

library(rvest)
library(tidyverse)

#Part 1

url <- 'https://www.google.com/search?q=Mario+Torres+Mexico'

title <- "//div/div/div/a/h3"
text  <- paste0(title, "/parent::a/parent::div/following-sibling::div")

first_page <- read_html(url)

tibble(title = first_page %>% html_nodes(xpath = title) %>% html_text(),
       text = first_page %>% html_nodes(xpath = text) %>% html_text())

Result:

# A tibble: 0 x 2
# ... with 2 variables: title <chr>, text <chr>

And the second part:

#Part 2
titles <- html_nodes(first_page, xpath = "//div/div/a/h3")

titles %>%
  html_elements(xpath = "./parent::a") %>%
  html_attr("href") %>%
  str_extract("https.*?(?=&)")

Result:

character(0)

This worked a few weeks ago. Is it possible to fix this issue?


Solution

  • It looks like Google changed its HTML layout; perhaps there were too many of us scrapers.

    Here you go:

    library(rvest)
    library(tidyverse)
    
    #Part 1
    
    url <- 'https://www.google.com/search?q=Mario+Torres+Mexico'
    
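    # The title text now sits in a <div> inside the <h3>, nested two <div>s below the link
    title <- "//div/div/div/a/div/div/h3/div"
    # Climb from the title back up to the link's parent <div>, then descend into the sibling <div> that holds the snippet
    text  <- paste0(title, "/parent::h3/parent::div/parent::div/parent::a/parent::div/following-sibling::div/div[1]/div[1]/div[1]/div[1]/div[1]")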
    
    first_page <- read_html(url)
    
    tibble(title = first_page %>% html_nodes(xpath = title) %>% html_text(),
           text = first_page %>% html_nodes(xpath = text) %>% html_text())
    

    And part 2:

    titles <- html_nodes(first_page, xpath = "//div/div/div/a/div/div/h3/div")
    
    titles %>%
      html_elements(xpath = "./parent::h3/parent::div/parent::div/parent::a") %>%
      html_attr("href") %>%
      str_extract("https.*?(?=&)")
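
Since fixed-depth paths like the ones above break whenever the nesting changes, here is a rough sketch of a looser approach: select every link that contains an `h3` at any depth, then read the title, href, and following snippet relative to that link. The HTML below is a hypothetical stand-in that only mimics the nesting implied by the XPath in this answer (Google's real markup will differ and can change again), and the names `page`, `links`, and `results` are just illustrative. It assumes rvest 1.0 or later for `minimal_html()` and `html_elements()`.

```r
library(rvest)
library(stringr)
library(tibble)

# Hypothetical markup (an assumption, not Google's actual page): one search
# result with a snippet, plus one unrelated link that has no <h3>.
page <- minimal_html('
  <div>
    <div><a href="/url?q=https://example.com/page&sa=U"><div><div><h3><div>First result</div></h3></div></div></a></div>
    <div><div>Snippet for the first result.</div></div>
  </div>
  <div><a href="/preferences">Settings</a></div>
')

# Anchor on "a link that contains an h3" instead of a fixed chain of divs.
links <- html_elements(page, xpath = "//a[.//h3]")

results <- tibble(
  # html_element() (singular) returns one match per link, or NA if missing.
  title = links %>% html_element("h3") %>% html_text(trim = TRUE),
  # Pull the target URL out of the redirect-style href, up to the first "&".
  url   = links %>% html_attr("href") %>% str_extract("https?://[^&]+"),
  # The snippet is assumed to live in the div that follows the link's parent.
  text  = links %>%
    html_element(xpath = "./parent::div/following-sibling::div[1]") %>%
    html_text(trim = TRUE)
)

results
```

Because each column is derived from the same set of links, the columns always have the same length, which avoids a `tibble()` error if a result happens to have no snippet. On a live page you would swap `minimal_html(...)` for `read_html(url)`; whether the `following-sibling` step still finds the snippet depends on the layout at the time.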