
Google News in R


I am trying to get info from Google News. This is my code:

library(rvest)
library(tidyverse)


news <- function(term) {

  html_dat <- read_html(paste0("https://news.google.com/search?q=",term,"&hl=es-419&gl=US&ceid=US%3Aes-419")) 

  dat <- data.frame(Link = html_dat %>%
                      html_nodes('.VDXfz') %>% 
                      html_attr('href')) %>% 
    mutate(Link = gsub("./articles/","https://news.google.com/articles/",Link))

  news_dat <- data.frame(
    Title = html_dat %>%
      html_nodes('.DY5T1d') %>% 
      html_text(),
    Link = dat$Link,
    Description =  html_dat %>%
      html_nodes('.Rai5ob') %>% 
      html_text()
  )

  return(news_dat)
}

noticias<-news("coronavirus")

This code retrieves the title, link and description. However, I need two more fields: the publication date and the media outlet. For example, if an article about a coronavirus vaccine was published yesterday, the date field should hold yesterday's date; if it was published by the New York Times, the media field should hold that name. I cannot find the corresponding nodes in the HTML. Any ideas on how to extend my code to add these two fields?

Thanks in advance.


Solution

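  • Perhaps try this. It first selects the article elements and then extracts each field, including the source and the time, from within them: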

    news <- function(term) {
      url <- paste0("https://news.google.com/search?q=", term, "&hl=es-419&gl=US&ceid=US:es-419")
      nodeset <- read_html(url) %>% html_nodes("article")
      tibble::tibble(
        Title = nodeset %>% html_nodes("h3") %>% html_text(), 
        Link = nodeset %>% html_nodes("h3 > a") %>% html_attr("href") %>% xml2::url_absolute(url), 
        Description = nodeset %>% html_nodes("div.Da10Tb.Rai5ob > span") %>% html_text(), 
        Source = nodeset %>% html_nodes("div.QmrVtf.RD0gLb.kybdz > div > a") %>% html_text(), 
        Time = nodeset %>% html_nodes("div.QmrVtf.RD0gLb.kybdz > div > time") %>% html_attr("datetime")
      )
    }
    

    Output

    > news("coronavirus")
    
    # A tibble: 100 x 5
       Title                                Link                                           Description                                        Source   Time     
       <chr>                                <chr>                                          <chr>                                              <chr>    <chr>    
     1 India reporta 41.100 casos nuevos d~ https://news.google.com/articles/CBMikwFodHRw~ "NUEVA DELHI (AP) — India reportó el domingo 41.1~ La Voz ~ 2020-11-~
     2 El ecuatoriano Diego Palacios, de L~ https://news.google.com/articles/CBMigwFodHRw~ "El defensa del LAFC, Diego Palacios, se encuentr~ ESPN De~ 2020-11-~
     3 Coronavirus: Austria endurece medid~ https://news.google.com/articles/CAIiEL2L0sxq~ "El canciller Sebastian Kurz pidió a la población~ DW (Esp~ 2020-11-~
     4 ++Coronavirus hoy: Gobierno alemán ~ https://news.google.com/articles/CAIiEKCZppoU~ "\"Todos los países que levantaron sus restriccio~ DW (Esp~ 2020-11-~
     5 ++Coronavirus hoy++ México supera e~ https://news.google.com/articles/CAIiEK8ndryG~ "El COVID-19 se consolidó como la cuarta causa de~ DW (Esp~ 2020-11-~
     6 Coronavirus en Estados Unidos: 5 ci~ https://news.google.com/articles/CAIiEFFHgJgZ~ "La incertidumbre política y la emergencia sanita~ BBC New~ 2020-11-~
     7 México supera el millón de casos de~ https://news.google.com/articles/CBMiRWh0dHBz~ "México sobrepasó el millón de casos confirmados ~ Reuters~ 2020-11-~
     8 Massachusetts reporta 2.800 casos d~ https://news.google.com/articles/CBMiXmh0dHA6~ "Los casos registrados en la más reciente jornada~ El Tiem~ 2020-11-~
     9 ¿Qué hará NYC para resistir una seg~ https://news.google.com/articles/CBMifWh0dHBz~ "Reaccionan políticos locales a la orden de cerra~ NY1 Not~ 2020-11-~
    10 + Coronavirus hoy: Italia suma 544 ~ https://news.google.com/articles/CAIiEJ4KB7k2~ "Argentina registró este sábado (14.11.2020) 8.46~ DW (Esp~ 2020-11-~
    # ... with 90 more rows
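
The Time column above is returned as character. As a minimal sketch, it could be converted to a date-time and a plain date as shown below. The helper name parse_news_time is hypothetical, and the format string assumes the datetime attribute looks like "2020-11-15T10:30:00Z"; the real values are truncated in the output above, so check them before relying on this.

```r
# Hypothetical helper: converts the character Time column to POSIXct (UTC).
# Assumption: values look like "2020-11-15T10:30:00Z"; adjust `format` if not.
parse_news_time <- function(x) {
  as.POSIXct(x, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
}

# Made-up example values, including a missing one.
time_chr <- c("2020-11-15T10:30:00Z", NA)

time_utc  <- parse_news_time(time_chr)  # NA stays NA
date_only <- as.Date(time_utc)          # keep only the calendar date
```

With the tibble returned by news(), the same helper could be applied as dplyr::mutate(Time = parse_news_time(Time)).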
    

    Update

    I had not considered the following cases:

    1. Nested articles.
    2. Articles missing the datetime attribute.

    I have updated the code to handle both cases. It now processes each article separately with html_node(), which yields NA when an element is missing, so it is noticeably less efficient. Anyway, try this:

    news <- function(term) {
      url <- paste0("https://news.google.com/search?q=", term, "&hl=es-419&gl=US&ceid=US:es-419")
      nodeset <- read_html(url) %>% html_nodes("article")
      # Build a one-row tibble per article; html_node() gives NA for missing fields.
      dplyr::bind_rows(lapply(nodeset, function(x) tibble::tibble(
        Title = x %>% html_node(".ipQwMb.ekueJc.RD0gLb") %>% html_text(), 
        Link = x %>% html_node(".ipQwMb.ekueJc.RD0gLb > a") %>% html_attr("href") %>% xml2::url_absolute(url), 
        Description = x %>% html_node("div.Da10Tb.Rai5ob > span") %>% html_text(), 
        Source = x %>% html_node("div.QmrVtf.RD0gLb.kybdz > div > a") %>% html_text(), 
        Time = x %>% html_node("div.QmrVtf.RD0gLb.kybdz > div > time") %>% html_attr("datetime")
      )))
    }
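
To illustrate why missing elements matter, here is a small offline sketch. The HTML string is made up and much simpler than the real Google News markup. It shows that html_nodes() drops articles without a match, so the columns can end up with different lengths, while html_node() returns exactly one result per article and NA where nothing matches.

```r
library(rvest)

# Made-up, simplified markup: the second article has no <time> element.
page <- read_html('
  <article><h3><a href="./articles/a1">First</a></h3>
    <time datetime="2020-11-15T10:00:00Z">ayer</time></article>
  <article><h3><a href="./articles/a2">Second</a></h3></article>
')

articles <- html_nodes(page, "article")

# html_nodes() collects every match across all articles:
# two titles but only one time, so the columns would not line up.
titles    <- articles %>% html_nodes("h3") %>% html_text()
times_all <- articles %>% html_nodes("time") %>% html_attr("datetime")

# html_node() yields one result per article, with NA for the missing <time>.
times_each <- articles %>% html_node("time") %>% html_attr("datetime")

length(titles)      # 2
length(times_all)   # 1
length(times_each)  # 2
```

Because times_each has one entry per article, it lines up with the other columns even when some articles have no time element.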