Tags: html, r, web-scripting

Extract text from dynamic Web page using R


I am working on a data prep tutorial, using data from this article: https://www.nytimes.com/interactive/2021/01/19/upshot/trump-complete-insult-list.html#

None of the text is hard-coded, everything is dynamic and I don't know where to start. I've tried a few things with packages rvest and xml2 but I can't even tell if I'm making progress or not.

I've used copy/paste and regexes in Notepad++ to get a tabular structure like this:

Target     Attack
AAA News   Fake News
AAA News   Fake News
AAA News   A total disgrace
...        ...
Mr. ZZZ    A real nut job

but I'd like to show how to do everything programmatically (no copy/paste).

My main question is as follows: is that even possible with reasonable effort? And if so, any clues on how to get started?
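For reference, one quick sanity check (a sketch, not a definitive test) is to fetch the raw, un-rendered HTML with rvest and search it for a phrase that appears on the rendered page:

```r
library(rvest)

url <- "https://www.nytimes.com/interactive/2021/01/19/upshot/trump-complete-insult-list.html#"
# If this returns FALSE, the phrase is not in the static HTML,
# which suggests the content is built client-side by JavaScript.
grepl("A total disgrace", as.character(read_html(url)), fixed = TRUE)
```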

PS: I know that this could be a duplicate, I just can't tell of which question since there are totally different approaches out there :\


Solution

  • Here's a programmatic approach with RSelenium and rvest:

    library(RSelenium)
    library(rvest)
    library(tidyverse)
    driver <- rsDriver(browser = "chrome", port = 4234L, chromever = "87.0.4280.87")
    client <- driver[["client"]]
    client$navigate("https://www.nytimes.com/interactive/2021/01/19/upshot/trump-complete-insult-list.html#")
    page.source <- client$getPageSource()[[1]]
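    #Assumption: on a slow connection the app may not have finished rendering
    #when the source is first grabbed; an optional pause-and-retry (5 seconds
    #is a guess, tune as needed) avoids an incomplete snapshot
    Sys.sleep(5)
    page.source <- client$getPageSource()[[1]]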
    
    #Extract nodes for each letter using XPath
    Letters <- read_html(page.source) %>%
      html_nodes(xpath = '//*[@id="mem-wall"]/div[2]/div') 
    
    #Extract Entities using CSS
    Entities <- map(Letters, ~ html_nodes(.x, css = 'div.g-entity-name') %>%
                      html_text)
    
    #Extract quotes using CSS
    Quotes <- map(Letters, ~ html_nodes(.x, css = 'div.g-twitter-quote-container') %>%
                                map(html_nodes, css = 'div.g-twitter-quote') %>%
                                map(html_text))
    
    #Bind the entities and quotes together. Two letters are blank, so fall back to NA
    map2_dfr(Entities, Quotes,
             ~ map2_dfr(.x, .y, ~ {
                 if (length(.x) > 0 && length(.y) > 0) {
                   data.frame(Entity = .x, Insult = .y)
                 } else {
                   data.frame(Entity = NA, Insult = NA)
                 }
               })) -> Result
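To make the fallback concrete, here's a tiny self-contained version with made-up data (entity and quote values are illustrative only): an entity whose quote vector is empty collapses to a single all-NA row.

```r
library(purrr)

# Toy structures mirroring Entities and Quotes above (values invented):
# one "letter" with two entities, the second of which has no quotes.
entities <- list(c("AAA News", "Mr. ZZZ"))
quotes   <- list(list(c("Fake News", "A total disgrace"), character(0)))

map2_dfr(entities, quotes,
         ~ map2_dfr(.x, .y, ~ {
             if (length(.x) > 0 && length(.y) > 0) {
               data.frame(Entity = .x, Insult = .y)
             } else {
               data.frame(Entity = NA, Insult = NA)
             }
           })) -> toy

# toy has 3 rows: two real Entity/Insult pairs plus one all-NA row
```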
    
    #Strip out the quotes
    Result %>%
      mutate(Insult = str_replace_all(Insult,"(^“)|([ .,!?]?”)","") %>% str_trim) -> Result
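As a quick check of that pattern on made-up strings: it removes a leading curly quote, and a trailing curly quote together with one immediately preceding space or punctuation character.

```r
library(stringr)

# Invented examples; the third string has no quotes and passes through unchanged
x <- c("“Fake News”", "“a real nut job!”", "no quotes at all")
cleaned <- str_trim(str_replace_all(x, "(^“)|([ .,!?]?”)", ""))
# cleaned is "Fake News", "a real nut job", "no quotes at all"
```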
    
    #Take a look at the result
    Result %>%
      slice_sample(n=10)
                       Entity                                                              Insult
    1             Mitt Romney                                       failed presidential candidate
    2         Hillary Clinton                                                             Crooked
    3  The “mainstream” media                                                           Fake News
    4               Democrats                                             on a fishing expedition
    5           Pete Ricketts                                             illegal late night coup
    6  The “mainstream” media                                                   anti-Trump haters
    7     The Washington Post do nothing but write bad stories even on very positive achievements
    8               Democrats                                                                weak
    9             Marco Rubio                                                         Lightweight
    10     The Steele Dossier                                                      a Fake Dossier
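When you're done, it's good practice to shut things down so no stray chromedriver process is left behind. A typical cleanup sketch, assuming the `client` and `driver` objects from the code above:

```r
# Close the browser session and stop the Selenium server
client$close()
driver$server$stop()
```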
    

    The XPath was obtained by inspecting the page with Chrome DevTools (F12), hovering over elements until the correct one was highlighted, right-clicking, and choosing Copy XPath as shown:

    (screenshot: Chrome DevTools context menu showing the Copy XPath option)