Tags: r, web-scraping, html-parsing, rvest, html-content-extraction

Rvest scraping webpage content returned from html_text()


I am attempting to scrape (dynamic?) content from a webpage using the rvest package. I understand that dynamic content typically requires tools such as Selenium or PhantomJS.

However, my experimentation leads me to believe I should still be able to find the content I want using only standard web-scraping R packages (rvest, httr, xml2).

For this example I will be using a Google Maps webpage. Here is the example URL:

https://www.google.com/maps/dir/920+nc-16-br,+denver,+nc,+28037/2114+hwy+16,+denver,+nc,+28037/

If you follow the hyperlink above, it will take you to an example webpage. The content I want in this example is the pair of addresses "920 NC-16, Crumpler, NC 28617" and "2114 NC-16, Newton, NC 28658" shown in the top-left corner of the page.

Standard techniques using CSS selectors or XPath did not work, which initially made sense, as I thought this content was dynamic.

library(rvest)

url  <- "https://www.google.com/maps/dir/920+nc-16-br,+denver,+nc,+28037/2114+hwy+16,+denver,+nc,+28037/"
page <- read_html(url)

# The commands below all return {xml_nodeset (0)}
html_nodes(page, css = ".tactile-searchbox-input")
html_nodes(page, css = "#sb_ifc50 > input")
html_nodes(page, xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "tactile-searchbox-input", " " ))]')

The commands above all return {xml_nodeset (0)}, which I thought was a result of this content being generated dynamically. But here is where my confusion lies: if I convert the whole page to text using html_text(), I can find the addresses in the value returned.

x <- html_text(read_html(url))
# 33561 is the approximate position of the first address in the returned text
substring <- substr(x, 33561 - 100, 33561 + 300)

Executing the commands above results in a substring with the following value:

"null,null,null,null,[null,null,null,null,null,null,null,[[[\"920 NC-16, Crumpler, NC 28617\",null,null,null,null,null,null,null,null,null,null,\"Nzm5FTtId895YoaYC4wZqUnMsBJ2rlGI\"]\n,[\"2114 NC-16, Newton, NC 28658\",null,null,null,null,null,null,null,null,null,null,\"RIU-FSdWnM8f-IiOQhDwLoMoaMWYNVGI\"]\n]\n,null,null,0,null,[[null,null,null,null,null,null,null,3]\n,[null,null,null,null,[null,null,null,null,nu"

The substring is very messy but contains the content I need. I've heard that parsing webpages with regex is frowned upon, but I cannot think of any other way of obtaining this content that also avoids dynamic scraping tools.

If anyone has suggestions for parsing the HTML returned, or can explain why I am unable to find the content using XPath or CSS selectors but can find it by simply parsing the raw HTML text, it would be greatly appreciated.

Thanks for your time.


Solution

  • The reason you can't find the text with XPath or CSS selectors is that the string you found sits inside a JavaScript array object embedded in the page source. You were right to assume that the text elements you can see on the screen are loaded dynamically; they are not where you are reading the strings from.
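
    You can check this against the static page source: no visible element matches the selector, and the string most likely sits inside an inline <script> block. A minimal sketch of that check (an assumption on my part, not something verified in the original answer):

    library(rvest)

    page <- read_html(url)

    # No visible element carries the address...
    html_nodes(page, ".tactile-searchbox-input")       # {xml_nodeset (0)}

    # ...but the text of the inline scripts may contain it
    scripts <- html_text(html_nodes(page, "script"))
    any(grepl("920 NC-16", scripts, fixed = TRUE))      # TRUE if it lives in a script block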

    I don't think there's anything wrong with parsing specific HTML with regex. I would ensure that I get the full response body rather than just the html_text() output, in this case by using the httr package. You can grab the address from the page like this:

    library(httr)
    library(magrittr)  # for %>%, extract() and extract2()

    GetAddressFromGoogleMaps <- function(url)
    {
      GET(url)                %>%   # fetch the raw response
      content("text")         %>%   # full response body as a single string
      strsplit("spotlight")   %>%   # the address data appears after the "spotlight" marker
      extract2(1)             %>%
      extract(-1)             %>%   # drop everything before the first split point
      strsplit("[[]{3}(\")*") %>%   # split at the [[[" that opens the address array
      extract2(1)             %>%
      extract(2)              %>%
      strsplit("\"")          %>%   # the address ends at the closing quote
      extract2(1)             %>%
      extract(1)
    }
    

    Now:

    GetAddressFromGoogleMaps(url)
    #[1] "920 NC-16, Crumpler, NC 28617, USA"