Using rvest to scrape HTML Data


I am trying to scrape Hockey Reference for a Data Science 101 project, and I am running into issues with one particular table. The webpage is: https://www.hockey-reference.com/boxscores/201611090BUF.html. The table I want is under the heading "Advanced Stats Report (All Situations)". I have tried the following code:

url="https://www.hockey-reference.com/boxscores/201611090BUF.html"
ret <- url %>%
  read_html()%>%
  html_nodes(xpath='//*[contains(concat( " ", @class, " " ), concat( " ", "right", " " ))]') %>%
  html_text()

This code scrapes all data from the tables above, but stops before the advanced table. I have also tried to get more granular with:

url="https://www.hockey-reference.com/boxscores/201611090BUF.html"
ret <- url %>%
  read_html()%>%
  html_nodes(xpath='//*[(@id = "OTT_adv")]//*[contains(concat( " ", @class, " " ), concat( " ", "right", " " ))]') %>%
  html_text()

which returns "character(0)". Any and all help would be appreciated. If it's not already clear, I'm fairly new to R. Thanks!


Solution

  • The table you are trying to grab is stored inside an HTML comment on the web page, so it is not part of the parsed node tree and ordinary selectors cannot reach it. Here is a solution; the final results will still need some cleanup:

    library(rvest)
    url <- "https://www.hockey-reference.com/boxscores/201611090BUF.html"
    
    page <- read_html(url)  # parse html
    
    commentedNodes <- page %>%
      html_nodes('div.section_wrapper') %>%  # select nodes that contain comments
      html_nodes(xpath = 'comment()')        # select comments within those nodes
    
    # there are multiple (3) nodes containing comments;
    # the 2nd one was chosen via trial and error
    output <- commentedNodes[2] %>%
      html_text() %>%                # return comment contents as text
      read_html() %>%                # parse text as html
      html_elements('table') %>%     # select table nodes
      html_table()                   # parse tables and return a list of data.frames
    

    Output will be a list of 2 elements, one for each table. The player names and stats are repeated once for each option available, so you will need to clean up this data for your final purpose.
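
To see why the original selectors came back empty, and to avoid picking the comment by trial and error, here is a minimal offline sketch. The HTML string, the `demo_adv` id, and the player row are made up for illustration and are not taken from the real page; the idea is to filter the comment text by the table id you are after and then re-parse it. On the real page you would presumably search for the id from the question (`OTT_adv`), but I have not verified that against the live site.

```r
library(rvest)

# Hypothetical stand-in for the real page: two commented-out blocks,
# only one of which holds the table we want.
html <- '<html><body>
<div class="section_wrapper"><!-- <p>unrelated note</p> --></div>
<div class="section_wrapper"><!--
<table id="demo_adv">
<tr><th>Player</th><th>CF</th></tr>
<tr><td>A. Skater</td><td>12</td></tr>
</table>
--></div>
</body></html>'

page <- read_html(html)

# The commented-out table is not a node in the parsed document,
# so a plain selector finds nothing.
visible_tables <- html_elements(page, "table")

# Extract the text of every comment inside the wrapper divs.
comment_text <- page %>%
  html_elements("div.section_wrapper") %>%
  html_elements(xpath = "comment()") %>%
  html_text()

# Keep the comment that mentions the table id instead of guessing an index.
wanted <- comment_text[grepl('id="demo_adv"', comment_text, fixed = TRUE)]

# Re-parse the comment text as HTML and read the table.
hidden_table <- wanted %>%
  read_html() %>%
  html_element("table") %>%
  html_table()
```

Filtering on the id keeps working if the site adds or reorders commented blocks, whereas a hard-coded index like `[2]` would silently pick the wrong one.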