I am trying to scrape Hockey Reference for a Data Science 101 project. I am running into issues with a particular table. The webpage is:https://www.hockey-reference.com/boxscores/201611090BUF.html. The desired table is under the "Advanced Stats Report (All Situations)". I have tried the following code:
url="https://www.hockey-reference.com/boxscores/201611090BUF.html"
ret <- url %>%
read_html()%>%
html_nodes(xpath='//*[contains(concat( " ", @class, " " ), concat( " ", "right", " " ))]') %>%
html_text()
This code scrapes all data from the tables above, but stops before the advanced table. I have also tried to get more granular with:
url="https://www.hockey-reference.com/boxscores/201611090BUF.html"
ret <- url %>%
read_html()%>%
html_nodes(xpath='//*[(@id = "OTT_adv")]//*[contains(concat( " ", @class, " " ), concat( " ", "right", " " ))]') %>%
html_text()
which produces a "character(0)" messsage. Any and all help would be appreciated..if its not already clear, I'm fairly new to R. Thanks!
The information you are trying to grab is hidden as a comment on the web page. Here is a solution that needs some work to clean up your final results:
library(rvest)
url="https://www.hockey-reference.com/boxscores/201611090BUF.html"
page<-read_html(url) # parse html
commentedNodes<-page %>%
html_nodes('div.section_wrapper') %>% # select node with comment
html_nodes(xpath = 'comment()') # select comments within node
#there are multiple (3) nodes containing comments
#chose the 2 via trial and error
output<-commentedNodes[2] %>%
html_text() %>% # return contents as text
read_html() %>% # parse text as html
html_elements('table') %>% # select table node
html_table() # parse table and return data.frame # parse table and return data.frame
Output will be a list of 2 elements, one for each table. The player names and stats are repeated multiple times of each option available, thus you will need to clean up this data for your final purpose.