r screen-scraping rvest

R Glassdoor Web Scraping


I have been tasked with collecting Glassdoor reviews for different hospitals, and I am having difficulty extracting the Pros, Cons, Advice to management, Recommend, CEO Approval, Business Outlook, and the small rating drop-down. I have been able to extract the rest with the code below. Any help would be greatly appreciated.

library(rvest)
library(tidyverse)
library(stringr)

url <- "https://www.glassdoor.com/Reviews/Montefiore-Nyack-Hospital-Reviews-E2312619.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng"
page <- read_html(url)

# Extract review titles

review_titles <- page %>%
html_nodes(".reviewLink") %>%
html_text()

# Extract review dates

review_dates <- page %>%
html_nodes(".middle.common__EiReviewDetailsStyle__newGrey") %>%
html_text()

#Extract Pros
review_pros <- page %>%
html_nodes(".v2__EIReviewDetailsV2__fullWidth") %>%
html_text()
print(review_pros)

# Extract review ratings

review_ratings <- page %>%
html_nodes(".ratingNumber.mr-xsm") %>%
html_text() %>%
str_extract("\\d+") %>%
as.integer()

# Extract review recommendations

recommendations <- page %>%
html_nodes("html body.main.loggedIn.lang-en.en-US.gdGrid._initOk div#Container div.container-max-width.mx-auto.px-0.px-lg-lg.py-lg-xxl div.d-flex.row.css-zwxlu7.e1af7d9i0 main.col-12.mb-lg-0.mb-md.css-yaeagj.ej1dgw00 div#ReviewsRef div#ReviewsFeed ol.empReviews.emp-reviews-feed.pl-0 li#empReview_76309432.noBorder.empReview.cf.pb-0.mb-0 div.p-0.mb-0.mb-md-std.css-w5wad1.gd-ui-module.css-rntt2a.ec4dwm00 div.gdReview div.mt-xxsm div.mx-0 div.px-std div div.d-flex.my-std.reviewBodyCell.recommends.css-1y3jl3a.e1868oi10") %>%
html_text()

# Convert recommendations to numeric values

recommendations_numeric <- ifelse(grepl("css-hcqxoa-svg", recommendations), 1,
ifelse(grepl("css-1y3jl3a-svg", recommendations), -1, 0))

# Create data frame

reviews <- data.frame(Title = review_titles, Rating = review_ratings, Date = review_dates)

# View data frame

reviews

Solution

  • The data you are looking for is stored in a script on the page rather than in the rendered HTML. This answer is based on a similar question, "Web scraping data that is not displayed on a webpage using rvest".

    It took some searching and trial and error to get this right. The script contains a section that starts with "reviews": and ends with }]}. In this case the section we need begins at the second occurrence of "reviews":, so the task is to extract that part and convert it from JSON.

    library(stringr) 
    library(xml2)
    library(rvest) 
    library(dplyr)
    
    url <- "https://www.glassdoor.com/Reviews/Montefiore-Nyack-Hospital-Reviews-E2312619.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng"
    page <- read_html(url)
    
    
    #the ratings are stored in a data structure in a script
    #find all the scripts and then search
    scripts<-page %>% html_elements(xpath='//script')
    
    #search the scripts for the ratings
    ratingsScript <- which(grepl("ratingCareerOpportunities", scripts))
    
    #Extract the text for the reviews from the script. The data is in the second "reviews" section, which is almost valid JSON
    reviews <-scripts[ratingsScript] %>% html_text2() %>% 
       str_extract("\"reviews\":.+?\\}\\]\\}") %>% substring(10) %>% str_extract("\"reviews\":.+?\\}\\]\\}") 
    nchar(reviews)  #debugging status
    
    #add a leading { to make valid JSON and convert; the parsed result is a list whose reviews element is the data frame
    answer <-jsonlite::fromJSON(paste("{", reviews))$reviews
    answer[ , c("ratingRecommendToFriend", "ratingCeo", "ratingBusinessOutlook")]
    

    The answer data frame holds a lot of other potentially useful information: job status, comments, reviewer IDs, star ratings, and so on.
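
Since the live page may change or block automated requests, it can help to check the extraction logic offline first. Below is a minimal sketch of the same two-pass extraction applied to a small made-up string. The helper name extract_second_reviews, the sample text, and its field values (for example "APPROVE") are assumptions for illustration only and are not taken from Glassdoor's actual data.

```r
library(stringr)
library(jsonlite)

# Hypothetical helper: pull the second "reviews" section out of a script's
# text and parse it, following the same steps as the answer above.
extract_second_reviews <- function(script_text) {
  pattern <- "\"reviews\":.+?\\}\\]\\}"
  # First pass: from the first "reviews": up to the first }]}
  first_pass <- str_extract(script_text, pattern)
  # Drop the first 9 characters (the leading "reviews" key) so that the
  # next search starts matching at the second occurrence
  second_pass <- str_extract(substring(first_pass, 10), pattern)
  # Add the missing opening brace and parse; the result is a list whose
  # reviews element is a data frame with one row per review
  fromJSON(paste0("{", second_pass))$reviews
}

# Invented sample text; only its overall shape mimics the real script
sample_text <- paste0(
  '{"summary":{"reviews":{"count":2}},"feed":{"reviews":[',
  '{"reviewId":1,"pros":"Good team","ratingCeo":"APPROVE"},',
  '{"reviewId":2,"pros":"Benefits","ratingCeo":null}',
  ']},"other":1}'
)

demo <- extract_second_reviews(sample_text)
print(demo)
```

Note that a JSON null becomes NA in the resulting data frame, so missing ratings can be handled with is.na(). If the page layout changes so that the first "reviews": section itself ends with }]}, the first pass will stop too early and the second pass will return NA, which is a quick signal that the pattern needs adjusting.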