I'm trying to extract hotel reviews (title and body) along with the rating and date of stay. I came across @Driftr95's brilliant solution for scraping data about TripAdvisor attractions and ran it with no issues.
Original URL (Attraction): https://www.tripadvisor.co.uk/Attraction_Review-g186225-d213774-Reviews-Scudamore_s_Punting_Company-Cambridge_Cambridgeshire_England.html
However, when I replace the URL with a hotel rather than an attraction, the code fails to return the right data.
Desired URL (Hotel): https://www.tripadvisor.com/Hotel_Review-g186405-d215170-Reviews-Portreeves-Arundel_Arun_District_West_Sussex_England.html
As such, I've modified the original code to reflect the new nxt_pg_sel and review_sel containers as follows. However, the bubble rating and review date in the resulting data frame are blank.
nxt_pg_sel = 'a.next'
review_sel = 'div[data-test-target="HR_CC_CARD"]'
rev_dets_sel = {
    'from_page': ('', '"staticVal"'),
    'profile_name': 'span>a[href^="\/Profile\/"]',
    'profile_link': ('span>a[href^="\/Profile\/"]', 'href'),
    'about_reviewer': 'span:has(>a[href^="\/Profile\/"])+div',
    'review_votes': 'button[aria-label="Click to add helpful vote"]>span',
    'bubbles': ('svg[aria-label$=" of 5 bubbles"]', 'aria-label'),
    'review_link': ('a[href^="\/ShowUserReviews-"]', 'href'),
    'review_title': 'a[href^="\/ShowUserReviews-"]',
    'about_review': 'div:has(>a[href^="/ShowUserReviews-"])+div:not(:has(div))',
    'review_body': 'div:has(>a[href^="/ShowUserReviews-"])~div>div',
    'review_date': 'div:has(>a[href^="/ShowUserReviews-"])~div:last-child>div',
}
I also attempted to capture the stay_date which broke the code entirely.
'review_date': 'div:has(>a[href^="/ShowUserReviews-"])~div:last-child>div',
I've tried inspecting the dynamic tags on the new URL but cannot figure out why these elements are not being scraped. I'd appreciate any suggestions. Many thanks!
PS: the original code, which works flawlessly (but only for attractions), is included below.
##selectforlist
def selectForList(tagSoup, selectors, printList=False):
    if isinstance(selectors, dict):
        return dict(zip(selectors.keys(), selectForList(
            tagSoup, selectors.values(), printList)))
    selGen = (( list(sel if isinstance(sel, (tuple, list)) ## generate params
                else [sel])+[None]*2 )[:3] for sel in selectors)
    returnList = [ sel[0] if sel[1] == '"staticVal"' ## [allows placeholders]
                   else selectGet(tagSoup, *sel) for sel in selGen ]
    if printList and not isinstance(printList,str): print(returnList)
    if isinstance(printList,str): print(*returnList, sep=printList)
    return returnList
##original selectors
nxt_pg_sel = 'a[href][data-smoke-attr="pagination-next-arrow"]'
review_sel = 'div[data-automation="reviewCard"]'
rev_dets_sel = {
    'from_page': ('', '"staticVal"'),
    'profile_name': 'span>a[href^="\/Profile\/"]',
    'profile_link': ('span>a[href^="\/Profile\/"]', 'href'),
    'about_reviewer': 'span:has(>a[href^="\/Profile\/"])+div',
    'review_votes': 'button[aria-label="Click to add helpful vote"]>span',
    'bubbles': ('svg[aria-label$=" of 5 bubbles"]', 'aria-label'),
    'review_link': ('a[href^="\/ShowUserReviews-"]', 'href'),
    'review_title': 'a[href^="\/ShowUserReviews-"]',
    'about_review': 'div:has(>a[href^="/ShowUserReviews-"])+div:not(:has(div))',
    'review_body': 'div:has(>a[href^="/ShowUserReviews-"])~div>div',
    'review_date': 'div:has(>a[href^="/ShowUserReviews-"])~div:last-child>div',
}
##set variables
csv_fn_revs = 'Scudamore_s_Punting_Company-tripadvisor_reviews.csv'
csv_fn_pgs = 'Scudamore_s_Punting_Company-tripadvisor_review_pages.csv'
pgNum, maxPages = 0, None
pageUrl = 'https://www.tripadvisor.co.uk/Attraction_Review-g186225-d213774-Reviews-Scudamore_s_Punting_Company-Cambridge_Cambridgeshire_England.html'
##scrape data using web driver
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
browser = webdriver.Chrome()
browser.maximize_window() # maximize window
reveiws_list, pgList = [], []
while pageUrl and (maxPages is None or pgNum < maxPages):
    pgNum += 1
    pgList.append({'page': pgNum, 'URL': pageUrl})
    try:
        browser.get(pageUrl)
        rev_dets_sel['from_page'] = (pgNum, '"staticVal"')
        pgSoup = BeautifulSoup(browser.page_source, 'html.parser')
        rev_cards = pgSoup.select(review_sel)
        reveiws_list += [selectForList(r, rev_dets_sel) for r in rev_cards]
        pgList[-1]['reviews'] = len(rev_cards)
        next_page = pgSoup.select_one(nxt_pg_sel)
        if next_page:
            pageUrl = 'https://www.tripadvisor.co.uk' + next_page.get('href')
            pgList[-1]['next_page'] = pageUrl
            print('going to', pageUrl)
        else:
            pageUrl = None # stop condition
    except Exception as e:
        print(f'Stopping on pg{pgNum} due to {type(e)}:\n{e}')
        break
browser.quit() # Close the browser
# Save as csv
pd.DataFrame(reveiws_list).to_csv(csv_fn_revs, index=False)
pd.DataFrame(pgList).to_csv(csv_fn_pgs, index=False)
"I also attempted to capture the stay_date which broke the code entirely."
Did it raise an error? Could you elaborate on the error? [If it just generates irrelevant data, then that's within normal behavior since that selector now leads to something else.]
(Also, I'm actually pleasantly surprised by how many of the old selectors still work - you often have to figure out an entirely new set of selectors for a new type of page...)
For the bubbles, I suggest now using
'bubbles': ('div[data-test-target="review-rating"]>span', 'class'),
and for the stay_date, try
'stay_date': 'div[data-test-target="review-title"]+div>div:nth-child(2)>span:first-child',
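Note that the bubbles selector above reads the span's class attribute instead of an aria-label, so the value will be one or more class names rather than text like "5.0 of 5 bubbles". Here is a minimal sketch for turning that into a number. It assumes the class value contains a token like bubble_50 (for 5.0) or bubble_45 (for 4.5), and that it may arrive either as a list or as a space-separated string; bubbles_to_rating is a hypothetical helper, not part of the original code.

```python
import re

def bubbles_to_rating(class_value):
    """Convert a class value such as 'ui_bubble_rating bubble_45' to 4.5.

    Assumption: the rating is encoded as 'bubble_NN', where NN is the
    rating multiplied by 10. Returns None if no such token is found.
    """
    if class_value is None:
        return None
    # BeautifulSoup returns multi-valued attributes such as class as a list
    if isinstance(class_value, (list, tuple)):
        class_value = ' '.join(class_value)
    match = re.search(r'bubble_(\d+)', str(class_value))
    return int(match.group(1)) / 10 if match else None

# Example values (assumed format, not copied from a live page)
print(bubbles_to_rating(['ui_bubble_rating', 'bubble_50']))
print(bubbles_to_rating('ui_bubble_rating bubble_45'))
```

You could apply it to the scraped column afterwards, for example with `df['bubbles'].map(bubbles_to_rating)`.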
Overall, this is the new set of selectors I'd suggest using for hotels:
nxt_pg_sel = 'a.next[href]' # '[data-smoke-attr="pagination-next-arrow"]'
# review_sel = 'div[data-automation="reviewCard"]'
review_sel = 'div[data-test-target="HR_CC_CARD"]'
rev_dets_sel = {
    'from_page': ('', '"staticVal"'),
    'profile_name': 'span>a[href^="\/Profile\/"]',
    'profile_link': ('span>a[href^="\/Profile\/"]', 'href'),
    # 'about_reviewer': 'span:has(>a[href^="\/Profile\/"])+div',
    'about_reviewer': 'a.ui_social_avatar+div>div+div+div',
    # 'review_votes': 'button[aria-label="Click to add helpful vote"]>span',
    # 'bubbles': ('svg[aria-label$=" of 5 bubbles"]', 'aria-label'),
    'bubbles': ('div[data-test-target="review-rating"]>span', 'class'),
    'review_link': ('a[href^="\/ShowUserReviews-"]', 'href'),
    'review_title': 'a[href^="\/ShowUserReviews-"]',
    # 'about_review': 'div:has(>a[href^="\/ShowUserReviews-"])+div:not(:has(div))',
    'review_body': 'div:has(>a[href^="\/ShowUserReviews-"])~div>div',
    # 'review_date': 'div:has(>a[href^="\/ShowUserReviews-"])~div:last-child>div',
    'stay_date': 'div[data-test-target="review-title"]+div>div:nth-child(2)>span:first-child',
}
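The stay_date selector should give you text rather than a date value. If you want something sortable, here is a rough sketch for converting it. It assumes the scraped text looks like "Date of stay: March 2023" (I have not checked the wording on every page) and that month names are in English; parse_stay_date is a hypothetical helper, not part of the original code.

```python
from datetime import datetime

def parse_stay_date(text, prefix='Date of stay:'):
    """Parse text like 'Date of stay: March 2023' into a date.

    Assumption: the value is '<prefix> <Month name> <year>'. The day is
    not given, so the result is the first day of that month. Returns
    None when the text is empty or does not match.
    """
    if not text:
        return None
    cleaned = text.strip()
    if cleaned.startswith(prefix):
        cleaned = cleaned[len(prefix):].strip()
    try:
        return datetime.strptime(cleaned, '%B %Y').date()
    except ValueError:
        return None

# Example value (assumed format)
print(parse_stay_date('Date of stay: March 2023'))
```

Anything that does not match the assumed pattern comes back as None, so blank or unexpected cells are easy to spot in the CSV.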
(I've uploaded my results to the same spreadsheet as before.)
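One more thing to watch when switching URLs: the loop in your question builds the next page URL by prefixing 'https://www.tripadvisor.co.uk', while your hotel URL is on tripadvisor.com. A small sketch of one way to avoid hard-coding the domain is to resolve the href against the current page URL with urljoin from the standard library. next_page_url is a hypothetical helper and the href below is a made-up placeholder, not a real TripAdvisor link.

```python
from urllib.parse import urljoin

def next_page_url(current_url, href):
    """Resolve a (possibly relative) next-page href against the current URL.

    Returns None when there is no href, which can serve as the stop
    condition for the paging loop.
    """
    return urljoin(current_url, href) if href else None

current = ('https://www.tripadvisor.com/Hotel_Review-g186405-d215170-Reviews-'
           'Portreeves-Arundel_Arun_District_West_Sussex_England.html')
# Placeholder href for illustration only
print(next_page_url(current, '/Hotel_Review-example-next-page.html'))
print(next_page_url(current, None))
```

In the loop, that would mean something like `pageUrl = next_page_url(pageUrl, next_page.get('href'))` in place of the string concatenation.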