I'm trying to scrape some information (comments, dates, ratings) for this hotel on TripAdvisor.
Here's my script so far:
import re
import json
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import datetime
import time
import random
root_url = 'https://www.tripadvisor.ca/Hotel_Review-g60982-d87016-Reviews-or'
urls = [ '{root}{i}-Hilton_Hawaiian_Village_Waikiki_Beach_Resort-Honolulu_Oahu_Hawaii.html'.format(root=root_url, i=i) for i in range(5,20,5) ]
comms = []
notes = []
dates = []
for url in urls:
    results = requests.get(url)
    #time.sleep(20)
    soup = BeautifulSoup(results.text, "html.parser")
    commentary = soup.find_all('div', class_='oETBfkHU')
    for container in commentary:
        comm = container.find('q', class_ = 'IRsGHoPm').text.strip()
        comms.append(comm)
        date_tag = container.find("span", class_="_355y0nZn").text.strip()
        dates.append(date_tag)
data = pd.DataFrame({
    'comms' : comms,
    'dates' : dates
})
#print(data.head())
data.to_csv('file.csv', sep=';', index=False)
And here's my output:
I'm not surprised: the date_tag selector isn't specific enough, but I can't see how to pick out the right text.
Here's the HTML:
The "March 2020" text has no class of its own, so I thought container.find("span", class_="_355y0nZn").text.strip() would return it, but it doesn't.
One last thing: I don't know how to pick up the rating. Here's the HTML:
As you can see, there is no text at all. I think the rating is encoded in the class ui_bubble_rating bubble_50, where 50 stands for a rating of 5. How can I scrape that? I've never seen this kind of structure before.
Any ideas?
Thanks :)
Use the code below to get the date value:
date_tag = container.find("div", class_="_1O8E5N17").text
# Split on the first colon only; date_value keeps any leading space, so call .strip() if needed
date_text, date_value = date_tag.split(':', 1)
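The split on the colon can be sketched on its own, without the page. This is only an illustration: the helper name split_label_value and the sample string "Date of stay: March 2020" are assumptions, since the question only shows that the text ends with "March 2020".

```python
from datetime import datetime


def split_label_value(text):
    """Split 'label: value' on the first colon and trim both parts."""
    label, sep, value = text.partition(":")
    if not sep:
        # No colon found: treat the whole string as the value.
        return "", text.strip()
    return label.strip(), value.strip()


# Assumed sample text; the real div content may differ.
label, value = split_label_value("Date of stay: March 2020")

# Optional: turn "March 2020" into a datetime. %B matches full month names
# in the current locale, so this assumes an English locale.
stay_month = datetime.strptime(value, "%B %Y")
print(label, "|", value, "|", stay_month.date())
```

Using partition instead of unpacking a split avoids a ValueError when the text contains no colon, or more than one.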
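For the rating, I needed an extra step to convert the bs4 Tag to a string before running the regex:
comm1 = str(container.find("div", class_="nf9vGX55").find('span'))
rat = re.findall(r'\d+', comm1)  # e.g. ['50'] for bubble_50
rat1 = rat[0][0]  # first digit of the first number, e.g. '5'
rating.append(rat1)  # rating must be initialised as an empty list beforehand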
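As an alternative to converting the tag to a string, the rating could be read straight from the class names. Beautiful Soup returns the class attribute as a list (for example span["class"]), so the sketch below works on such a list. The function name rating_from_classes is hypothetical, and dividing by 10 assumes that bubble_50 means 5 and bubble_45 would mean 4.5, as the question suggests.

```python
import re

# Matches class names such as "bubble_50" and captures the digits.
BUBBLE_RE = re.compile(r"^bubble_(\d+)$")


def rating_from_classes(classes):
    """Return the rating encoded in a list of CSS class names, or None."""
    for name in classes:
        match = BUBBLE_RE.match(name)
        if match:
            # Assumption: the number is the rating multiplied by 10.
            return int(match.group(1)) / 10
    return None


# The list mimics what span["class"] would give for the markup in the question.
print(rating_from_classes(["ui_bubble_rating", "bubble_50"]))
```

Returning None when no bubble_NN class is present lets the caller decide how to record reviews without a rating instead of failing on an index error.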