python python-3.x web-scraping beautifulsoup

How to scrape the rating and the date on TripAdvisor with BeautifulSoup


I'm trying to scrape some information (comments, dates, ratings) from this hotel on TripAdvisor.

Here's my script so far:

import re
import json
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import datetime
import time
import random


root_url = 'https://www.tripadvisor.ca/Hotel_Review-g60982-d87016-Reviews-or'
urls = [ '{root}{i}-Hilton_Hawaiian_Village_Waikiki_Beach_Resort-Honolulu_Oahu_Hawaii.html'.format(root=root_url, i=i) for i in range(5,20,5) ]

comms = []
notes = []
dates = []


for url in urls: 
    results = requests.get(url)

    #time.sleep(20)

    soup = BeautifulSoup(results.text, "html.parser")

    commentary = soup.find_all('div', class_='oETBfkHU')

    for container in commentary:

        comm  = container.find('q', class_ = 'IRsGHoPm').text.strip()

        comms.append(comm)


        date_tag = container.find("span", class_="_355y0nZn").text.strip()

        dates.append(date_tag)

data = pd.DataFrame({
    'comms' : comms,
    'dates' : dates
    })



#print(data.head())
data.to_csv('file.csv', sep=';', index=False)

And here's my output:

[image: output]

I'm not surprised: date_tag isn't specified precisely enough, but I can't see how to select the right text.

Here's the HTML:

[image: date]

The "March 2020" text has no class of its own, so I thought container.find("span", class_="_355y0nZn").text.strip() would work, but it doesn't.

One last thing: I don't know how to pick out the rating. Here's the HTML:

[image: rating]

As you can see, there is no text at all. I think the rating is encoded in the class ui_bubble_rating bubble_50, where 50 stands for a rating of 5. How can I scrape that? I have never seen this kind of structure before.

Any ideas?

Thanks :)


Solution

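  • Use the code below to get the date value. It reads the whole text of the enclosing div and keeps the part after the colon:

    date_tag = container.find("div", class_="_1O8E5N17").text
    # Split once on the colon: the label comes before it, the date after it
    date_text, date_value = date_tag.split(':', 1)
    dates.append(date_value.strip())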
    

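  • For the rating, the value only appears in the class name of the span (for example bubble_50), so I convert the bs4.element.Tag returned by find() to a string and extract the digits with a regex:

    # rating = [] must be created before the loop, next to comms and dates
    rating_span = str(container.find("div", class_="nf9vGX55").find('span'))
    digits = re.findall(r'\d+', rating_span)  # e.g. ['50'] for bubble_50
    rating.append(int(digits[0]) // 10)       # 50 -> 5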
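
The two extraction steps above can also be written as small helpers that are easy to try without hitting the site. This is only a sketch under assumptions: the function names parse_bubble_rating and parse_labelled_date are made up for illustration, the label "Date of stay:" is assumed from the split on ':' in the answer, and the class list is what BeautifulSoup typically returns for a tag's multi-valued class attribute.

```python
import re


def parse_bubble_rating(class_names):
    """Return the rating encoded in a class such as 'bubble_50', or None.

    class_names is a list of CSS class names, e.g. the value of
    tag.get("class", []) on a BeautifulSoup Tag (assumed input shape).
    """
    for name in class_names:
        match = re.fullmatch(r"bubble_(\d+)", name)
        if match:
            # The digits are the rating multiplied by 10 (50 means 5).
            return int(match.group(1)) / 10
    return None


def parse_labelled_date(text):
    """Return the text after the first colon, or the stripped text if there is none."""
    label, sep, value = text.partition(":")
    return value.strip() if sep else text.strip()


# Sample values (assumed, not scraped from the live page)
print(parse_bubble_rating(["ui_bubble_rating", "bubble_50"]))  # 5.0
print(parse_labelled_date("Date of stay: March 2020"))         # March 2020
```

Inside the loop, the span found under the "nf9vGX55" div could be passed as span.get("class", []) and the "_1O8E5N17" div's .text as the date text. Returning None instead of raising makes it easier to skip reviews where the expected class is missing.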