python web-scraping beautifulsoup tripadvisor

Python scraping 'things to do' from tripadvisor


From this page, I want to scrape the list 'Types of Things to Do in Miami' (you can find it near the end of the page). Here's what I have so far:

import requests
from bs4 import BeautifulSoup

# Define header to prevent errors
user_agent = "Mozilla/44.0.2 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/9.0.2"

headers = {'User-Agent': user_agent}

new_url = "https://www.tripadvisor.com/Attractions-g34438-Activities-Miami_Florida.html"
# Get response from url
response = requests.get(new_url, headers = headers)
# Encode response for parsing
html = response.text.encode('utf-8')
# Soupify response
soup = BeautifulSoup(html, "lxml")

tag_elements = soup.findAll("a", {"class":"attractions-attraction-overview-main-Pill__pill--23S2Q"})

# Iterate over tag_elements and extract strings
tags_list = []
for i in tag_elements:
    tags_list.append(i.string)

The problem is that I get values like 'Good for Couples (201)', 'Good for Big Groups (130)', and 'Good for Kids (100)', which come from the 'Commonly Searched For in Miami' area of the page, below the "Types of Things..." part. I also don't get some of the values I need, such as "Traveler Resources (7)" and "Day Trips (7)". Both lists, "Things to do..." and "Commonly searched...", use the same class names, and I'm selecting by class in soup.findAll(), which I guess is the cause of the problem. What is the correct way to do this? Is there some other approach I should take?


Solution

  • Getting only the contents under the Types of Things to Do in Miami header is a little tricky. You need to scope the selectors to that section, as in the script below. The script clicks the See all button under that header and, once the click has fired, parses the content you are looking for:

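    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import ui

    driver = webdriver.Chrome()
    wait = ui.WebDriverWait(driver, 10)
    driver.get("https://www.tripadvisor.com/Attractions-g34438-Activities-Miami_Florida.html")

    # Wait (up to 10 s) for the "See all" caret inside the first section.
    # find_element(By.CSS_SELECTOR, ...) replaces find_element_by_css_selector,
    # which newer Selenium 4 releases removed.
    show_more = wait.until(lambda driver: driver.find_element(By.CSS_SELECTOR, "[class='ui_container'] div:nth-of-type(1) .caret-down"))
    # Click through JavaScript so the click fires even if the element is overlaid
    driver.execute_script("arguments[0].click();", show_more)
    # Parse the expanded page, keeping only links inside that same section
    soup = BeautifulSoup(driver.page_source, "lxml")
    items = [item.text for item in soup.select("[class='ui_container'] div:nth-of-type(1) a[href^='/Attractions-']")]
    print(items)
    driver.quit()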

    The output it produces:

    ['Tours (277)', 'Outdoor Activities (255)', 'Boat Tours & Water Sports (184)', 'Shopping (126)', 'Nightlife (126)', 'Spas & Wellness (109)', 'Fun & Games (67)', 'Transportation (66)', 'Museums (61)', 'Sights & Landmarks (54)', 'Nature & Parks (54)', 'Food & Drink (27)', 'Concerts & Shows (25)', 'Classes & Workshops (22)', 'Zoos & Aquariums (7)', 'Traveler Resources (7)', 'Day Trips (7)', 'Water & Amusement Parks (5)', 'Casinos & Gambling (3)', 'Events (2)']
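
The same scoping idea can be tried without a browser. Below is a minimal sketch that finds a section by its heading text and collects only the links inside it, instead of matching a class shared by both lists. The markup in SAMPLE_HTML, the "pill" class, and the helper name links_under_heading are assumptions made up for illustration; the real TripAdvisor HTML is different and changes often, so the lookup would need adjusting against the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup (NOT TripAdvisor's real HTML): both
# sections reuse the same link class, as described in the question.
SAMPLE_HTML = """
<div class="ui_container">
  <div>
    <h2>Types of Things to Do in Miami</h2>
    <a class="pill" href="#">Tours (277)</a>
    <a class="pill" href="#">Outdoor Activities (255)</a>
  </div>
  <div>
    <h2>Commonly Searched For in Miami</h2>
    <a class="pill" href="#">Good for Couples (201)</a>
  </div>
</div>
"""


def links_under_heading(html, heading_text):
    """Return the text of every <a> that shares a parent with the heading."""
    soup = BeautifulSoup(html, "html.parser")
    # Locate the heading by its visible text rather than by class name
    heading = soup.find(
        lambda tag: tag.name in ("h1", "h2", "h3")
        and tag.get_text(strip=True) == heading_text
    )
    if heading is None:
        return []
    # Assumption: the heading and its links sit in the same container element
    section = heading.parent
    return [a.get_text(strip=True) for a in section.find_all("a")]


types = links_under_heading(SAMPLE_HTML, "Types of Things to Do in Miami")
print(types)
```

Because the search starts from the heading, the 'Good for Couples' link is skipped even though it carries the same class as the others.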