python-2.7 web-scraping beautifulsoup tripadvisor

For Loop trying to scrape TripAdvisor Restaurant data


I am trying to scrape a list of all the restaurants in Hong Kong and their corresponding URLs. With the code below I can scrape the 1st and 2nd pages, but I want the for loop towards the bottom to be more dynamic and keep scraping until it reaches the number of entries I specified in range().

I am still a novice at this so any help would be awesome.

#import libraries
import requests
from bs4 import BeautifulSoup
import csv


#scrape the first page separately because its URL is different than the URLs of the following pages
url0 = 'https://www.tripadvisor.com/Restaurants-g294217-Hong_Kong.html#EATERY_LIST_CONTENTS'
r = requests.get(url0)
data = r.text
soup = BeautifulSoup(r.text, "html.parser")
for link in soup.findAll('a', {'property_title'}):
    print 'https://www.tripadvisor.com/Restaurant_Review-g294217-' + link.get('href')
    print link.string

#loop to move into the next pages. entries are in increments of 30 per page
for i in range(0, 120, 30):
    entries = str(30)
    #url format offsets the restaurants in increments of 30 after the oa; hence entries as variable
    url1 = 'https://www.tripadvisor.com/Restaurants-g294217-oa' + entries + '-Hong_Kong.html#EATERY_LIST_CONTENTS'
    r1 = requests.get(url1)
    data1 = r1.text
    soup1 = BeautifulSoup(data1, "html.parser")
    for link in soup1.findAll('a', {'property_title'}):
        print 'https://www.tripadvisor.com/Restaurant_Review-g294217-' + link.get('href')
        print link.string
    break

Solution

  • Ended up using the loop variable itself as the offset in the URL, which got it to loop the way I wanted it to. Hope this helps people in the future

    for i in range(30, 120, 30):
        #url format offsets the restaurants in increments of 30 after the oa; hence the loop variable i as the offset
        url1 = 'https://www.tripadvisor.com/Restaurants-g294217-oa' + str(i) + '-Hong_Kong.html#EATERY_LIST_CONTENTS'
        r1 = requests.get(url1)
        data1 = r1.text
        soup1 = BeautifulSoup(data1, "html.parser")
        #each restaurant link is an <a> tag with the class property_title
        for link in soup1.findAll('a', class_='property_title'):
            print 'https://www.tripadvisor.com/Restaurant_Review-g294217-' + link.get('href')
            print link.string
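
To make the number of entries a parameter rather than hard-coding it in range(), the page URLs could be built by a small helper first. This is only a rough sketch: page_urls, BASE and SUFFIX are names I made up for illustration, and it assumes the URL pattern from the question (no offset on the first page, then -oa30, -oa60, ... on the later ones) still holds.

```python
# Hypothetical helper, not part of the original script: it only builds the
# listing-page URLs, so it can be checked without hitting the site.
BASE = 'https://www.tripadvisor.com/Restaurants-g294217'
SUFFIX = '-Hong_Kong.html#EATERY_LIST_CONTENTS'


def page_urls(total_entries, per_page=30):
    """Return the listing-page URLs needed to cover total_entries restaurants."""
    urls = []
    for offset in range(0, total_entries, per_page):
        if offset == 0:
            # the first page has no '-oa<offset>' part in its URL
            urls.append(BASE + SUFFIX)
        else:
            # later pages carry the offset after 'oa', in steps of per_page
            urls.append(BASE + '-oa' + str(offset) + SUFFIX)
    return urls


if __name__ == '__main__':
    # 120 entries at 30 per page gives four URLs (offsets 0, 30, 60, 90)
    for url in page_urls(120):
        print(url)
```

Each returned URL can then go through the same requests.get / BeautifulSoup steps as above, so the first page no longer needs its own copy of the scraping code.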