python beautifulsoup web-crawler tripadvisor

How do I crawl TripAdvisor?


I am trying to get the URLs of all restaurants in Singapore, but my code is not working:

data = requests.get("https://www.tripadvisor.com.sg/Restaurants-g294265-Singapore.html").text

soup = BeautifulSoup(data, "html.parser")

for link in soup.find_all('a', {'property_title'}):
    print('https://www.tripadvisor.com/Restaurant_Review-g294265-' + link.get('href'))
    print(link.string)

It keeps on loading and loading again at the line soup = BeautifulSoup(data, "html.parser").

I don't know why this happens, since the same approach works well for other sites.

Is this because TripAdvisor blocks crawling, or is my code wrong?


Solution

  • It keeps on loading and loading again

    To get a response, add the user-agent header:

    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }
    
    data = requests.get(
        "https://www.tripadvisor.com.sg/Restaurants-g294265-Singapore.html", headers=headers
    ).text
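
Since the original symptom was a request that never returns, it may also help to set a timeout so a stalled request fails quickly instead of hanging. A minimal sketch, assuming a hypothetical helper named `fetch_html` and an arbitrary 10-second timeout (neither is from the question); whether a browser-like user agent is enough depends on how the site behaves at the time:

```python
import requests

# Same browser-like user agent as in the snippet above.
HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}


def fetch_html(url, timeout=10):
    """Download a page and return its text.

    `fetch_html` is a hypothetical helper name, and the 10-second
    default timeout is an arbitrary choice for illustration.
    """
    # Without a timeout, requests can wait indefinitely for a response.
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of returning an error page.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html("https://www.tripadvisor.com.sg/Restaurants-g294265-Singapore.html")` should then either return the page text or raise a `requests.exceptions.RequestException` (for example on a timeout), rather than blocking forever.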
    

    However, the data is loaded dynamically with JavaScript, and requests only downloads the raw HTML without executing any scripts. The data is still available in JSON format embedded in the page source (it's not clear exactly what you want to scrape). To get all of the data, you can use the json and re modules:

    import json
    import re
    ...
    
    data = requests.get(
        "https://www.tripadvisor.com.sg/Restaurants-g294265-Singapore.html", headers=headers
    ).text
    
    # Capture the object assigned to window.__WEB_CONTEXT__; the result is still a string
    json_data = re.search(r"window\.__WEB_CONTEXT__=({.*});", data, flags=re.MULTILINE).group(1)
    
    print(
        # Prints all the data as raw text; pass it to `json.loads` instead to access the data
        json_data
    )
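
To work with the captured text as Python objects rather than a string, it can be passed to `json.loads`. A minimal sketch, assuming the captured text is strict JSON; if the page emits a JavaScript object literal instead (for example with unquoted keys), parsing fails and the helper returns `None`. The name `extract_web_context` and the sample HTML are made up for illustration:

```python
import json
import re

# Same pattern as in the snippet above, with the brace escaped explicitly.
WEB_CONTEXT_RE = re.compile(r"window\.__WEB_CONTEXT__=(\{.*\});")


def extract_web_context(html):
    """Return the parsed object assigned to window.__WEB_CONTEXT__, or None.

    `extract_web_context` is a hypothetical helper name.
    """
    match = WEB_CONTEXT_RE.search(html)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        # The captured text was not strict JSON (e.g. unquoted keys).
        return None


# Made-up sample standing in for the downloaded page text.
sample_html = '<script>window.__WEB_CONTEXT__={"restaurants": [{"name": "Example"}]};</script>'
context = extract_web_context(sample_html)
print(context["restaurants"][0]["name"])
```

Returning `None` on failure keeps the caller's code simple; raising an exception instead would be a reasonable alternative if a parsing failure should stop the crawl.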
    

    To get all the links:

    import re
    import requests
    
    
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }
    
    data = requests.get(
        "https://www.tripadvisor.com.sg/Restaurants-g294265-Singapore.html", headers=headers
    ).text
    
    # Each captured path already starts with "/", so the base URL has no trailing slash
    for link in re.findall(r'"detailPageUrl":"(.*?)"', data):
        print("https://www.tripadvisor.com.sg" + link)
    

    Output (truncated):

    https://www.tripadvisor.com.sg/Restaurant_Review-g294265-d1145149-Reviews-Grand_Shanghai_Restaurant-Singapore.html
    https://www.tripadvisor.com.sg/Restaurant_Review-g294265-d1193730-Reviews-Entre_Nous_creperie-Singapore.html
    https://www.tripadvisor.com.sg/Restaurant_Review-g294265-d1173583-Reviews-The_Courtyard-Singapore.html
    https://www.tripadvisor.com.sg/Restaurant_Review-g294265-d4611806-Reviews-NOX_Dine_in_the_Dark-Singapore.html
    https://www.tripadvisor.com.sg/Restaurant_Review-g294265-d13152787-Reviews-Positano_Risto-Singapore.html
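
The link extraction above can also be wrapped in a small function that builds absolute URLs with `urllib.parse.urljoin` and skips duplicates. Using `urljoin` avoids a doubled slash when the captured path already starts with `/`. This is only a sketch: the function name `extract_restaurant_links` and the sample text are made up for illustration, and the `"detailPageUrl"` key is taken from the regex in the answer, so it may not match the page if the site changes.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.tripadvisor.com.sg/"


def extract_restaurant_links(page_text, base_url=BASE_URL):
    """Return absolute restaurant URLs found in page_text, without duplicates.

    `extract_restaurant_links` is a hypothetical helper name.
    """
    links = []
    seen = set()
    for path in re.findall(r'"detailPageUrl":"(.*?)"', page_text):
        # urljoin handles the leading "/" in the path correctly.
        url = urljoin(base_url, path)
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links


# Made-up fragment standing in for the downloaded page text;
# the first entry is repeated on purpose to show de-duplication.
sample_text = (
    '{"detailPageUrl":"/Restaurant_Review-g294265-d1145149-Reviews-Grand_Shanghai_Restaurant-Singapore.html"},'
    '{"detailPageUrl":"/Restaurant_Review-g294265-d1193730-Reviews-Entre_Nous_creperie-Singapore.html"},'
    '{"detailPageUrl":"/Restaurant_Review-g294265-d1145149-Reviews-Grand_Shanghai_Restaurant-Singapore.html"}'
)

for url in extract_restaurant_links(sample_text):
    print(url)
```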