python selenium tripadvisor

Scraping hotel info using an existing list of URLs in a CSV file


I have scraped the URLs of 3 hotel information pages from TripAdvisor and stored them in a CSV file. After importing the CSV file, I need to use these 3 URLs to scrape each hotel's name, price range, and hotel class. I am using Selenium.

Here is my code. With the URL of a single hotel, I can scrape the hotel name. However, when there are several hotels to scrape, it doesn't work. There seems to be a problem in the "for" loop.

!pip install selenium

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.keys import Keys
import csv
from time import sleep
from time import time
from random import randint

browser = webdriver.Chrome(executable_path=r'C:\ProgramData\Anaconda3\Lib\site-packages\jupyterlab\chromedriver.exe')
result_list=[]

def start_request(q):
   r = browser.get(q)
   print("crawling "+q)
   return r

def parse(text):
   container1 = browser.find_elements_by_xpath('//*[@id="taplc_hotel_review_atf_hotel_info_web_component_0"]')
   mydict = {}

   for results in container1:
        try:
            mydict['name'] = results.find_element_by_xpath('//*[@id="HEADING"]')

        except Exception as e:
            print(e)
            print('not____________________________found')
            mydict['name'] = 'null'
            result_list.append(mydict)

with open('Best3HotelsLink.csv') as f:
    reader = csv.DictReader(f)
    for row in reader:
          req = row['Link']
          text = start_request(req)
          parse(text)
          sleep(randint(1,3))

import pandas as pd
df = pd.DataFrame(result_list)
df.to_csv('Detailed Hotelinfo.csv')
df

I have also tried to scrape the hotel class and the price range of each hotel, but in vain.

I would like to seek your advice on how to fix the above problems. Many thanks.


Solution

  • If you have a lot of information to scrape, I suggest re-locating the elements on each iteration:

    Try this code:

    def parse(text):
        sleep(2)  # wait for the page to load; `sleep` comes from `from time import sleep` in the question
        xpath = '//*[@id="taplc_hotel_review_atf_hotel_info_web_component_0"]'
        nbrcontainer = len(browser.find_elements_by_xpath(xpath))

        for i in range(nbrcontainer):
            # re-locate the containers on every pass so the element reference is fresh
            results = browser.find_elements_by_xpath(xpath)[i]
            mydict = {}  # start a new dict for each result
            try:
                # .text stores the visible string instead of the WebElement object
                mydict['name'] = results.find_element_by_xpath('//*[@id="HEADING"]').text
            except Exception as e:
                print(e)
                print('not____________________________found')
                mydict['name'] = 'null'
            # append after the try/except so successful lookups are stored too
            result_list.append(mydict)
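
The loop pattern can also be checked without a browser. The following is only a minimal sketch, not the original poster's code: `scrape_all` and `fetch_name` are hypothetical names introduced here, and `fetch_name` stands in for whatever Selenium lookup returns the hotel name for a URL. It assumes the CSV has a `Link` column, as in the question. The point it illustrates is that each row gets its own dict and that the append happens after the try/except, so both found and missing names end up in the result list.

```python
import csv


def scrape_all(csv_file, fetch_name):
    """Return one result dict per row of a CSV file with a 'Link' column.

    fetch_name is a hypothetical callable (url -> hotel name) standing in
    for the Selenium lookup; it may raise if the element is not found.
    """
    results = []
    for row in csv.DictReader(csv_file):
        record = {'url': row['Link']}  # fresh dict for every hotel
        try:
            record['name'] = fetch_name(row['Link'])
        except Exception as e:
            print(e)
            record['name'] = 'null'
        # Outside the try/except: runs on success and on failure.
        results.append(record)
    return results
```

With the question's file this might be called as `scrape_all(open('Best3HotelsLink.csv'), my_lookup)`, where `my_lookup` is your own function that calls `browser.get(url)` and returns the heading text. The resulting list can be passed to `pd.DataFrame` as before.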