python xpath scrapy tripadvisor

Scrapy > IndexError: list index out of range


I'm trying to scrape some data from TripAdvisor. I'm interested in getting the "Price Range / Cuisine & Meals" details of restaurants.

So I use the following XPath to extract each of these 3 lines, which share the same class:

response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__categoryTitle--14zKt"]/text()').extract()[1]

I ran the test directly in the Scrapy shell and it works fine:

scrapy shell https://www.tripadvisor.com/Restaurant_Review-g187514-d15364769-Reviews-La_Gaditana_Castellana-Madrid.html

But when I integrate it into my script, I get the following error:

Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/lib64/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/lib64/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/lib64/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/lib64/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/root/Scrapy_TripAdvisor_Restaurant-master/tripadvisor_las_vegas/tripadvisor_las_vegas/spiders/res_las_vegas.py", line 64, in parse_listing
    (response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()')[1])
  File "/usr/lib/python3.6/site-packages/parsel/selector.py", line 61, in __getitem__
    o = super(SelectorList, self).__getitem__(pos)
IndexError: list index out of range

Here is part of my code, which I explain below:

    # extract restaurant cuisine
    row_cuisine_overviewcard = \
    (response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__categoryTitle--14zKt"]/text()')[1])
    row_cuisine_card = \
    (response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()')[1])
    
    
    if (row_cuisine_overviewcard == "CUISINES"):
        cuisine = \
        response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__tagText--1XLfi"]/text()')[1]
    elif (row_cuisine_card == "CUISINES"):
        cuisine = \
        response.xpath('//div[@class="restaurants-details-card-TagCategories__tagText--2170b"]/text()')[1]
    else:
        cuisine = None

TripAdvisor restaurant pages come in 2 different formats: the first uses an overviewcard class, and the second uses a card class.

So I want to check whether the first is present (overviewcard); if not, try the second (card); and if neither is present, set the value to None.

:D But it looks like Python executes both... and as the second one doesn't exist on the page, the script stops.

Could it be an indentation error?

Thanks for your help. Regards


Solution

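  • Your second selector (row_cuisine_card) fails because the element does not exist on the page, so the selector returns an empty list, and accessing [1] on an empty list raises IndexError. This is not an indentation problem: both selectors are assigned before the if statement runs, so both are always evaluated.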

    Assuming you really want item 1, try this:

    # getall() returns a list of strings, which is simply empty when nothing matches.
    row_cuisine_overviewcard = \
    (response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__categoryTitle--14zKt"]/text()').getall())
    row_cuisine_card = \
    (response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()').getall())
    
    
    # Check that each result has more than 1 item before comparing item 1.
    if (len(row_cuisine_overviewcard) > 1 and row_cuisine_overviewcard[1] == "CUISINES"):
        cuisine = \
        response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__tagText--1XLfi"]/text()')[1]
    elif (len(row_cuisine_card) > 1 and row_cuisine_card[1] == "CUISINES"):
        cuisine = \
        response.xpath('//div[@class="restaurants-details-card-TagCategories__tagText--2170b"]/text()')[1]
    else:
        cuisine = None
    

    You should apply the same kind of safety checking whenever you try to get a specific index from a selector. In other words, make sure you have a value before you access it.
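
    One way to avoid repeating the length check is a small helper. The sketch below is only an illustration: item_at and pick_cuisine are hypothetical names, not part of Scrapy or parsel, and the sample lists are made-up stand-ins for what .getall() would return on each page layout.

```python
def item_at(values, index, default=None):
    """Return values[index], or default when the index is out of range."""
    try:
        return values[index]
    except IndexError:
        return default


def pick_cuisine(overview_titles, overview_tags, card_titles, card_tags):
    """Return the cuisine text from whichever layout is present, else None.

    Each argument is assumed to be a plain list of strings, such as the
    result of calling .getall() on the matching XPath selector.
    """
    if item_at(overview_titles, 1) == "CUISINES":
        return item_at(overview_tags, 1)
    if item_at(card_titles, 1) == "CUISINES":
        return item_at(card_tags, 1)
    return None


# Hypothetical sample data, one call per page layout.
overview_page = pick_cuisine(["PRICE RANGE", "CUISINES"], ["$$", "Spanish"], [], [])
card_page = pick_cuisine([], [], ["PRICE RANGE", "CUISINES"], ["$$", "Italian"])
neither_page = pick_cuisine([], [], [], [])
```

    Because a missing layout just produces an empty list, the lookup falls through to the next branch instead of raising IndexError.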