Tags: python-3.x, web-scraping, xpath, scrapy, scrapy-shell

(XPath Help) Scrapy not going to next page, only scrapes first page


I'm trying to extract the nested URL in the href here so I can create a "next page" XPath selector for my spider, but I can't figure out the right location path to it.
I've been testing my code in the Scrapy shell environment.

Here's my spider source code (using Python 3):

import scrapy


class StarbucksSpider(scrapy.Spider):
    name = 'starbucks'
    allowed_domains = ['gameflip.com/shop/gift-cards/starbucks']
    start_urls = ['https://gameflip.com/shop/gift-cards/starbucks?limit=36&platform=starbucks&accept_currency=USD&status=onsale']

    def parse(self, response):

        slots = response.xpath('//*[@class="listing-detail view-grid col-6 col-md-4 col-lg-3 col-xl-2"]')

        for slot in slots:

            fullPrice = slot.xpath('.//*[@class="col-12 description normal"]/text()').extract_first()
            Discount = slot.xpath('.//*[@class="badge badge-success listing-discount"]/text()').extract_first()
            price = slot.xpath('.//*[@class="money"]/text()').extract_first()
            status = slot.xpath('.//*[@alt="sold"]/@alt').extract_first()

            print('\n')
            print(status)
            print(fullPrice)
            print(Discount)
            print(price)
            print('\n')

            next_PageUrl = response.xpath('//*[@class="btn"]/@href').extract_first()
            absoulute_next_page_url = response.urljoin(next_PageUrl)
            yield scrapy.Request(absoulute_next_page_url)

Please don't hesitate to ask me questions if anything is unclear. Any help is appreciated ;D

Thank you for your time and answers!


Solution

  • As mentioned in the comment from jwjhdev, the content is coming from an API. You can see this in the Network tab of your browser's dev tools when reloading the page. The API URL can be modified to return more or fewer objects per page. If we increase the limit to 150 in your case, we get everything in a single page of data, which means only one request: https://production-gameflip.fingershock.com/api/v1/listing?limit=150&kind=item&category=GIFTCARD&platform=starbucks&status=onsale&sort=_score:desc,shipping_within_days:asc,created:desc&accept_currency=USD

    So instead of getting the data from the rendered webpage and having to use XPaths, we can query the API directly and get structured data that is easier to work with.

    I've modified your spider code slightly below to show how we can get data from the API. We'll use the API URL above as one of the start_urls. However, I noticed that the discount wasn't in the response, so I believe you will have to calculate it in the code (see the sketch after the spider).

    import json
    import scrapy
    
    
    class StarbucksSpider(scrapy.Spider):
        name = 'starbucks'
        start_urls = [
            'https://production-gameflip.fingershock.com/api/v1/listing?limit=150&kind=item&category=GIFTCARD&platform=starbucks&status=onsale&sort=_score:desc,shipping_within_days:asc,created:desc&accept_currency=USD'
        ]
    
    
        def parse(self, response):
            api_data = json.loads(response.text)
            slots = api_data.get('data')
    
            for slot in slots:
                fullPrice = slot.get('name')
                # The discount doesn't seem to be anywhere in the JSON response
                Discount = 'TODO - Calculate discount using values from API'
                price = slot.get('price')
                status = slot.get('status')
    
                print('\n')
                print(status)
                print(fullPrice)
                print(Discount)
                print(price)
                print('\n')
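
    For the discount itself, here is one possible approach, though it is purely a sketch: the listing name usually contains the face value of the gift card (e.g. "$25 Starbucks Gift Card"), and the price field looks like it is in cents. Both of those are assumptions I'm making from your output rather than documented behaviour of the API, so verify them against the real JSON before relying on this:

    import re


    def estimate_discount(slot):
        # Sketch only: assumes the face value can be parsed out of the
        # listing name (e.g. '$25 Starbucks Gift Card') and that 'price'
        # is in cents. Verify both against the actual API response.
        name = slot.get('name', '')
        price_cents = slot.get('price')

        match = re.search(r'\$(\d+(?:\.\d+)?)', name)
        if not match or not price_cents:
            return None

        face_value = float(match.group(1))
        price_dollars = price_cents / 100.0
        if face_value <= 0:
            return None

        # Discount as a percentage of the face value
        return round((face_value - price_dollars) / face_value * 100, 1)

    Inside the loop you could then replace the TODO with Discount = estimate_discount(slot). Also note that the spider above only prints the values; if you want to export them with Scrapy's feed exports (e.g. -o items.json), yield a dict from parse() instead of printing.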