I'm trying to extract the nested URL from the href here so I can build a "next page" XPath selector for my spider, but I can't figure out the right location path to it.
I've been testing my code in the Scrapy shell environment.
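For example, this is roughly what I've been running in the shell; the selector below is my current guess at the next-page link, and it's the part I can't get right:

scrapy shell 'https://gameflip.com/shop/gift-cards/starbucks?limit=36&platform=starbucks&accept_currency=USD&status=onsale'
>>> # my current guess at the next-page link selector
>>> response.xpath('//*[@class="btn"]/@href').extract_first()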
Here's my spider source code (Python 3):
import scrapy


class StarbucksSpider(scrapy.Spider):
    name = 'starbucks'
    allowed_domains = ['gameflip.com/shop/gift-cards/starbucks']
    start_urls = ['https://gameflip.com/shop/gift-cards/starbucks?limit=36&platform=starbucks&accept_currency=USD&status=onsale']

    def parse(self, response):
        slots = response.xpath('//*[@class="listing-detail view-grid col-6 col-md-4 col-lg-3 col-xl-2"]')
        for slot in slots:
            fullPrice = slot.xpath('.//*[@class="col-12 description normal"]/text()').extract_first()
            Discount = slot.xpath('.//*[@class="badge badge-success listing-discount"]/text()').extract_first()
            price = slot.xpath('.//*[@class="money"]/text()').extract_first()
            status = slot.xpath('.//*[@alt="sold"]/@alt').extract_first()
            print('\n')
            print(status)
            print(fullPrice)
            print(Discount)
            print(price)
            print('\n')

        next_PageUrl = response.xpath('//*[@class="btn"]/@href').extract_first()
        absolute_next_page_url = response.urljoin(next_PageUrl)
        yield scrapy.Request(absolute_next_page_url)
Please don't hesitate to ask me questions if it helps you answer. Any help is appreciated ;D
Thank you for your time and answers!
As mentioned in the comment from jwjhdev, the content is coming from an API.
You can see this in the network tab in the dev tools of your browser when reloading the page.
The URL of the API can be modified to give you more or fewer objects per page.
If we increase the limit to 150, your case only needs one page of data, which means a single request: https://production-gameflip.fingershock.com/api/v1/listing?limit=150&kind=item&category=GIFTCARD&platform=starbucks&status=onsale&sort=_score:desc,shipping_within_days:asc,created:desc&accept_currency=USD
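If you want to tweak the limit (or any other parameter) programmatically, a small sketch like this works; the parameter names and values are simply taken from the URL above:

from urllib.parse import urlencode

API_BASE = 'https://production-gameflip.fingershock.com/api/v1/listing'

def build_listing_url(limit=150):
    # Query parameters copied from the API URL above; only the limit varies here
    params = {
        'limit': limit,
        'kind': 'item',
        'category': 'GIFTCARD',
        'platform': 'starbucks',
        'status': 'onsale',
        'sort': '_score:desc,shipping_within_days:asc,created:desc',
        'accept_currency': 'USD',
    }
    # keep ':' and ',' unescaped so the sort value stays readable
    return f'{API_BASE}?{urlencode(params, safe=":,")}'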
So instead of scraping the data out of the webpage with XPaths, we can query the API directly and get structured JSON that is easier to manipulate.
I've modified your spider code slightly below to show how we can get data from the API, using the API URL above as one of the start_urls.
However, I noticed that the discount isn't in the response, so I believe you will have to calculate it in your code (see the sketch after the spider for one way that could look).
import json

import scrapy


class StarbucksSpider(scrapy.Spider):
    name = 'starbucks'
    start_urls = [
        'https://production-gameflip.fingershock.com/api/v1/listing?limit=150&kind=item&category=GIFTCARD&platform=starbucks&status=onsale&sort=_score:desc,shipping_within_days:asc,created:desc&accept_currency=USD'
    ]

    def parse(self, response):
        # The API returns JSON; the listings live under the "data" key
        api_data = json.loads(response.text)
        slots = api_data.get('data')
        for slot in slots:
            fullPrice = slot.get('name')
            # I couldn't find the discount in the JSON
            Discount = 'TODO - Calculate discount using values from API'
            price = slot.get('price')
            status = slot.get('status')
            print('\n')
            print(status)
            print(fullPrice)
            print(Discount)
            print(price)
            print('\n')
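As for the discount, here is one possible approach. It's purely a sketch and rests on two assumptions I haven't verified against the API: that the listing's name contains the card's face value (e.g. "$10 Starbucks Gift Card") and that price is the sale price in cents.

import re

def calculate_discount(slot):
    # Assumption: the face value appears in the listing name as a dollar amount
    match = re.search(r'\$(\d+(?:\.\d+)?)', slot.get('name') or '')
    # Assumption: 'price' is the sale price in cents
    if not match or not slot.get('price'):
        return None
    face_value = float(match.group(1))
    sale_price = slot['price'] / 100.0
    return round((face_value - sale_price) / face_value * 100, 2)

You could then yield a dict from parse (e.g. yield {'status': status, 'name': fullPrice, 'price': price, 'discount': calculate_discount(slot)}) instead of printing, so the items show up in Scrapy's feed exports.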