I have a website that's need to scrape the data "https://www.forever21.com/us/shop/catalog/category/f21/sale#pageno=1&pageSize=120&filter=price:0,250&sort=5" but I cannot retrieve all the data it also has pagination and Its uses javascript as well.
any idea on how I will scrape all the items? Here's my code
def parse_2(self, response):
for product_item_forever in response.css('div.pi_container'):
item = GpdealsSpiderItem_f21()
f21_title = product_item_forever.css('p.p_name::text').extract_first()
f21_regular_price = product_item_forever.css('span.p_old_price::text').extract_first()
f21_sale_price = product_item_forever.css('span.p_sale.t_pink::text').extract_first()
f21_photo_url = product_item_forever.css('img::attr(data-original)').extract_first()
f21_description_url = product_item_forever.css('a.item_slider.product_link::attr(href)').extract_first()
item['f21_title'] = f21_title
item['f21_regular_price'] = f21_regular_price
item['f21_sale_price'] = f21_sale_price
item['f21_photo_url'] = f21_photo_url
item['f21_description_url'] = f21_description_url
yield item
Please help Thank you
One of the first steps in web scraping project should be looking for an API that the website uses to get the data. Not only does it save you parsing HTML, using an API also saves provider's bandwidth and server load. To look for an API, use your browser's developer tools and look for XHR requests in the network tab. In your case, the web site makes POST requests to this URL:
https://www.forever21.com/eu/shop/Catalog/GetProducts
You can then simulate the XHR request in Scrapy to get the data in JSON format. Here's the code for the spider:
# -*- coding: utf-8 -*-
import json
import scrapy
class Forever21Spider(scrapy.Spider):
name = 'forever21'
url = 'https://www.forever21.com/eu/shop/Catalog/GetProducts'
payload = {
'brand': 'f21',
'category': 'sale',
'page': {'pageSize': 60},
'filter': {
'price': {'minPrice': 0, 'maxPrice': 250}
},
'sort': {'sortType': '5'}
}
def start_requests(self):
# scrape the first page
payload = self.payload.copy()
payload['page']['pageNo'] = 1
yield scrapy.Request(
self.url, method='POST', body=json.dumps(payload),
headers={'X-Requested-With': 'XMLHttpRequest',
'Content-Type': 'application/json; charset=UTF-8'},
callback=self.parse, meta={'pageNo': 1}
)
def parse(self, response):
# parse the JSON response and extract the data
data = json.loads(response.text)
for product in data['CatalogProducts']:
item = {
'title': product['DisplayName'],
'regular_price': product['OriginalPrice'],
'sale_price': product['ListPrice'],
'photo_url': 'https://www.forever21.com/images/default_330/%s' % product['ImageFilename'],
'description_url': product['ProductShareLinkUrl']
}
yield item
# simulate pagination if we are not at the end
if len(data['CatalogProducts']) == self.payload['page']['pageSize']:
payload = self.payload.copy()
payload['page']['pageNo'] = response.meta['pageNo'] + 1
yield scrapy.Request(
self.url, method='POST', body=json.dumps(payload),
headers={'X-Requested-With': 'XMLHttpRequest',
'Content-Type': 'application/json; charset=UTF-8'},
callback=self.parse, meta={'pageNo': payload['page']['pageNo']}
)