python, web-scraping, scrapy, splash, scrapinghub

How to scrape data from a website that uses JavaScript and pagination


I need to scrape data from this website: "https://www.forever21.com/us/shop/catalog/category/f21/sale#pageno=1&pageSize=120&filter=price:0,250&sort=5", but I cannot retrieve all the data. The page has pagination and uses JavaScript as well.

Any idea how I can scrape all the items? Here's my code:

def parse_2(self, response):


    for product_item_forever in response.css('div.pi_container'):
        item = GpdealsSpiderItem_f21()

        f21_title = product_item_forever.css('p.p_name::text').extract_first()
        f21_regular_price = product_item_forever.css('span.p_old_price::text').extract_first()
        f21_sale_price = product_item_forever.css('span.p_sale.t_pink::text').extract_first()
        f21_photo_url = product_item_forever.css('img::attr(data-original)').extract_first()
        f21_description_url = product_item_forever.css('a.item_slider.product_link::attr(href)').extract_first()

        item['f21_title'] = f21_title 
        item['f21_regular_price'] = f21_regular_price 
        item['f21_sale_price'] = f21_sale_price 
        item['f21_photo_url'] = f21_photo_url 
        item['f21_description_url'] = f21_description_url 

        yield item

Please help. Thank you!


Solution

  • One of the first steps in a web scraping project should be looking for an API that the website uses to get the data. Not only does this save you parsing HTML, using an API also saves the provider's bandwidth and server load. To look for an API, open your browser's developer tools and look for XHR requests in the Network tab. In this case, the website makes POST requests to this URL:

    https://www.forever21.com/eu/shop/Catalog/GetProducts

    You can then simulate the XHR request in Scrapy to get the data in JSON format. Here's the code for the spider:

    # -*- coding: utf-8 -*-
    import json
    import scrapy
    
    class Forever21Spider(scrapy.Spider):
        name = 'forever21'
    
        url = 'https://www.forever21.com/eu/shop/Catalog/GetProducts'
        payload = {
            'brand': 'f21',
            'category': 'sale',
            'page': {'pageSize': 60},
            'filter': {
                'price': {'minPrice': 0, 'maxPrice': 250}
            },
            'sort': {'sortType': '5'}
        }
    
        def start_requests(self):
            # scrape the first page
            # replace the nested 'page' dict too, so the shared payload
            # template is not mutated by a shallow copy
            payload = {**self.payload, 'page': {**self.payload['page'], 'pageNo': 1}}
            yield scrapy.Request(
                self.url, method='POST', body=json.dumps(payload),
                headers={'X-Requested-With': 'XMLHttpRequest',
                         'Content-Type': 'application/json; charset=UTF-8'},
                callback=self.parse, meta={'pageNo': 1}
            )
    
        def parse(self, response):
            # parse the JSON response and extract the data
            data = json.loads(response.text)
            for product in data['CatalogProducts']:
                item = {
                    'title': product['DisplayName'],
                    'regular_price': product['OriginalPrice'],
                    'sale_price': product['ListPrice'],
                    'photo_url': 'https://www.forever21.com/images/default_330/%s' % product['ImageFilename'],
                    'description_url': product['ProductShareLinkUrl']
                }
                yield item
    
            # simulate pagination if we are not at the end
            if len(data['CatalogProducts']) == self.payload['page']['pageSize']:
                # replace the nested 'page' dict too, so the shared payload
                # template is not mutated by a shallow copy
                payload = {**self.payload,
                           'page': {**self.payload['page'],
                                    'pageNo': response.meta['pageNo'] + 1}}
                yield scrapy.Request(
                    self.url, method='POST', body=json.dumps(payload),
                    headers={'X-Requested-With': 'XMLHttpRequest',
                             'Content-Type': 'application/json; charset=UTF-8'},
                    callback=self.parse, meta={'pageNo': payload['page']['pageNo']}
                )
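
    The spider stops paginating when a page comes back with fewer products than `pageSize`. That stop condition can be sketched on its own, independent of Scrapy (`has_next_page` is a hypothetical helper, not part of the spider above):

    ```python
    def has_next_page(products, page_size):
        """A full page suggests more results may follow; a short page is the last one."""
        return len(products) == page_size

    # a full page of 60 items (the configured pageSize) triggers another request
    print(has_next_page(list(range(60)), 60))  # True
    # a short final page ends the crawl
    print(has_next_page(list(range(23)), 60))  # False
    ```

    One caveat of this scheme: if the total product count happens to be an exact multiple of `pageSize`, the spider makes one extra request whose `CatalogProducts` list is empty; the `for` loop then yields nothing and no further request is scheduled, so the crawl still terminates cleanly.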