[python] [selenium-webdriver] [web-scraping] [scrapy] [scrapy-splash]

Dynamically web scraping a website with multiple categories and pages using Scrapy


I am attempting to scrape data from an online shop. I am currently able to scrape one page of one category from the store using the website's data API. However, I want to scrape all categories and all products, and to use the category name as a new column value so that it can serve as a categorical identifier. I am unsure whether I should use Selenium or Splash for this, as the next page button does not have an href and the page updates dynamically, showing 50/100/150/200 products at a time depending on the chosen filter options.

The website link is: https://eshop.nomin.mn/

The list of categories is as shown in the screenshot below (screenshot omitted).

The categories have different URLs, and each has its own data API. The next page button does not have an href, and the products are refreshed/updated dynamically. The HTML for the next page button is:

<a class="pagination-nextLinkNo-HSU" tabindex="0" role="button" aria-disabled="false" aria-label="Next page" rel="next"><img src="/rightArrow-jkZ.png" alt="1"></a>

Basically, I want to scrape the name, price, and description of every product on the website, across all categories (with the category inserted as a categorical ID value). I have tried to emulate other posts, but with no success. Any and all help is greatly appreciated. Thank you very much.

My current code for scraping the foods page is:

import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess  # needed by the __main__ driver below
from datetime import datetime

dt_today = datetime.now().strftime('%Y%m%d')
filename = dt_today + ' Nomin CPI Foods Data'

class NominCPIFoodsSpider(scrapy.Spider):
    name = 'nomin_cpi_foods'
    allowed_domains = ['eshop.nomin.mn']  # domains only, not full URLs
    custom_settings = {
        "FEEDS": {
            f'{filename}.csv': {
                'format': 'csv',
                'overwrite': True}}
    }

    # function used for start url
    def start_requests(self):
        urls = ['https://eshop.nomin.mn/graphql?query=query+category($pageSize:Int!$currentPage:Int!$filters:ProductAttributeFilterInput!$sort:ProductAttributeSortInput){products(pageSize:$pageSize+currentPage:$currentPage+filter:$filters+sort:$sort){items{id+name+sku+brand+salable_qty+brand_name+c21_available+c21_business_type+c21_reference+c21_street+c21_area+c21_bed_room+mp_daily_deal{created_at+date_from+date_to+deal_id+deal_price+remaining_time+deal_qty+discount_label+is_featured+product_id+product_name+product_sku+sale_qty+status+store_ids+updated_at+__typename}new_to_date+short_description{html+__typename}productAttributes{name+value+__typename}price{regularPrice{amount{currency+value+__typename}__typename}__typename}special_price+special_to_date+thumbnail{file_small+url+__typename}url_key+url_suffix+mp_label_data{enabled+name+priority+label_template+label_image+to_date+__typename}...on+ConfigurableProduct{variants{product{sku+special_price+price{regularPrice{amount{currency+value+__typename}__typename}__typename}__typename}__typename}__typename}__typename}page_info{total_pages+__typename}total_count+__typename}}&operationName=category&variables={"currentPage":1,"id":24175,"filters":{"category_id":{"in":"24175"}},"pageSize":50,"sort":{"position":"DESC"}}']
        for url in urls:
            yield Request(url, self.parse)

    # function to parse
    def parse(self, response, **kwargs):
        data = response.json()
        print(data.keys())
        for item in data['data']["products"]["items"]:
            yield {
                "name": item["name"],
                "price": item["price"]["regularPrice"]["amount"]["value"],
                "description": item["short_description"]["html"]
            }

        # handles pagination
        next_url = response.css("nav.custom-pagination > a.next::attr(href)").get()
        if next_url:
            yield scrapy.Request(next_url, self.parse)

# main driver
if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(NominCPIFoodsSpider)
    process.start()

Solution

  • All you have to do is deconstruct the API URL and reverse engineer the endpoint.

    For example, if you were to visit page 2 of that same website, you would notice that it sends a different request to fetch the data for the items listed on the second page. You can then compare the URLs and determine how to reconstruct them for the rest of the pages, as in the sketch below.
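
    As a minimal sketch of that deconstruction (URL shortened to the parts that matter; the full query body appears in BASE_URL below), the variables can be pulled out of the query string and inspected as JSON:

    import json
    from urllib.parse import urlsplit, parse_qs

    # URL captured from the browser's network tab (GraphQL query body shortened)
    captured_url = (
        'https://eshop.nomin.mn/graphql?query=query+category(...)'
        '&operationName=category'
        '&variables={"currentPage":2,"id":24175,"filters":'
        '{"category_id":{"in":"24175"}},"pageSize":50,"sort":{"position":"DESC"}}'
    )

    # parse_qs splits the query string into its parameters and percent-decodes them
    params = parse_qs(urlsplit(captured_url).query)
    variables = json.loads(params["variables"][0])
    print(variables["currentPage"])  # -> 2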

    So for this particular API, it looks like all of the variables are contained at the end of the URL, specifically this part:

    {
        "currentPage": 1,   # adding 1 to this variable gets you the next page
        "id": 24175,        # changing this value changes the category of items
        "filters": {
            "category_id": {
                "in": "24175"   # this needs to change for other categories too
            }
        },
        "pageSize": 50,     # you can adjust the number of results per page with this
        "sort": {
            "position": "DESC"
        }
    }

    So all you need to do is change the currentPage field of the dictionary and use the resulting URLs as your Scrapy requests. One way to build them programmatically is sketched below.
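
    Rather than concatenating strings by hand, a safer sketch is to serialize the variables dict with json.dumps; separators=(",", ":") reproduces the compact JSON the site itself sends. BASE_URL is the constant defined in the spider below:

    import json

    def page_url(category_id: int, page: int, page_size: int = 50) -> str:
        """Build the GraphQL URL for one page of one category."""
        variables = {
            "currentPage": page,
            "id": category_id,
            "filters": {"category_id": {"in": str(category_id)}},
            "pageSize": page_size,
            "sort": {"position": "DESC"},
        }
        return BASE_URL + json.dumps(variables, separators=(",", ":"))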

    import scrapy
    from scrapy import Request
    from scrapy.crawler import CrawlerProcess  # needed by the __main__ driver below
    from datetime import datetime
    
    BASE_URL = "https://eshop.nomin.mn/graphql?query=query+category($pageSize:Int!$currentPage:Int!$filters:ProductAttributeFilterInput!$sort:ProductAttributeSortInput){products(pageSize:$pageSize+currentPage:$currentPage+filter:$filters+sort:$sort){items{id+name+sku+brand+salable_qty+brand_name+c21_available+c21_business_type+c21_reference+c21_street+c21_area+c21_bed_room+mp_daily_deal{created_at+date_from+date_to+deal_id+deal_price+remaining_time+deal_qty+discount_label+is_featured+product_id+product_name+product_sku+sale_qty+status+store_ids+updated_at+__typename}new_to_date+short_description{html+__typename}productAttributes{name+value+__typename}price{regularPrice{amount{currency+value+__typename}__typename}__typename}special_price+special_to_date+thumbnail{file_small+url+__typename}url_key+url_suffix+mp_label_data{enabled+name+priority+label_template+label_image+to_date+__typename}...on+ConfigurableProduct{variants{product{sku+special_price+price{regularPrice{amount{currency+value+__typename}__typename}__typename}__typename}__typename}__typename}__typename}page_info{total_pages+__typename}total_count+__typename}}&operationName=category&variables="
    
    
    dt_today = datetime.now().strftime('%Y%m%d')
    filename = dt_today + ' Nomin CPI Foods Data'
    
    class NominCPIFoodsSpider(scrapy.Spider):
        name = 'nomin_cpi_foods'
        allowed_domains = ['eshop.nomin.mn']  # domains only, not full URLs
        custom_settings = {
            "FEEDS": {
                f'{filename}.csv': {
                    'format': 'csv',
                    'overwrite': True}}
        }
    
        # generate one request per page of the foods category;
        # currentPage starts at 1, matching the URL captured from the site
        def start_requests(self):
            for i in range(1, 51):
                url = BASE_URL + '{"currentPage":' + str(i) + ',"id":24175,"filters":{"category_id":{"in":"24175"}},"pageSize":50,"sort":{"position":"DESC"}}'
                yield Request(url, self.parse)
    
        # function to parse
        def parse(self, response, **kwargs):
            data = response.json()
            print(data.keys())
            for item in data['data']["products"]["items"]:
                yield {
                    "name": item["name"],
                    "price": item["price"]["regularPrice"]["amount"]["value"],
                    "description": item["short_description"]["html"]
                }
    
            # pagination is handled in start_requests by iterating currentPage;
            # the response is JSON, so there is no "next" link to follow here.
            # If you prefer not to hard-code 50 pages, the total page count is
            # available at data["data"]["products"]["page_info"]["total_pages"].
    
    
    if __name__ == "__main__":
        process = CrawlerProcess()
        process.crawl(NominCPIFoodsSpider)
        process.start()
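
    To extend this to every category and record the category name as a column (the original goal), one option is to map category ids to names and pass the name to the callback through cb_kwargs. This is a hypothetical sketch reusing the imports and BASE_URL from the spider above; only id 24175 (foods) is confirmed, so the other ids and names would need to be captured from the site's GraphQL traffic in the same way:

    import json

    # Only 24175 is taken from the question; add real ids captured
    # from the network tab for the other categories.
    CATEGORIES = {
        24175: "foods",
        # 99999: "electronics",  # placeholder, not a verified id
    }

    class NominAllCategoriesSpider(scrapy.Spider):
        name = 'nomin_all_categories'
        allowed_domains = ['eshop.nomin.mn']

        def start_requests(self):
            for category_id, category_name in CATEGORIES.items():
                for page in range(1, 51):
                    variables = {
                        "currentPage": page,
                        "id": category_id,
                        "filters": {"category_id": {"in": str(category_id)}},
                        "pageSize": 50,
                        "sort": {"position": "DESC"},
                    }
                    url = BASE_URL + json.dumps(variables, separators=(",", ":"))
                    # cb_kwargs forwards the category name to parse()
                    yield Request(url, self.parse, cb_kwargs={"category": category_name})

        def parse(self, response, category=None, **kwargs):
            for item in response.json()["data"]["products"]["items"]:
                yield {
                    "category": category,  # categorical identifier column
                    "name": item["name"],
                    "price": item["price"]["regularPrice"]["amount"]["value"],
                    "description": item["short_description"]["html"],
                }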