
Scrapy splash does not load dynamic content


I am using Splash with Scrapy to load dynamically rendered content on a page, but it does not work as expected.

In settings.py I set these variables:

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
}
SPLASH_URL = "http://localhost:8050"
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
SPLASH_COOKIES_DEBUG = False

The spider:

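from scrapy_splash import SplashRequest

def start_requests(self):
    urls = [
        "https://callmeduy.com/san-pham/"
    ]
    for url in urls:
        yield SplashRequest(url=url,
                            # endpoint='render.html',
                            callback=self.parse,
                            args={
                                # Seconds Splash waits after the page loads before returning the HTML.
                                'wait': 5
                            })

def parse(self, response):
    body = response.xpath("//body").get()
    print(body)
    # Save the rendered body for inspection; the context manager closes the file.
    with open('res.html', 'w', encoding='utf-8') as f:
        f.write(body)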

The dynamic content has not been loaded. Here is the response body.

Please help if anybody knows.


Solution

  • I couldn't get this to work with Splash. Probably because I'm not very familiar with it.

    However, I have a working solution that uses Scrapy and Playwright.

    This is requirements.txt (after installing these, run "playwright install chromium" once to download the browser binary):

    Scrapy==2.11.2
    playwright==1.44.0
    scrapy-playwright==0.0.35
    beautifulsoup4==4.12.3
    

    In settings.py:

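    # scrapy-playwright needs the asyncio-based Twisted reactor.
    TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
    DOWNLOAD_HANDLERS = {
      "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
      "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    }
    # Navigation timeout in milliseconds (30 seconds).
    PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 30 * 1000
    PLAYWRIGHT_BROWSER_TYPE = "chromium"
    PLAYWRIGHT_LAUNCH_OPTIONS = {
      # False opens a visible browser window, which helps while debugging.
      "headless": False
    }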
    

    And the spider:

    import asyncio

    import scrapy
    from bs4 import BeautifulSoup
    
    
    class CallmeduySpider(scrapy.Spider):
        name = "callmeduy"
        allowed_domains = ["callmeduy.com"]
    
        def start_requests(self):
            url = "https://callmeduy.com/san-pham"
            yield scrapy.Request(
                url,
                meta=dict(
                    playwright=True,
                    # Expose the Playwright page object to the callback.
                    playwright_include_page=True,
                ),
            )
    
        async def parse(self, response):
            page = response.meta["playwright_page"]
    
            while True:
                soup = BeautifulSoup(await page.content(), "lxml")
                # Skeleton placeholders stay in the DOM until the product cards have rendered.
                wait = soup.select_one(".card-title.h5 > span span.react-loading-skeleton")
    
                if not wait:
                    # Rendering is done, so release the browser page.
                    await page.close()
                    self.logger.debug("====================================================")
                    for card in soup.select(".jss23 .row .col-12"):
                        link = card.select_one("a.jss29")
                        title = card.select_one(".card-title.h5 > span.jss31")
    
                        self.logger.debug(title.get_text())
                        self.logger.debug(link["href"])
    
                        # TODO: Probably yield another scrapy.Request() here for each product?
                    self.logger.debug("====================================================")
    
                    return
                else:
                    self.logger.info("Waiting for skeleton to load.")
                    # Non-blocking pause; time.sleep() would stall the event loop.
                    await asyncio.sleep(5)
    

    The key was to wait until the page content is fully rendered, that is, until the loading-skeleton placeholders have been replaced by the product cards. Polling like this is a rather brute-force way to do it (and could probably be done a lot more elegantly!), but it's a pragmatic solution.
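
    One way to tidy up the polling is to move it into a small helper that also gives up after a timeout, so the spider cannot loop forever if the skeleton never disappears. The following is only a sketch under assumptions: wait_until, FakePage and demo are hypothetical names that are not part of Scrapy or Playwright, and FakePage merely imitates the page.content() call so the helper can be tried without a browser.

```python
import asyncio
import time


async def wait_until(predicate, timeout=30.0, interval=0.5):
    """Await predicate() repeatedly until it returns a truthy value.

    Returns True once the predicate succeeds, or False if `timeout`
    seconds pass first. `predicate` must be an async callable.
    """
    deadline = time.monotonic() + timeout
    while True:
        if await predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        # Non-blocking pause so other requests keep being processed.
        await asyncio.sleep(interval)


class FakePage:
    """Hypothetical stand-in for a Playwright page (for illustration only).

    The first two calls return skeleton markup, later calls return content.
    """

    def __init__(self):
        self.calls = 0

    async def content(self):
        self.calls += 1
        if self.calls < 3:
            return '<span class="react-loading-skeleton"></span>'
        return '<span class="jss31">Skin Recovery Cream</span>'


async def demo():
    page = FakePage()

    async def skeleton_gone():
        # Same idea as the spider: rendered once no skeleton markup is left.
        return "react-loading-skeleton" not in await page.content()

    rendered = await wait_until(skeleton_gone, timeout=5.0, interval=0.01)
    return rendered, page.calls
```

    In the spider, the predicate would call the real page.content() and the callback would log a warning and return when wait_until gives False. scrapy-playwright can also queue page calls such as wait_for_selector through the playwright_page_methods request meta key, which may remove the need for a manual loop altogether.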

    Sample output:

    2024-06-17 09:25:41 [callmeduy] DEBUG: ====================================================
    2024-06-17 09:25:41 [callmeduy] DEBUG: Sữa Chống Nắng Bí Đao Cocoon
    2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1451
    2024-06-17 09:25:41 [callmeduy] DEBUG: COSRX The Hyaluronic Acid 3...
    2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1450
    2024-06-17 09:25:41 [callmeduy] DEBUG: Kem chống nắng Skin1004 Mad...
    2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1449
    2024-06-17 09:25:41 [callmeduy] DEBUG: COSRX The Niacinamide 15% S...
    2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1448
    2024-06-17 09:25:41 [callmeduy] DEBUG: Nacific Origin Red Salicyli...
    2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1446
    2024-06-17 09:25:41 [callmeduy] DEBUG: Skin1004 Madagascar Centell...
    2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1445
    2024-06-17 09:25:41 [callmeduy] DEBUG: ACNACARE GEL Mega We Care
    2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1444
    2024-06-17 09:25:41 [callmeduy] DEBUG: Viên uống ACNACARE Mega We ...
    2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1443
    2024-06-17 09:25:41 [callmeduy] DEBUG: Serum NNO VITE Mega We Care
    2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1442
    2024-06-17 09:25:41 [callmeduy] DEBUG: Neutrogena Hydro Boost Acti...
    2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1441
    2024-06-17 09:25:41 [callmeduy] DEBUG: Neutrogena Hydroboost Clean...
    2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1440
    2024-06-17 09:25:41 [callmeduy] DEBUG: Skin Recovery Cream
    2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1439
    2024-06-17 09:25:41 [callmeduy] DEBUG: ====================================================
    

    As noted in a comment in the code, you will probably want to follow those links to the actual product pages (or perhaps not?). You'll also need to handle the pagination of the results on the index page.

    However, this code should get you started with something that at least yields results.
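
    As a starting point for following the product links: the hrefs in the sample output are relative (for example /san-pham/1451), so they have to be resolved against the index URL before they can be requested. The sketch below does this with the standard library only; absolute_product_urls and INDEX_URL are hypothetical names chosen for illustration. It does not cover pagination, because how the site pages its results is not shown here.

```python
from urllib.parse import urljoin

# Index page used by the spider.
INDEX_URL = "https://callmeduy.com/san-pham"


def absolute_product_urls(hrefs, base_url=INDEX_URL):
    """Resolve relative hrefs against base_url, dropping duplicates but keeping order."""
    seen = set()
    urls = []
    for href in hrefs:
        url = urljoin(base_url, href)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

    Inside a Scrapy callback, response.urljoin(href) or response.follow(href, callback=...) performs the same resolution, so the helper is mainly useful for checking the links outside a crawl.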