web-scraping scrapy playwright

Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) after a certain number of pages are crawled


I am trying to crawl a list of provided URLs using Scrapy-Playwright, but I have run into strange behavior. It starts crawling nicely, yet every time it stops after a certain number of pages have been crawled and then keeps logging this:

2024-09-23 08:30:01 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

No matter which set of URLs I provide, it always freezes like this at the same point.

Following is my spider (my_spider.py):

import scrapy
from faker import Faker
from scrapy.spiders import Rule

fake = Faker()

class MySpider(scrapy.Spider):
    name = 'my_spider'


    def start_requests(self):
        # Define the initial URL(s) to scrape
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                headers={'User-Agent': fake.user_agent()},
                errback=self.errback,
                meta=dict(
                    playwright=True,
                    playwright_include_page=True,
                ),
            )

    def __init__(self):
        # List of URLs to start scraping
        self.start_urls = ['https://dummy0, ...', 'https://dummy100, ...']  # a list of more than 16 URLs

        self._rules = [Rule(callback=self.parse)]

    def parse(self, response):
        page_title = response.xpath('//title/text()').get()
        yield {
            'url': response.url,
            'title': page_title
        }

    async def errback(self, failure):
        page = failure.request.meta["playwright_page"]
        await page.close()

In settings.py (as I am using Playwright):

REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 60 * 1000  # 60 seconds

Why is this happening?


Solution

  • The reason was that the process was running out of its allocated resources. Note that the default CONCURRENT_REQUESTS value (found in settings.py) is also 16, which is why the crawl froze after 16 requests: the earlier requests were never releasing their Playwright pages. This is how it was fixed:

    I am still keeping this at 16:

    CONCURRENT_REQUESTS = 16 # in settings.py
    

    But now I am releasing the resources after use (in my_spider.py):

    import scrapy
    from faker import Faker
    from scrapy.spiders import Rule
    import asyncio
    
    fake = Faker()
    
    class MySpider(scrapy.Spider):
        name = 'my_spider'
    
    
        def start_requests(self):
            # Define the initial URL(s) to scrape
            for url in self.start_urls:
                yield scrapy.Request(
                    url,
                    headers={'User-Agent': fake.user_agent()},
                    # errback goes on the Request itself (not inside meta),
                    # otherwise it is never invoked
                    errback=self.errback,
                    meta=dict(
                        playwright=True,
                        playwright_include_page=True,
                    ),
                )
    
        def __init__(self):
            # List of URLs to start scraping
            self.start_urls = ['https://dummy0, ...', 'https://dummy100, ...']  # a list of more than 16 URLs

            self._rules = [Rule(callback=self.parse)]
    
        def parse(self, response):
            page = response.meta.get("playwright_page")
            # This is where you'll extract the data from the crawled pages.
            # As an example, we'll just yield the URL and title of each page.
            try:
                page_title = response.xpath('//title/text()').get()
                yield {
                    'url': response.url,
                    'title': page_title
                }
            finally:
                # Ensure the Playwright page is closed after processing;
                # parse() is a regular generator, so schedule the close coroutine
                # on the asyncio event loop instead of awaiting it
                if page:
                    asyncio.ensure_future(page.close())
    
    
        async def errback(self, failure):
            # use .get() so a missing page does not raise KeyError before the check
            page = failure.request.meta.get("playwright_page")
            if page:
                await page.close()
    

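    If you prefer not to schedule the close with asyncio.ensure_future, Scrapy also accepts coroutine callbacks, so the page can be closed with a plain await inside parse. A minimal sketch of that variant (only parse changes; the fields mirror the code above, and this is an alternative, not part of the original fix):

    async def parse(self, response):
        # playwright_include_page=True hands us an open page with every response;
        # closing it releases the slot so queued requests can proceed
        page = response.meta["playwright_page"]
        try:
            page_title = response.xpath('//title/text()').get()
        finally:
            await page.close()
        return {
            'url': response.url,
            'title': page_title
        }
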
    Also, these are the additional settings I added in settings.py, in case you are interested:

    # Increase concurrency
    CONCURRENT_REQUESTS = 16
    CONCURRENT_REQUESTS_PER_DOMAIN = 8
    
    # Increase timeouts
    PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 90 * 1000  # 90 seconds
    DOWNLOAD_TIMEOUT = 120  # 120 seconds
    
    # Retry failed requests
    RETRY_ENABLED = True
    RETRY_TIMES = 5
    
    # Max Playwright contexts
    PLAYWRIGHT_MAX_CONTEXTS = 4
    
    # Logging level
    LOG_LEVEL = 'DEBUG'
    
    # Playwright download handlers
    DOWNLOAD_HANDLERS = {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    }
    
    # Other standard project settings
    REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
    TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
    FEED_EXPORT_ENCODING = "utf-8"
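
    If resource usage is still tight, scrapy-playwright also exposes PLAYWRIGHT_MAX_PAGES_PER_CONTEXT, which caps how many pages a single browser context keeps open (it defaults to the value of CONCURRENT_REQUESTS). A small sketch; the value 8 is only an illustrative choice, not something from the setup above:

    # Cap concurrent pages per browser context (illustrative value)
    PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 8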