I am trying to crawl a list of provided URLs using Scrapy-Playwright, but I ran into some strange behavior: the crawl starts nicely, then every time, after a certain number of pages have been crawled, it stops and keeps printing log lines like this:
2024-09-23 08:30:01 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
No matter which set of URLs I provide, it always freezes like this at the same point.
Following is my spider (my_spider.py):
import scrapy
from faker import Faker
from scrapy.spiders import Rule

fake = Faker()

class MySpider(scrapy.Spider):
    name = 'my_spider'

    def __init__(self):
        super().__init__()
        # List of URLs to start scraping
        self.start_urls = ['https://dummy0, ...', 'https://dummy100, ...']  # some list of URLs, more than 16
        self._rules = [Rule(callback=self.parse)]

    def start_requests(self):
        # Define the initial URL(s) to scrape
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                headers={'User-Agent': fake.user_agent()},
                meta={
                    'playwright': True,
                    'playwright_include_page': True,
                },
                # errback is a Request argument, not a meta key,
                # otherwise Scrapy never calls it
                errback=self.errback,
            )

    def parse(self, response):
        page_title = response.xpath('//title/text()').get()
        yield {
            'url': response.url,
            'title': page_title,
        }

    async def errback(self, failure):
        page = failure.request.meta["playwright_page"]
        await page.close()
In settings.py (as I am using Playwright):
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 60 * 1000  # 60 seconds
Why is this happening?
The reason was that I was running out of allocated resources for this process. If you look carefully, the default value of CONCURRENT_REQUESTS (found in settings.py) is also 16, which is exactly where the crawl froze: with playwright_include_page=True, each request received a Playwright page that was never closed, so the first 16 requests never released their concurrency slots.
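If you want to confirm this diagnosis before touching any code (my own suggestion, not part of the original fix), lower the concurrency limit and check that the freeze point moves with it; the value 4 below is an arbitrary test value:

CONCURRENT_REQUESTS = 4  # diagnostic only: if the crawl now stalls after 4 pages instead of 16, the slots are being exhausted by requests that never release their resources

Following is the way it is fixed: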
I am still keeping this at 16:
CONCURRENT_REQUESTS = 16 # in settings.py
But I am now releasing the resources after use (in my_spider.py):
import asyncio

import scrapy
from faker import Faker
from scrapy.spiders import Rule

fake = Faker()

class MySpider(scrapy.Spider):
    name = 'my_spider'

    def __init__(self):
        super().__init__()
        # List of URLs to start scraping
        self.start_urls = ['https://dummy0, ...', 'https://dummy100, ...']  # some list of URLs, more than 16
        self._rules = [Rule(callback=self.parse)]

    def start_requests(self):
        # Define the initial URL(s) to scrape
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                headers={'User-Agent': fake.user_agent()},
                meta={
                    'playwright': True,
                    'playwright_include_page': True,
                },
                # errback is a Request argument, not a meta key
                errback=self.errback,
            )

    def parse(self, response):
        page = response.meta.get("playwright_page")
        # This is where you'll extract the data from the crawled pages.
        # As an example, we'll just take the title of each page.
        try:
            page_title = response.xpath('//title/text()').get()
            yield {
                'url': response.url,
                'title': page_title,
            }
        finally:
            # Ensure the Playwright page is closed after processing, so the
            # request releases its concurrency slot instead of holding it forever
            if page:
                asyncio.ensure_future(page.close())

    async def errback(self, failure):
        page = failure.request.meta.get("playwright_page")
        if page:
            await page.close()
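A side note: asyncio.ensure_future only schedules the close, it does not wait for it. Since this project already runs the asyncio reactor, and Scrapy (2.7+) supports coroutine and async-generator callbacks there, a slightly cleaner variant is to make parse itself async and await the close directly. This is a sketch of drop-in replacements for the two callbacks in the spider above, not a different fix:

    async def parse(self, response):
        page = response.meta["playwright_page"]
        try:
            page_title = response.xpath('//title/text()').get()
            yield {
                'url': response.url,
                'title': page_title,
            }
        finally:
            # Awaiting here guarantees the page is closed before the
            # request's concurrency slot is handed back
            await page.close()

    async def errback(self, failure):
        page = failure.request.meta.get("playwright_page")
        if page:
            await page.close()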
Also, these are the additional settings I have added in settings.py, in case you are interested:
# Concurrency (both values are Scrapy's defaults, set explicitly)
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Increase timeouts
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 90 * 1000  # 90 seconds
DOWNLOAD_TIMEOUT = 120  # 120 seconds

# Retry failed requests
RETRY_ENABLED = True
RETRY_TIMES = 5

# Max Playwright contexts
PLAYWRIGHT_MAX_CONTEXTS = 4

# Logging level
LOG_LEVEL = 'DEBUG'

# Playwright download handlers
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# Standard settings from the default Scrapy project template
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
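One more safeguard worth mentioning (my own addition, not part of the fix above): scrapy-playwright's PLAYWRIGHT_MAX_PAGES_PER_CONTEXT setting caps how many pages each browser context may have open at once, so a page leak like this one stays bounded instead of exhausting memory. The value 4 below is an assumption to tune for your machine:

PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 4  # optional guard; with PLAYWRIGHT_MAX_CONTEXTS = 4 above, this caps open pages at 4 * 4 = 16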