I am trying to crawl a list of provided URLs using Scrapy-Playwright, but I ran into some strange behavior: the crawl starts nicely, then every time, after a certain number of pages have been crawled, it stops and keeps printing log lines like this:
2024-09-23 08:30:01 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
No matter which set of URLs I provide, it always freezes like this at the same point.
Following is my spider (my_spider.py):
import scrapy
from faker import Faker
from scrapy.spiders import Rule

fake = Faker()

class MySpider(scrapy.Spider):
    name = 'my_spider'

    def __init__(self):
        super().__init__()
        # List of URLs to start scraping
        self.start_urls = ['https://dummy0, ...', 'https://dummy100, ...']  # some list of URLs, more than 16
        self._rules = [Rule(callback=self.parse)]

    def start_requests(self):
        # Define the initial URL(s) to scrape
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                headers={'User-Agent': fake.user_agent()},
                meta={
                    'playwright': True,
                    'playwright_include_page': True,
                },
                # errback is a Request argument, not a meta key,
                # otherwise Scrapy never calls it
                errback=self.errback,
            )

    def parse(self, response):
        page_title = response.xpath('//title/text()').get()
        yield {
            'url': response.url,
            'title': page_title,
        }

    async def errback(self, failure):
        page = failure.request.meta["playwright_page"]
        await page.close()
In settings.py (as I am using Playwright):
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 60 * 1000  # 60 seconds
Why is this happening?
The reason was that I was running out of allocated resources for this process. If you look carefully, the default value of CONCURRENT_REQUESTS (found in settings.py) is also 16, which is exactly where the crawl froze: with playwright_include_page=True, each request received a Playwright page that was never closed, so the first 16 requests never released their concurrency slots.
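If you want to confirm this diagnosis before touching any code (my own suggestion, not part of the original fix), lower the concurrency limit and check that the freeze point moves with it; the value 4 below is an arbitrary test value:

CONCURRENT_REQUESTS = 4  # diagnostic only: if the crawl now stalls after 4 pages instead of 16, the slots are being exhausted by requests that never release their resources

Following is the way it is fixed: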
I am still keeping this at 16:
CONCURRENT_REQUESTS = 16 # in settings.py
But I am now releasing the resources after use (in my_spider.py):
import asyncio

import scrapy
from faker import Faker
from scrapy.spiders import Rule

fake = Faker()

class MySpider(scrapy.Spider):
    name = 'my_spider'

    def __init__(self):
        super().__init__()
        # List of URLs to start scraping
        self.start_urls = ['https://dummy0, ...', 'https://dummy100, ...']  # some list of URLs, more than 16
        self._rules = [Rule(callback=self.parse)]

    def start_requests(self):
        # Define the initial URL(s) to scrape
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                headers={'User-Agent': fake.user_agent()},
                meta={
                    'playwright': True,
                    'playwright_include_page': True,
                },
                # errback is a Request argument, not a meta key
                errback=self.errback,
            )

    def parse(self, response):
        page = response.meta.get("playwright_page")
        # This is where you'll extract the data from the crawled pages.
        # As an example, we'll just take the title of each page.
        try:
            page_title = response.xpath('//title/text()').get()
            yield {
                'url': response.url,
                'title': page_title,
            }
        finally:
            # Ensure the Playwright page is closed after processing, so the
            # request releases its concurrency slot instead of holding it forever
            if page:
                asyncio.ensure_future(page.close())

    async def errback(self, failure):
        page = failure.request.meta.get("playwright_page")
        if page:
            await page.close()
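A side note: asyncio.ensure_future only schedules the close, it does not wait for it. Since this project already runs the asyncio reactor, and Scrapy (2.7+) supports coroutine and async-generator callbacks there, a slightly cleaner variant is to make parse itself async and await the close directly. This is a sketch of drop-in replacements for the two callbacks in the spider above, not a different fix:

    async def parse(self, response):
        page = response.meta["playwright_page"]
        try:
            page_title = response.xpath('//title/text()').get()
            yield {
                'url': response.url,
                'title': page_title,
            }
        finally:
            # Awaiting here guarantees the page is closed before the
            # request's concurrency slot is handed back
            await page.close()

    async def errback(self, failure):
        page = failure.request.meta.get("playwright_page")
        if page:
            await page.close()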
Also, these are the additional settings I have added in settings.py, in case you are interested:
# Concurrency (both values are Scrapy's defaults, set explicitly)
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Increase timeouts
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 90 * 1000  # 90 seconds
DOWNLOAD_TIMEOUT = 120  # 120 seconds

# Retry failed requests
RETRY_ENABLED = True
RETRY_TIMES = 5

# Max Playwright contexts
PLAYWRIGHT_MAX_CONTEXTS = 4

# Logging level
LOG_LEVEL = 'DEBUG'

# Playwright download handlers
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# Standard settings from the default Scrapy project template
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
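One more safeguard worth mentioning (my own addition, not part of the fix above): scrapy-playwright's PLAYWRIGHT_MAX_PAGES_PER_CONTEXT setting caps how many pages each browser context may have open at once, so a page leak like this one stays bounded instead of exhausting memory. The value 4 below is an assumption to tune for your machine:

PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 4  # optional guard; with PLAYWRIGHT_MAX_CONTEXTS = 4 above, this caps open pages at 4 * 4 = 16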