Tags: node.js, cheerio, apify

How to fix: "Crawler reached the maxRequestsPerCrawl limit of 1 requests and will shut down soon"


CheerioCrawler is not crawling when I set maxRequestsPerCrawl to 1. Even when I set maxRequestsPerCrawl to 10 or 100, nothing is crawled anymore after the 10th or 100th request. How can I get past this limitation?

I create a new CheerioCrawler instance for every single request; no parallel requests are necessary in my use case. However, the requests are counted globally, whether I use a new crawler instance for every request or a shared one.

Once the total request count reaches the value of maxRequestsPerCrawl, all further requests are denied. The only workaround I have found is to shut down the whole process and start it again.

Log:

INFO  CheerioCrawler: Starting the crawl
INFO  CheerioCrawler: Crawler reached the maxRequestsPerCrawl limit of 1 requests and will shut down soon. Requests that are in progress will be allowed to finish.
INFO  CheerioCrawler: Earlier, the crawler reached the maxRequestsPerCrawl limit of 1 requests and all requests that were in progress at that time have now finished. In total, the crawler processed 1 requests and will shut down.
INFO  CheerioCrawler: Crawl finished. Final request statistics: {"requestsFinished":0,"requestsFailed":0,"retryHistogram":[],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":190}

My Code:

    import { CheerioCrawler } from 'crawlee';

    const crawler = new CheerioCrawler({
        minConcurrency: 1,
        maxConcurrency: 1,

        // proxyConfiguration: {},

        // On error, retry each page at most once.
        maxRequestRetries: 1,

        // Increase the timeout for processing of each page.
        requestHandlerTimeoutSecs: 30,

        // Limit to 1 request per crawl.
        maxRequestsPerCrawl: 1,

        async requestHandler({ request, $, proxyInfo }) {
            // ...
        },
    });

    await crawler.run([url]);
    await crawler.teardown();

What am I missing here to be able to run the crawler even with thousands of requests in a row?


Solution

  • As outlined in this GitHub comment, you can prevent this from happening by providing a new Configuration to your crawler each time. The request counter is persisted as part of the crawler's state in the default storage, so a fresh crawler instance would otherwise resume from the stored count; passing persistStorage: false gives every run a clean slate:

    import { CheerioCrawler, Configuration } from 'crawlee';

    const crawler = new CheerioCrawler(
        {
            maxRequestsPerCrawl: 10,
            minConcurrency: 1,
            maxConcurrency: 5,

            async requestHandler({ request, $, enqueueLinks, log }) {
                // Crawl
            },

            async failedRequestHandler({ request, log }) {
                log.error(`Request failed: ${request.url}`);
            },
        },
        new Configuration({ persistStorage: false })
    );
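
  • To show how this applies to the original use case (one crawler instance per request, run sequentially), here is a minimal sketch. The crawlOnce helper and the urls array are hypothetical names, not part of the Crawlee API; the point is only that each run gets its own crawler with a fresh, non-persistent Configuration:

    import { CheerioCrawler, Configuration } from 'crawlee';

    // Hypothetical helper: builds a fresh crawler with its own
    // non-persistent Configuration, so each run starts with a
    // clean request counter.
    async function crawlOnce(url) {
        const crawler = new CheerioCrawler(
            {
                maxRequestsPerCrawl: 1,
                maxConcurrency: 1,
                async requestHandler({ request, $ }) {
                    console.log(`Crawled ${request.url}: ${$('title').text()}`);
                },
            },
            new Configuration({ persistStorage: false })
        );
        await crawler.run([url]);
        await crawler.teardown();
    }

    // 'urls' is assumed to be an array of URLs to process one at a time.
    // Thousands of sequential runs no longer trip the limit, because no
    // state is persisted between crawler instances.
    for (const url of urls) {
        await crawlOnce(url);
    }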