Tags: javascript, web-crawler, playwright, apify, crawlee

Playwright Crawler Error: "Target page, context or browser has been closed"


I am using Playwright with Crawlee to crawl web pages and analyze data. However, I'm encountering a persistent issue where, upon trying to access page data using locator methods, I receive the following error message: "Target page, context or browser has been closed". I have reviewed my code and checked various resources but have not been able to identify the cause of this error.

Here is the snippet where the error occurs, specifically when invoking locator.count():

const myComparisonFunction: MyAnalyzerComparator<Frame | Page> = async (element, data) => {
  if (!data.length) return false;

  // Build an XPath expression that requires every data.value to be present, in any order
  const xpathChecks = data.map((d) => `contains(., "${d.value}")`).join(' and ');
  const locator = element.locator(`xpath=//tr[${xpathChecks}]`);

  // Count the rows that match all conditions; this is the call that throws
  const matchingRowsCount = await locator.count();
  return matchingRowsCount > 0;
};

The above function is part of a larger class that utilizes Crawlee and Playwright to resolve page URLs and handle data extraction and analysis. Below is the class responsible for setting up the Playwright crawler:

import { BasicCrawler, PlaywrightCrawler } from 'crawlee';
import { Page } from 'playwright';

// ... [Rest of the CrawleePromiseResolver and PlaywrightCrawleePageResolver class definitions]

export class PlaywrightCrawleePageResolver extends CrawleePromiseResolver<PlaywrightCrawler, Page> {
  constructor() {
    super((resolveUrl) => new PlaywrightCrawler({
      keepAlive: true,
      maxRequestsPerCrawl: 10,
      // ... [Other configuration options]
      async requestHandler({ page, request, log }) {
        log.info(`Processing ${request.url} ...`);
        resolveUrl(request.url, page);
      },
    }));
  }
}

And here's how I'm using the PlaywrightCrawleePageResolver class:

const playwrightCrawleePageResolver = new PlaywrightCrawleePageResolver();

// ... [Usage of playwrightCrawleePageResolver to define proximity in HTML]

const analyzeElementProximity: MyStrategyType<Frame | Page> = async ({ element, rules, dataItems }) => {
  const analyzer = new MyAnalyzerClass(rules, dataItems, myComparisonFunction);
  await analyzer.analyze(element);
  return analyzer.detectedIds;
};

The error surfaces at the defineProximityInHtmlTable function call, which is where locator.count() is eventually invoked.

I am using Crawlee and Playwright together to resolve promises with the crawled pages, analyze their contents, and extract information. The error appears after the pages have been crawled, once it is time to interact with the elements via Playwright's locators.

I'd greatly appreciate it if anyone could point out what might be causing this issue or provide guidance on how to prevent the "Target page, context or browser has been closed" error from occurring when accessing page data.

If you need any additional details to help me, just let me know. Thanks.

To diagnose the issue, I took the following steps:

  1. Debugging the Code: I inserted breakpoints and used logging statements to trace the execution flow, ensuring that the error occurs specifically at the locator.count() call.
  2. Reviewing Playwright Documentation: I thoroughly read through the Playwright API documentation to ensure I am using the locator object and the count() method correctly.
  3. Searching Online: I searched for similar issues on forums, Stack Overflow, and GitHub issues related to Playwright and Crawlee. However, I did not find any scenarios that closely match mine.
  4. Checking for Race Conditions: I considered the possibility of a race condition where the page might be closing before the locator can perform the count, but I couldn't find concrete evidence of this in my code (see the sketch after this list).
  5. Reviewing Crawlee Documentation: I reviewed Crawlee's documentation to ensure that I'm handling the asynchronous nature of page resolution correctly when integrating with Playwright.
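
For context on step 4, this is the kind of race I was trying to rule out. It is a minimal, hypothetical sketch rather than my actual code: the request handler hands the page out through a callback and returns, and the crawler is then free to close the page before the analysis code ever calls locator.count().

import { PlaywrightCrawler } from 'crawlee';
import type { Page } from 'playwright';

let resolvedPage: Page | undefined;

const crawler = new PlaywrightCrawler({
  async requestHandler({ page, request, log }) {
    log.info(`Processing ${request.url} ...`);
    // Hand the page out and return immediately; nothing keeps it open.
    resolvedPage = page;
  },
});

await crawler.run(['https://example.com']);

// By the time this runs, the request handler has long since returned and the
// crawler may already have closed the page, so this call can throw
// "Target page, context or browser has been closed".
const count = await resolvedPage?.locator('xpath=//tr').count();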

I was expecting that, after successfully crawling the pages, I would be able to analyze the HTML content using Playwright's locators without encountering any closed context or browser errors, especially because the pages should be active and the context should be open due to keepAlive: true.

I am looking for insights or suggestions on what might be going wrong and how to ensure that the target page remains open for the duration needed to analyze the page content.


Solution

  • The page that Crawlee hands to your code is only guaranteed to stay open while requestHandler is running; once the handler returns, the crawler is free to close the page, which is why the later locator calls fail. To make it work, put all of the logic inside requestHandler, and if that is not possible because of your context/logic, pass in a callback that requestHandler invokes (and awaits), so the work defined elsewhere still runs while the page is open.
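
A minimal sketch of the callback variant, assuming names like createCrawler and PageAnalysis purely for illustration (they are not part of the original code, and keepAlive is omitted to keep the example self-contained):

import { PlaywrightCrawler } from 'crawlee';
import type { Page } from 'playwright';

// The analysis to run against each crawled page; in the question this would
// wrap MyAnalyzerClass / analyzeElementProximity.
type PageAnalysis = (page: Page, url: string) => Promise<void>;

function createCrawler(analyze: PageAnalysis) {
  return new PlaywrightCrawler({
    maxRequestsPerCrawl: 10,
    async requestHandler({ page, request, log }) {
      log.info(`Processing ${request.url} ...`);
      // Run and await the callback here, while the page is still open.
      // Crawlee only closes the page after requestHandler resolves.
      await analyze(page, request.url);
    },
  });
}

// Hypothetical usage: the locator work now happens inside the handler's
// lifetime, so locator.count() no longer races against the page being closed.
const crawler = createCrawler(async (page, url) => {
  const locator = page.locator('xpath=//tr[contains(., "some value")]');
  const matchingRowsCount = await locator.count();
  console.log(`${url}: ${matchingRowsCount} matching rows`);
});

await crawler.run(['https://example.com']);

The same idea applies to the resolver class from the question: instead of resolving with the page and then returning, requestHandler should invoke and await whatever needs the page before it returns.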