node.js, puppeteer, puppeteer-cluster

Puppeteer: how to wait only for the first response (HTML)


I'm using puppeteer-cluster to crawl web pages.

If I open many pages at once on a single website (8-10 pages), the connection slows down and many timeout errors come up, like this:

TimeoutError: Navigation Timeout Exceeded: 30000ms exceeded

I need to access only the HTML code of each page. I don't need to wait for domcontentloaded and so on.

Is there a way to tell page.goto() to wait only for the first response from the web server? Or do I need to use another technology instead of puppeteer?


Solution

  • domcontentloaded is the event that corresponds to the initial HTML content being ready.

    The DOMContentLoaded event fires when the initial HTML document has been completely loaded and parsed, without waiting for stylesheets, images, and subframes to finish loading.

    The following call resolves as soon as the initial HTML document has been loaded and parsed.

    await page.goto(url, {waitUntil: 'domcontentloaded'})
    

    In addition, you can block images, stylesheets and similar resources to save bandwidth and load even faster, which helps when you are loading 10 pages at once.

    Put the code below before the call to page.goto, and it will stop images, stylesheets, fonts and scripts from loading. Note that with scripts blocked, content rendered by JavaScript will not appear in the HTML.

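    // Enable interception so each request can be inspected before it is sent.
    await page.setRequestInterception(true);
    page.on('request', (request) => {
        // Abort resource types that are not needed to read the HTML.
        if (['image', 'stylesheet', 'font', 'script'].includes(request.resourceType())) {
            request.abort();
        } else {
            request.continue();
        }
    });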
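
The two steps above can be combined into one helper. This is only a minimal sketch: the names BLOCKED_TYPES, handleRequest and fetchHtml are made up for illustration, and it assumes `page` is a Puppeteer page that does not yet have request interception set up.

```javascript
// Hypothetical helper names; only the page.* and request.* calls are Puppeteer's.
// Resource types that are not needed to read the HTML. Adjust as required.
const BLOCKED_TYPES = new Set(['image', 'stylesheet', 'font', 'script']);

// Abort blocked resource types and let every other request through.
function handleRequest(request) {
  if (BLOCKED_TYPES.has(request.resourceType())) {
    return request.abort();
  }
  return request.continue();
}

// Navigate with interception enabled and return the page's HTML.
// Assumes `page` is a fresh Puppeteer Page with no other request listeners.
async function fetchHtml(page, url) {
  await page.setRequestInterception(true);
  page.on('request', handleRequest);
  // Resolve once the initial HTML is loaded and parsed, not on full load.
  await page.goto(url, { waitUntil: 'domcontentloaded' });
  return page.content();
}
```

With puppeteer-cluster, something like `const html = await fetchHtml(page, url)` could be called inside the task function, where `page` is the page object the cluster passes to the task.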