node.js apify crawlee

Blocking specific resources (CSS, images, videos, etc.) using Crawlee and Playwright


I'm using crawlee@3.0.3 (not yet released, installed from GitHub), and I'm trying to block specific resources from loading with playwrightUtils.blockRequests, which isn't available in earlier versions. The code suggested in the official repo works as expected:

import { launchPlaywright, playwrightUtils } from 'crawlee';

const browser = await launchPlaywright();
const page = await browser.newPage();
await playwrightUtils.blockRequests(page, {
    // extraUrlPatterns: ['adsbygoogle.js'],
});
await page.goto('https://cnn.com');
await page.screenshot({ path: 'cnn_no_images.png' });
await browser.close();

The screenshot shows that the images aren't loaded. My problem appears when I use PlaywrightCrawler instead:

const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 3,
    async requestHandler({ page, request }) {
        console.log(`Processing: ${request.url}`);
        await playwrightUtils.blockRequests(page);
        await page.screenshot({ path: 'cnn_no_images2.png' });
    },
});

This way, the resources aren't blocked. My guess is that blockRequests needs launchPlaywright to work, and I don't see a way to pass that to PlaywrightCrawler. blockRequests has already been available for Puppeteer, so maybe someone has tried this before.

Also, I've tried route interception, but again, I couldn't make it work with PlaywrightCrawler.


Solution

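  • You can register listeners or run any setup code before navigation by using preNavigationHooks. By the time requestHandler runs, the crawler has already navigated to the page and its resources have loaded, so calling blockRequests there is too late to affect that page load. Move the call into a hook like this: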

    
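    import { PlaywrightCrawler, playwrightUtils } from 'crawlee';

    const crawler = new PlaywrightCrawler({
        maxRequestsPerCrawl: 3,
        // Hooks run before the crawler navigates, so blocking is active for the page load.
        preNavigationHooks: [async ({ page }) => {
            await playwrightUtils.blockRequests(page);
        }],
        // The page has already loaded by the time this handler runs.
        async requestHandler({ page, request }) {
            console.log(`Processing: ${request.url}`);
            await page.screenshot({ path: 'cnn_no_images2.png' });
        },
    });

    // Start URL taken from the question's first example.
    await crawler.run(['https://cnn.com']);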
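
Since the question also mentions route interception: the same hook mechanism should work for that too, because the route only needs to be registered before navigation. Below is a minimal sketch, not a tested Crawlee recipe. It relies only on Playwright's page.route, route.request(), route.abort() and route.continue(). The names BLOCKED_TYPES, shouldBlock and blockResourcesHook are made up for illustration, and the list of blocked resource types is an assumption you would adjust.

```javascript
// Resource types to skip. These strings are values that Playwright's
// request.resourceType() reports; the selection itself is an assumption.
const BLOCKED_TYPES = new Set(['image', 'media', 'font', 'stylesheet']);

// Hypothetical helper: decides whether a request should be aborted,
// either by resource type or by a plain substring match on the URL.
function shouldBlock(resourceType, url, extraUrlPatterns = []) {
    if (BLOCKED_TYPES.has(resourceType)) return true;
    return extraUrlPatterns.some((pattern) => url.includes(pattern));
}

// Hypothetical pre-navigation hook: registers the route on the page
// it receives, before the crawler navigates to the request URL.
async function blockResourcesHook({ page }) {
    await page.route('**/*', (route) => {
        const req = route.request();
        return shouldBlock(req.resourceType(), req.url(), ['adsbygoogle.js'])
            ? route.abort()
            : route.continue();
    });
}

// Intended usage with the crawler from the answer above:
//   new PlaywrightCrawler({ preNavigationHooks: [blockResourcesHook], requestHandler })
```

Compared with blockRequests, this gives per-request control at the cost of a round trip to Node.js for every request. As far as I know, Playwright also disables the HTTP cache while routing is enabled, so prefer blockRequests when simple URL patterns are enough.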