puppeteer, apify

With Apify/Puppeteer, crawl all URLs except those that contain a word


With Apify/Puppeteer, how can I crawl all pages except those that include a certain word?

Inside handlePageFunction, the original code looks like this:

        await Apify.utils.enqueueLinks({
            requestQueue,
            page,
            pseudoUrls: [
                baseurl + '[.*]',
            ],
        });

This crawls all pages. If I want to skip page URLs that contain "foo", is there any way I could adjust something within pseudoUrls to do that?


Solution

  • As per the Apify documentation for PseudoUrl:

    A PURL is simply a URL with special directives enclosed in [] brackets. Currently, the only supported directive is [RegExp], which defines a JavaScript-style regular expression to match against the URL.

    Therefore, you can exclude URLs that contain foo by embedding a regular expression with a negative lookahead at the front of the pseudo-URL, like this:

    await Apify.utils.enqueueLinks({
        // ...
        pseudoUrls: [
            '[(?!.*foo)]' + baseurl + '[.*]',
        ],
    });
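
To sanity-check the lookahead without running a crawl, the pattern can be approximated with a plain JavaScript RegExp. This is only a sketch: it assumes the pseudo-URL is matched against the whole URL with the literal parts escaped, and `baseurl`, `escapeRegExp`, and the sample URLs are hypothetical values chosen for illustration, not part of the Apify API.

```javascript
// Hypothetical base URL, used only for illustration.
const baseurl = 'https://example.com/';

// Escape regex metacharacters so the base URL is matched literally.
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Hand-written approximation of the pattern the pseudo-URL describes:
// the negative lookahead at the start rejects any URL containing "foo",
// then the base URL must match, followed by anything.
const pattern = new RegExp('^(?!.*foo)' + escapeRegExp(baseurl) + '.*$');

// Sample URLs (assumed for this sketch).
const urls = [
    'https://example.com/',
    'https://example.com/products/1',
    'https://example.com/foo/1',
    'https://example.com/products/foobar',
    'https://other.example.org/page',
];

// Keep only the URLs the pattern accepts.
const accepted = urls.filter((url) => pattern.test(url));
console.log(accepted);
```

The two URLs containing "foo" are rejected by the lookahead, and the URL on another host is rejected because it does not start with the base URL.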
    

    What this pseudo-URL does: