With Apify/Puppeteer, how can I crawl all pages except those that include a certain word?
Inside the handlePageFunction, the original code looks like this:
await Apify.utils.enqueueLinks({
    requestQueue,
    page,
    pseudoUrls: [
        baseurl + '[.*]',
    ],
});
This crawls all pages. If I want to skip page URLs that contain "foo", is there any way to adjust pseudoUrls to do that?
As per Apify documentation for PseudoUrls:
A PURL is simply a URL with special directives enclosed in [] brackets. Currently, the only supported directive is [RegExp], which defines a JavaScript-style regular expression to match against the URL.
Therefore, you can prevent URLs that contain foo from matching by embedding a regular expression with a negative lookahead at the front, like this:
await Apify.utils.enqueueLinks({
    // ...
    pseudoUrls: [
        '[(?!.*foo)]' + baseurl + '[.*]',
    ],
});
What this does:
[ and ] mean that this part of the pseudoUrl is an embedded regex.
(?! and ) denote a negative lookahead group in a regular expression. This means that if the sub-regex contained inside matches, a match is refused for the main (outer) regex.
.* means that any characters may precede the string that you want to avoid matching.
foo is the string you want to avoid matching.
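To check the lookahead without running a crawler, you can try the same idea with a plain JavaScript RegExp. This is only a sketch: the base URL and the sample URLs are made up, and the pattern is a hand-built approximation of what the pseudoUrl expresses, not the exact regex Apify compiles internally.

```javascript
// Hypothetical base URL, used only for illustration.
const baseurl = 'https://example.com/';

// Escape regex metacharacters in the literal part of the URL.
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Hand-built approximation of the pattern that the pseudoUrl
// '[(?!.*foo)]' + baseurl + '[.*]' stands for (an assumption,
// not Apify's actual compiled regex).
const pattern = new RegExp('^(?!.*foo)' + escapeRegExp(baseurl) + '.*$');

// Made-up sample URLs.
const urls = [
    'https://example.com/products',
    'https://example.com/foo/page',
    'https://example.com/docs/foobar',
    'https://other.example.org/products',
];

// Keep only URLs under baseurl that do not contain "foo" anywhere.
const matched = urls.filter((url) => pattern.test(url));
console.log(matched); // [ 'https://example.com/products' ]
```

Because the lookahead sits at the very start of the pattern, it scans the whole URL, so "foo" is rejected wherever it appears, including inside a longer word such as "foobar".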