
How do I combine puppeteer plugins with puppeteer clusters?


I have a list of URLs that need to be scraped from a website that uses React, so I am using Puppeteer. To avoid being blocked by anti-bot measures, I have added puppeteer-extra-plugin-stealth. To prevent ads from loading on the pages, I am blocking them with puppeteer-extra-plugin-adblocker. I also want to keep my IP address from being blacklisted, so I route traffic through TOR nodes to get different IP addresses. Below is a simplified version of my code. The setup works (TOR_port and webUrl are actually assigned dynamically, but to simplify the question I have hard-coded them as variables). There is a problem, though:

const puppeteer = require('puppeteer-extra');
const _StealthPlugin = require('puppeteer-extra-plugin-stealth');
const _AdblockerPlugin = require('puppeteer-extra-plugin-adblocker');

puppeteer.use(_StealthPlugin());
puppeteer.use(_AdblockerPlugin());

var TOR_port = 13931;
var webUrl ='https://www.zillow.com/homedetails/2861-Bass-Haven-Ln-Saint-Augustine-FL-32092/47739703_zpid/';


// Wrapped in an async function so that `await` is valid
(async () => {
    const browser = await puppeteer.launch({
        dumpio: false,
        headless: false,
        args: [
            `--proxy-server=socks5://127.0.0.1:${TOR_port}`,
            `--no-sandbox`,
        ],
        ignoreHTTPSErrors: true,
    });

    try {
        const page = await browser.newPage();
        await page.setViewport({ width: 1280, height: 720 });
        await page.goto(webUrl, {
            waitUntil: 'load',
            timeout: 30000,
        });

        try {
            // Awaited so that a selector timeout is caught right here
            await page.waitForSelector('.price');
            console.log('The price is available');
        } catch (e) {
            // close this since it is clearly not a zillow website
            throw new Error('This is not the zillow website');
        }
    } catch (e) {
        console.error(e.message);
    } finally {
        // Close the browser on success and on failure
        await browser.close();
    }
})();

The above setup works but is very unreliable, and I recently learnt about Puppeteer-Cluster. I need it to help me manage crawling multiple pages and to track my scraping tasks.

So, my question is: how do I implement Puppeteer-Cluster with the above setup? I am aware of an example (https://github.com/thomasdondorf/puppeteer-cluster/blob/master/examples/different-puppeteer-library.js) offered by the library that shows how to use plugins, but it is so bare that I didn't quite understand it.

How do I implement Puppeteer-Cluster with the above TOR, AdBlocker, and Stealth configurations?


Solution

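  • You can just hand over your puppeteer-extra instance to Cluster.launch through its puppeteer option, like the following:

    const { Cluster } = require('puppeteer-cluster');
    const puppeteer = require('puppeteer-extra');
    const _StealthPlugin = require('puppeteer-extra-plugin-stealth');
    const _AdblockerPlugin = require('puppeteer-extra-plugin-adblocker');
    
    puppeteer.use(_StealthPlugin());
    puppeteer.use(_AdblockerPlugin());
    
    // Inside an async function: launch the cluster (not puppeteer itself)
    // and pass it the plugin-enabled instance
    const cluster = await Cluster.launch({
        puppeteer,
    });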
    

    Src: https://github.com/thomasdondorf/puppeteer-cluster#clusterlaunchoptions
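
To tie this back to the TOR, AdBlocker, and Stealth setup from the question, here is a minimal sketch of how the pieces might fit together. The helper names torLaunchOptions and scrapeAll are hypothetical, the maxConcurrency value is an arbitrary assumption, and the sketch sends every worker through a single TOR port; it does not rotate ports. The launch arguments are copied from the question and passed through the cluster's puppeteerOptions.

```javascript
// Hypothetical helper: Chromium launch options that route traffic
// through a local TOR SOCKS5 port (same flags as in the question).
function torLaunchOptions(torPort) {
    return {
        headless: false,
        ignoreHTTPSErrors: true,
        args: [`--proxy-server=socks5://127.0.0.1:${torPort}`, '--no-sandbox'],
    };
}

// Hypothetical helper: visit every URL in `urls` through one TOR port.
async function scrapeAll(urls, torPort) {
    // Loaded here so torLaunchOptions stays usable without these packages.
    const { Cluster } = require('puppeteer-cluster');
    const puppeteer = require('puppeteer-extra');
    const StealthPlugin = require('puppeteer-extra-plugin-stealth');
    const AdblockerPlugin = require('puppeteer-extra-plugin-adblocker');

    puppeteer.use(StealthPlugin());
    puppeteer.use(AdblockerPlugin());

    const cluster = await Cluster.launch({
        puppeteer, // the plugin-enabled puppeteer-extra instance
        concurrency: Cluster.CONCURRENCY_BROWSER, // one browser per worker
        maxConcurrency: 2, // assumed value, tune for your machine
        puppeteerOptions: torLaunchOptions(torPort),
    });

    // Errors thrown inside a queued task are emitted as 'taskerror' events.
    cluster.on('taskerror', (err, data) => {
        console.error(`Failed to scrape ${data}: ${err.message}`);
    });

    // The task runs once per queued URL, each time with a fresh page.
    await cluster.task(async ({ page, data: url }) => {
        await page.setViewport({ width: 1280, height: 720 });
        await page.goto(url, { waitUntil: 'load', timeout: 30000 });
        await page.waitForSelector('.price'); // rejects if the selector never appears
        console.log(`The price is available on ${url}`);
    });

    for (const url of urls) {
        cluster.queue(url);
    }

    await cluster.idle(); // wait until the queue is empty
    await cluster.close();
}

module.exports = { torLaunchOptions, scrapeAll };
```

Calling scrapeAll([webUrl], TOR_port) from an async context would then replace the manual launch/newPage/close handling in the question; the cluster takes care of opening and closing pages and of reporting failed URLs.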