I'm having a problem with Puppeteer: I want to extract a list of items, and it works when headless is false but not when it is true.
First things first, I want to get those elements before mapping over them.
Here's my script; it is really basic, so you may be able to reproduce the issue.
const puppeteer = require("puppeteer");
const chalk = require("chalk");
const baseUrl = "https://www.interencheres.com/recherche/lots?search=";
const searchTerm = "Apple";
const searchUrl = baseUrl + searchTerm;
(async () => {
  const browser = await puppeteer.launch({
    headless: false,
    ignoreHTTPSErrors: true,
    args: [`--window-size=1920,1080`],
    defaultViewport: {
      width: 1920,
      height: 1080,
    },
  });
  const page = await browser.newPage();
  // Begin navigation
  console.log(chalk.yellow("Beginning navigation."));
  await page.goto(searchUrl);
  // Await list of elements
  console.log(chalk.yellow("Wait for Network Idle..."));
  await page.waitForNetworkIdle();
  // Get items
  const findElements = await page.evaluate(() => {
    const elements = document.querySelectorAll(".sale-item");
    console.log(elements);
    return elements;
  });
  console.log(findElements);
  console.log(chalk.blue("Waiting..."));
  await page.waitForTimeout(10000);
  await browser.close();
  console.log(chalk.red("Closed."));
})();
Expected results:
{
'0': { _prevClass: 'sale-item pa-1 col-sm-6 col-md-4 col-lg-3 col-12' },
'1': { _prevClass: 'sale-item pa-1 col-sm-6 col-md-4 col-lg-3 col-12' },
'2': { _prevClass: 'sale-item pa-1 col-sm-6 col-md-4 col-lg-3 col-12' },
'3': { _prevClass: 'sale-item pa-1 col-sm-6 col-md-4 col-lg-3 col-12' },
'4': { _prevClass: 'sale-item pa-1 col-sm-6 col-md-4 col-lg-3 col-12' },
.
.
}
For starters, I'd prefer page.waitForSelector(yourSelector) over page.waitForNetworkIdle(). In most cases, it's a more direct guarantee that the data you want is on the page, whereas network idle can block waiting for all sorts of requests that are totally irrelevant to the data you're trying to scrape. Another option is page.waitForResponse(predicate).
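To illustrate the page.waitForResponse(predicate) option, here is a minimal sketch. The "/api/" URL fragment and the helper names isSearchResponse and gotoAndWaitForItems are assumptions made up for illustration, not taken from the site; check the network tab for the request that actually carries the results.

```javascript
// Hypothetical predicate: "/api/" is an assumed URL fragment, not the real
// endpoint. Replace it with whatever the network tab shows for the results.
function isSearchResponse(response) {
  return response.url().includes("/api/") && response.status() === 200;
}

// `page` is a Puppeteer Page; `url` and `selector` are supplied by the caller.
async function gotoAndWaitForItems(page, url, selector) {
  // Register the response listener together with the navigation so a fast
  // response cannot slip by before we start waiting for it.
  await Promise.all([
    page.waitForResponse(isSearchResponse),
    page.goto(url, {waitUntil: "domcontentloaded"}),
  ]);
  // The data has arrived; now wait until it is rendered into the DOM.
  return page.waitForSelector(selector);
}
```

Calling gotoAndWaitForItems(page, searchUrl, ".sale-item-wrapper") would then stand in for the goto and wait steps of a script, assuming the predicate matches a real request.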
Some websites check request headers to block scrapers. You can try changing the user agent header, as described in the Puppeteer GitHub issue "Different behavior between { headless: false } and { headless: true }" (#665):
const puppeteer = require("puppeteer"); // ^22.6.0
const baseUrl = "https://www.interencheres.com/recherche/lots?search=";
const searchTerm = "Apple";
const searchUrl = baseUrl + encodeURIComponent(searchTerm);
let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  const ua =
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36";
  await page.setUserAgent(ua);
  await page.goto(searchUrl, {waitUntil: "domcontentloaded"});
  await page.waitForSelector(".sale-item-wrapper");
  const elements = await page.$$(".sale-item-wrapper");
  console.log(elements.length); // => 48
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
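A hardcoded user agent string goes stale as Chrome updates. As a possible variation, the sketch below derives the string from the running browser and only swaps the "HeadlessChrome" token that headless Chrome typically reports. The helper names stripHeadless and useHeadfulLookingUserAgent are hypothetical, and whether this alone gets past a given site's checks is an assumption you would need to verify.

```javascript
// Headless Chrome typically reports "HeadlessChrome" in its user agent.
// Swapping that token for "Chrome" keeps the rest of the string (platform,
// version) in sync with the browser actually being driven.
function stripHeadless(userAgent) {
  return userAgent.replace("HeadlessChrome", "Chrome");
}

// `browser` and `page` are the Puppeteer objects from the script above.
async function useHeadfulLookingUserAgent(browser, page) {
  const ua = stripHeadless(await browser.userAgent());
  await page.setUserAgent(ua);
  return ua;
}
```

In the script above, `await useHeadfulLookingUserAgent(browser, page);` would replace the hardcoded `ua` constant and the `page.setUserAgent(ua)` call.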
Using puppeteer-extra, as described in "Why does headless need to be false for Puppeteer to work?", is another option you can try. It also uses random browser user agent headers, among other tricks, to make the fingerprint less detectable to anti-bot services.
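For reference, a minimal sketch of the puppeteer-extra setup might look like the following. It assumes the puppeteer, puppeteer-extra and puppeteer-extra-plugin-stealth packages are installed; the wrapper name launchStealthBrowser is hypothetical, and there is no guarantee the stealth plugin gets past any particular site.

```javascript
// Assumes: npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
// The packages are required inside the function so nothing is loaded until a
// browser is actually requested. Intended to be called once per process.
async function launchStealthBrowser(launchOptions = {}) {
  const puppeteer = require("puppeteer-extra");
  const StealthPlugin = require("puppeteer-extra-plugin-stealth");
  // Register the stealth plugin before launching the browser.
  puppeteer.use(StealthPlugin());
  return puppeteer.launch(launchOptions);
}
```

Replacing `puppeteer.launch()` with `launchStealthBrowser()` should leave the rest of the script unchanged, since puppeteer-extra is designed as a drop-in wrapper around Puppeteer.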