I'm creating a web api that scrapes a given url and sends that back. I am using Puppeteer to do this. I asked this question: Puppeteer not behaving like in Developer Console
and recieved an answer that suggested it would only work if headless was set to be false. I don't want to be constantly opening up a browser UI i don't need (I just the need the data!) so I'm looking for why headless has to be false and can I get a fix that lets headless = true.
Here's my code:
express()
.get("/*", (req, res) => {
global.notBaseURL = req.params[0];
(async () => {
const browser = await puppet.launch({ headless: false }); // Line of Interest
const page = await browser.newPage();
console.log(req.params[0]);
await page.goto(req.params[0], { waitUntil: "networkidle2" }); //this is the url
title = await page.$eval("title", (el) => el.innerText);
browser.close();
res.send({
title: title,
});
})();
})
.listen(PORT, () => console.log(`Listening on ${PORT}`));
This is the page I'm trying to scrape: https://www.nordstrom.com/s/zella-high-waist-studio-pocket-7-8-leggings/5460106?origin=coordinating-5460106-0-1-FTR-recbot-recently_viewed_snowplow_mvp&recs_placement=FTR&recs_strategy=recently_viewed_snowplow_mvp&recs_source=recbot&recs_page_type=category&recs_seed=0&color=BLACK
The reason it might work in UI mode but not headless is that sites who aggressively fight scraping will detect that you are running in a headless browser.
Some possible workarounds:
puppeteer-extra
Found here: https://github.com/berstend/puppeteer-extra Check out their docs for how to use it. It has a couple plugins that might help in getting past headless-mode detection:
puppeteer-extra-plugin-anonymize-ua
-- anonymizes your User Agent. Note that this might help with getting past headless mode detection, but as you'll see if you visit https://amiunique.org/ it is unlikely to be enough to keep you from being identified as a repeat visitor.puppeteer-extra-plugin-stealth
-- this might help win the cat-and-mouse game of not being detected as headless. There are many tricks that are employed to detect headless mode, and as many tricks to evade them.It's possible to run a single browser UI in a manner that let's you attach puppeteer to that running instance. Here's an article that explains it: https://medium.com/@jaredpotter1/connecting-puppeteer-to-existing-chrome-window-8a10828149e0
Essentially you're starting Chrome or Chromium (or Edge?) from the command line with --remote-debugging-port=9222
(or any old port?) plus other command line switches depending on what environment you're running it in. Then you use puppeteer to connect to that running instance instead of having it do the default behavior of launching a headless Chromium instance: const browser = await puppeteer.connect({ browserURL: ENDPOINT_URL });
. Read the puppeteer docs here for more info: https://pptr.dev/#?product=Puppeteer&version=v5.2.1&show=api-puppeteerlaunchoptions
The ENDPOINT_URL
is displayed in the terminal when you launch the browser from the command line with the --remote-debugging-port=9222
option.
This option is going to require some server/ops mojo, so be prepared to do a lot more Stack Overflow searches. :-)
There are other strategies I'm sure but those are the two I'm most familiar with. Good luck!