I am trying to open a public company page on LinkedIn using Puppeteer, but every time it is redirected to an authentication form. This does not happen when I manually paste the URL in Chromium or in Chrome.
This is the code:
const puppeteer = require("puppeteer");

(async () => {
  const url = "https://www.linkedin.com/company/google/";
  const browser = await puppeteer.launch({
    headless: false,
    args: [
      "--lang=en-GB",
      "--no-sandbox",
      "--disable-setuid-sandbox",
      "--disable-gpu",
      "--disable-dev-shm-usage",
    ],
    defaultViewport: null,
    pipe: true,
    slowMo: 30,
  });
  const page = await browser.newPage();
  await page.goto(url, {
    waitUntil: "networkidle0",
  });
  await page.waitForSelector(".top-card-layout__entity-info-container", { timeout: 10000 });
  await page.close();
  await browser.close();
})();
This is where the browser is redirected:
This does not happen if I manually paste the URL https://www.linkedin.com/company/google/ in Chromium or Chrome.
What I have tried so far:
An incognito browser context:
  // [...]
  const context = await browser.createIncognitoBrowserContext();
  const page = await context.newPage();
  // [...]
puppeteer-extra with the stealth plugin:
  const puppeteer = require("puppeteer-extra");
  puppeteer.use(require("puppeteer-extra-plugin-stealth")());
  // [...]
A random user agent:
  const randomUserAgent = require("random-useragent");
  // [...]
  await page.setUserAgent(randomUserAgent.getRandom());
  // [...]
Nothing has worked. Is there anything else I can try?
This is caused by Microsoft's aggressive protection of LinkedIn profiles. If you are able to open public profiles in incognito mode, I suspect some shared cookies are responsible; normally you cannot view public company profiles on LinkedIn without logging in, because of the AuthWall (which is what blocks you in this case). For me, login is required every time, even from a non-incognito window.
A bit of background from data expert John Koala:
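To make the redirect visible in the script instead of waiting for a selector that never appears, one option is to inspect the final URL after navigation. The following is only a rough sketch: `looksLikeAuthRedirect` and `openCompanyPage` are hypothetical helper names, and the path prefixes in the list are assumptions about where LinkedIn sends unauthenticated visitors, not documented behaviour.

```javascript
// Path prefixes assumed (not documented) to indicate a login or auth-wall page.
const AUTH_PATH_PREFIXES = ["/authwall", "/login", "/uas/login", "/checkpoint"];

// Hypothetical helper: returns true if the URL path starts with one of the prefixes above.
function looksLikeAuthRedirect(finalUrl) {
  const { pathname } = new URL(finalUrl);
  return AUTH_PATH_PREFIXES.some((prefix) => pathname.startsWith(prefix));
}

// Hypothetical helper: opens the page and fails fast when the navigation ended on an auth page.
async function openCompanyPage(url) {
  // Required lazily so the URL check above can be used without a browser installed.
  const puppeteer = require("puppeteer");
  const browser = await puppeteer.launch({ headless: false });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "domcontentloaded" });
    const finalUrl = page.url(); // URL after any redirects
    if (looksLikeAuthRedirect(finalUrl)) {
      throw new Error(`Redirected to an authentication page: ${finalUrl}`);
    }
    await page.waitForSelector(".top-card-layout__entity-info-container", { timeout: 10000 });
    return finalUrl;
  } finally {
    await browser.close();
  }
}
```

Calling `openCompanyPage("https://www.linkedin.com/company/google/")` would then reject with a descriptive error as soon as the redirect happens, rather than timing out on the selector.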
When Microsoft bought LinkedIn they invested billions into the purchase. They also started to act, quite soon they battled scraping. Companies like the now famous, due to it’s court battle, “HiQ Labs” use the LinkedIn data to make a huge profit.
Now LinkedIn had the problem that public scraping is not a legal offense, they failed (like all other websites) t[o] prevent well developed public scraping.
So LinkedIn added and strengthened a feature called “Authwall”, that is a very sensitive scraping detection. It allows rarely any public views from non authorized accounts making scraping without account impossible.
Scraping with accounts is a legal offense and it’s a lot more difficult as accounts need to be maintained. This is when HiQ Labs and all other scraping companies went out of business. HiQ saw millions of profit going down the sink, they battled LinkedIn at court.
The only company left scraping them is “scraping.services“, it will stay interesting what is going to happen during the next years.
I am sure the fact that the whole ex-Puppeteer team now works at Microsoft will not make it any easier to deceive the AuthWall either (note that even puppeteer-extra-plugin-stealth is prevented from visiting the page).
The only way to visit LinkedIn pages reliably is to log in through the form (or to use a Chrome profile that is already logged in and has valid session cookies).
Update: Scraping with an existing account violates LinkedIn's User Agreement, so it is not advisable. My solution above applies only to one-time visits (which is not a realistic scenario anyway). So the final answer is: it is not possible to visit these profiles with Puppeteer.
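For the Chrome profile variant, Puppeteer's `userDataDir` launch option points the browser at an existing profile directory, so any session cookies stored there are reused. This is a minimal sketch only: `buildLaunchOptions`, `openWithProfile`, and the `./my-chrome-profile` path are hypothetical, and whether the stored session is still accepted is up to the site.

```javascript
const path = require("path");

// Hypothetical helper: builds launch options that reuse an existing profile directory.
function buildLaunchOptions(profileDir) {
  return {
    headless: false,
    // userDataDir makes the browser load cookies and storage from this profile.
    userDataDir: path.resolve(profileDir),
  };
}

// Hypothetical helper: opens a URL with the given profile and returns the final URL.
async function openWithProfile(url, profileDir) {
  const puppeteer = require("puppeteer"); // required lazily
  const browser = await puppeteer.launch(buildLaunchOptions(profileDir));
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "domcontentloaded" });
    return page.url();
  } finally {
    await browser.close();
  }
}

// Example call (the profile path is a placeholder):
// openWithProfile("https://www.linkedin.com/company/google/", "./my-chrome-profile");
```

The profile directory should not be in use by another running browser instance at the same time, since Chrome locks the profile while it is open.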