
Selenium and non-headless browser keeps asking for Captcha


I ran into an issue where one of our sites kept asking for a captcha when running in headless mode in a browser in the cloud. I switched to non-headless mode so I could enter the captcha myself, expecting later runs to work, perhaps because some cookies would already be stored. They didn't, even though I entered the captcha several times.

It's also worth mentioning that it runs fine locally in either mode, and it also runs fine in the cloud when the browser is not automated. But as soon as I run it there with Selenium, in either mode, it keeps asking for the captcha. Any ideas about what might be happening, and about possible solutions, are greatly appreciated.


Solution

  • In the discussion entitled How does recaptcha 3 know I'm using selenium/chromedriver we discussed some generic approaches to avoid getting detected while web scraping. Let's dive deeper.


    Headless Browser

    A headless browser is a browser that can be used without a graphical interface. It can be controlled programmatically to automate tasks, such as running tests or taking screenshots of web pages.


    Why detect headless browsers?

    As per @AntoineVastel, headless browsers are used to automate malicious tasks. The most common cases are web scraping, increasing advertisement impressions, and looking for vulnerabilities on a website.

    Until a year ago, one of the most popular headless browsers was PhantomJS. Because it is built on the Qt framework, it differs in many ways from the most popular browsers, and it could be detected with browser fingerprinting techniques. Starting with version 59, Google released a headless version of its Chrome browser. Unlike PhantomJS, it is based on vanilla Chrome rather than an external framework, which makes its presence more difficult to detect. Even so, there are likely other ways to detect headless Chrome.
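As a small illustration of what a fingerprinting check can look like, the sketch below tests a user-agent string for the `HeadlessChrome` token, which headless Chrome has historically included in its default user agent. The function name and the sample strings are hypothetical and not taken from the discussion above, and a real detection script would likely combine many more signals.

```python
def looks_like_headless_chrome(user_agent: str) -> bool:
    """Return True if the user-agent string carries the HeadlessChrome token.

    Assumption: the browser is using its default user agent. A client that
    overrides the user agent will not be caught by this check.
    """
    return "headlesschrome" in user_agent.lower()


# Sample strings for illustration only.
HEADLESS_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/59.0.3071.115 Safari/537.36"
)
REGULAR_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"
)
```

A check like this only covers headless mode, so on its own it would not explain a captcha that also appears when the browser runs with a visible window.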


    Detecting Chrome Headless

    These are some of the key reasons why headless browsers are more prone to detection.
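To see what a site might observe from an automated session, a minimal sketch like the following could be used. It assumes a Selenium WebDriver session is already open and reads two properties from the page: `navigator.webdriver`, which the W3C WebDriver specification defines as a flag for browsers under automation, and the plugin count, which has been reported as zero for older headless Chrome builds. The helper names and the choice of properties are assumptions made for illustration, not a list of what any particular captcha provider checks.

```python
from typing import Any, Dict, List

# JavaScript evaluated in the page. The selection of properties is an
# assumption for illustration; real detection scripts may look at others.
FINGERPRINT_JS = """
return {
    webdriver: navigator.webdriver === true,
    pluginCount: navigator.plugins.length
};
"""


def collect_fingerprint(driver: Any) -> Dict[str, Any]:
    """Read the properties from a live Selenium WebDriver session."""
    return driver.execute_script(FINGERPRINT_JS)


def automation_signals(fingerprint: Dict[str, Any]) -> List[str]:
    """List the signals in a fingerprint that hint at automation."""
    signals = []
    if fingerprint.get("webdriver"):
        signals.append("navigator.webdriver is true")
    if fingerprint.get("pluginCount") == 0:
        signals.append("no browser plugins reported")
    return signals
```

With a real session this would be called as `automation_signals(collect_fingerprint(driver))` after loading a page. Because `navigator.webdriver` is tied to automation rather than to headless mode, a Chrome window driven by chromedriver would be expected to report it by default even when it is not headless, which would be consistent with the captcha appearing in both modes.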


    Outro