python selenium imperva

How do I avoid Imperva bot detection?


I am running a Python script that scrapes a website. The site uses Imperva to detect automated scripts crawling its pages, and Imperva blocks my IP as soon as I run the script. I read a suggestion to add time.sleep(random.randint(a, b)) to the script to mimic human behaviour, but it didn't work, or perhaps it just doesn't work as a standalone method. If it's the Chrome driver itself that they detect, then I guess it would be impossible to avoid. Does anyone have practical suggestions for things I could include in my script to bypass this? Thanks in advance.


Solution

    Introduction

    Several different components need to be added to a web scraper to make it undetectable. I recommend using the code below to test your current level of detection:

    driver.get("https://bot.sannysoft.com/")
    

    More than likely, you will fail most of those tests right off the bat. Fortunately, it's easy to configure a scraper that passes all of them.
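
To compare results before and after changing the configuration, it can help to record each run of the test page. The sketch below is one possible way to do that, not part of any library: the function name `record_detection_check`, the screenshot file name, and the settle delay are all hypothetical choices. It loads the test page, waits briefly for the page's checks to render, reads `navigator.webdriver` (one signal that detection pages commonly inspect), and saves a screenshot.

```python
import time

TEST_PAGE = "https://bot.sannysoft.com/"


def record_detection_check(driver, screenshot_path="detection_check.png",
                           settle_seconds=2.0):
    """Load the test page, then return the webdriver flag and screenshot status.

    `driver` is any Selenium WebDriver instance (hypothetical helper, the
    name and defaults are assumptions made for this example).
    """
    driver.get(TEST_PAGE)
    # Give the page's JavaScript checks a moment to finish rendering.
    time.sleep(settle_seconds)
    # Typically true in an unmodified Selenium-driven Chrome session.
    webdriver_flag = driver.execute_script("return navigator.webdriver")
    # save_screenshot returns True when the file was written.
    saved = driver.save_screenshot(screenshot_path)
    return {"webdriver_flag": webdriver_flag, "screenshot_saved": saved}
```

Called as `record_detection_check(driver)` with the same `driver` object used above, it returns a small dictionary that can be printed or logged alongside the screenshot.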

    Selenium-Stealth

    selenium-stealth is a Python package that is used to avoid detection. Simply...

    pip install selenium-stealth
    

    and follow the below configuration:

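    from selenium import webdriver
    from selenium_stealth import stealth

    # Start Chrome first; stealth() patches an existing driver instance.
    driver = webdriver.Chrome()

    stealth(driver,
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.53 Safari/537.36',
            languages=["en-US", "en"],
            vendor="Google Inc.",
            platform="Win32",
            webgl_vendor="Intel Inc.",
            renderer="Intel Iris OpenGL Engine",
            fix_hairline=True,
            )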
    

    Your web scraper should now pass all of the tests. Next, try this solution on the Imperva site.

    More information

    If you are still getting blocked, I recommend looking into the random-user-agent library to rotate the user agent passed as the "user_agent" argument of the selenium-stealth configuration. Otherwise, you could pay for a proxy provider to disguise your IP. Keep in mind, though, that proxy networks do not currently offer a ready-made Selenium configuration.
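
The two ideas in the paragraph above can be sketched roughly as follows. This is an illustration under stated assumptions rather than a recommended setup: the hand-written `USER_AGENTS` pool stands in for whatever the random-user-agent library would supply, the helper names `pick_user_agent` and `proxy_argument` are hypothetical, and the proxy address is a placeholder from a documentation-only IP range. The only browser feature relied on is Chrome's `--proxy-server` command-line switch.

```python
import random

# Hypothetical pool; the random-user-agent library could supply these instead.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/83.0.4103.53 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/83.0.4103.53 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/83.0.4103.53 Safari/537.36",
]


def pick_user_agent(pool=USER_AGENTS, rng=random):
    """Return one user agent string chosen at random from the pool."""
    return rng.choice(pool)


def proxy_argument(address):
    """Build the Chrome switch that routes traffic through host:port."""
    return f"--proxy-server={address}"


# Usage (assumes selenium, selenium-stealth, and a local Chrome install):
#   options = webdriver.ChromeOptions()
#   options.add_argument(proxy_argument("203.0.113.10:8080"))  # placeholder
#   driver = webdriver.Chrome(options=options)
#   stealth(driver, user_agent=pick_user_agent(), platform="Win32")
```

Picking the user agent once per browser session, rather than per request, keeps the session internally consistent. As far as I know, the `--proxy-server` switch does not carry a username or password, so proxies that require authentication would need some additional setup.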

    Information on Proxy Network Selenium Configuration: Python Selenium Proxy Network

    Information on Selenium Detectability in the Cloud: Python Selenium AWS Lambda Change WebGL Vendor/Renderer For Undetectable Headless Scraper