Tags: amazon-web-services, selenium-webdriver, amazon-ec2, webdriver-manager

Selenium is randomly killed on AWS EC2


I am working on a Streamlit app that runs properly locally. When I start the web scraping process with multiple threads, however, the website freezes and the process is killed. The console logs show that the links are being scraped, so I am not sure what is causing the issue. Does anyone have an idea why this is happening?

2023-02-20 19:50:11.308 Get LATEST chromedriver version for google-chrome 110.0.5481
2023-02-20 19:50:11.310 Driver [/home/ubuntu/.wdm/drivers/chromedriver/linux64/110.0.5481/chromedriver] found in cache


Multithreading function:

    import threading

    # Start one thread per link, then wait for all of them to finish.
    threads = []
    for i in links:
        t = threading.Thread(target=get_links, args=(i, resumeContent))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()

Solution

  • You are passing the driver path as the first positional argument, which Selenium 4 treats as executable_path. That argument has been deprecated; you have to pass in a Service object instead.

    So effectively, instead of:

    driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
    

    you need to pass:

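    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    options = Options()  # add your Chrome arguments here
    # Wrap the driver path in a Service object instead of passing it positionally.
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)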
    

    Once the matching ChromeDriver has been downloaded, webdriver-manager reuses it from its cache, as the "found in cache" line in your log shows.


    Update

    As per your question update: because you start one thread per link, the number of threads you spawn will always be a concern when you have only 2 GB of memory, all the more so if each thread launches its own Chrome instance.
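One way to keep the thread count under control is to replace the one-thread-per-link loop with a bounded pool. The following is only a minimal sketch, not a tested fix for your app: `scrape_all` and `MAX_WORKERS` are hypothetical names, and the cap of 2 is an assumption that you would need to tune to the memory actually available on the instance.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumed cap on concurrent workers; tune it to the instance's memory.
MAX_WORKERS = 2


def scrape_all(links, worker, max_workers=MAX_WORKERS):
    """Call worker(link) for every link, running at most max_workers at once.

    Results are returned in the same order as links.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() consumes the iterator, so any exception raised
        # inside a worker is re-raised here instead of being lost.
        return list(pool.map(worker, links))
```

With the names from the question, the call would look like `scrape_all(links, lambda link: get_links(link, resumeContent))`. The remaining links simply wait in the pool's queue until a worker is free, so the number of threads running at the same time never exceeds the cap.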