I am working on a Streamlit app that runs properly locally, but when I initiate the web scraping process with multiple threads, the website freezes and the process is killed. The logs in the console indicate the links are being scraped, so I am not sure what is causing the issue. Does anyone have any ideas as to why this is happening?
2023-02-20 19:50:11.308 Get LATEST chromedriver version for google-chrome 110.0.5481
2023-02-20 19:50:11.310 Driver [/home/ubuntu/.wdm/drivers/chromedriver/linux64/110.0.5481/chromedriver] found in cache
multithreading func:
import threading

threads = []
for i in links:
    # one thread per link; each runs get_links(i, resumeContent)
    t = threading.Thread(target=get_links, args=(i, resumeContent))
    threads.append(t)
    t.start()
for t in threads:
    t.join()
You are using Selenium 4, where executable_path has been deprecated, so you have to pass in a Service object instead.
So effectively, instead of:
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
you need to use:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
This way, once the matching ChromeDriver has been downloaded, it is reused from the local cache on subsequent runs.
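A minimal sketch of that idea (the make_driver helper and the headless Chrome flags below are assumptions, not from your code): call ChromeDriverManager().install() once, keep the returned path, and build a Service from it every time you create a driver, so the download check happens only once.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Resolve (and cache) the matching ChromeDriver once, up front.
driver_path = ChromeDriverManager().install()

def make_driver():
    # Assumed options for running Chrome on a small headless server.
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    return webdriver.Chrome(service=Service(driver_path), options=options)

driver = make_driver()
try:
    driver.get("https://example.com")
finally:
    driver.quit()  # always quit so Chrome's memory is released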
As per your question update, since you have implemented threading, the number of threads you try to spawn will always be a concern given that you have only 2 GB of memory. Each thread that drives its own Chrome instance consumes a substantial amount of memory, so starting one thread per link can exhaust the available RAM and get the process killed by the operating system.
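One way to keep that under control (a sketch that reuses the get_links(i, resumeContent) signature from your snippet) is to replace the unbounded thread-per-link loop with a small fixed-size pool, e.g. concurrent.futures.ThreadPoolExecutor, so only a couple of Chrome instances are alive at any one time.

from concurrent.futures import ThreadPoolExecutor, as_completed

# With max_workers=2, at most two scraping threads (and thus two Chrome
# instances) run at the same time, which keeps peak memory usage low.
with ThreadPoolExecutor(max_workers=2) as executor:
    futures = [executor.submit(get_links, link, resumeContent) for link in links]
    for future in as_completed(futures):
        future.result()  # surface any exception raised inside a worker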