I have a classic "it works on my machine" problem: a web scraper that runs successfully on my laptop fails with a persistent error whenever I try to run it in a container.
My minimal reproducible dockerized example consists of the following files:
requirements.txt:
selenium==4.23.1 # 4.23.1
pandas==2.2.2
pandas-gbq==0.22.0
tqdm==4.66.2
Dockerfile:
FROM selenium/standalone-chrome:latest
# Set the working directory in the container
WORKDIR /usr/src/app
# Copy your application files
COPY . .
# Install Python and pip
USER root
RUN apt-get update && apt-get install -y python3 python3-pip python3-venv
# Create a virtual environment
RUN python3 -m venv /usr/src/app/venv
# Activate the virtual environment and install dependencies
RUN . /usr/src/app/venv/bin/activate && \
pip install --no-cache-dir -r requirements.txt
# Switch back to the selenium user
USER seluser
# Set the entrypoint to activate the venv and run your script
CMD ["/bin/bash", "-c", "source /usr/src/app/venv/bin/activate && python -m scrape_ev_files"]
scrape_ev_files.py (slimmed down to just what's needed to repro error):
import os
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
def init_driver(local_download_path):
    os.makedirs(local_download_path, exist_ok=True)
    # Set Chrome Options
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    chrome_options.add_argument("--remote-debugging-port=9222")
    prefs = {
        "download.default_directory": local_download_path,
        "download.prompt_for_download": False,
        "download.directory_upgrade": True,
        "safebrowsing.enabled": True
    }
    chrome_options.add_experimental_option("prefs", prefs)
    # Set up the driver
    service = Service()
    chrome_options = Options()
    driver = webdriver.Chrome(service=service, options=chrome_options)
    # Set download behavior
    driver.execute_cdp_cmd("Page.setDownloadBehavior", {
        "behavior": "allow",
        "downloadPath": local_download_path
    })
    return driver

if __name__ == "__main__":
    # PARAMS
    ELECTION = '2024 MARCH 5TH DEMOCRATIC PRIMARY'
    ORIGIN_URL = "https://earlyvoting.texas-election.com/Elections/getElectionDetails.do"
    CSV_DL_DIR = "downloaded_files"
    # initialize the driver
    driver = init_driver(local_download_path=CSV_DL_DIR)
shell command to reproduce the error:
docker build -t my_scraper . # (no error)
docker run --rm -t my_scraper # (error)
The stack trace from the error is below. Any help would be much appreciated! I've tried many iterations of my requirements.txt and Dockerfile to fix this, but the error at this spot has been frustratingly persistent:
File "/workspace/scrape_ev_files.py", line 110, in <module>
driver = init_driver(local_download_path=CSV_DL_DIR)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/scrape_ev_files.py", line 47, in init_driver
driver = webdriver.Chrome(service=service, options=chrome_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/.venv/lib/python3.12/site-packages/selenium/webdriver/chrome/webdriver.py", line 45, in __init__
super().__init__(
File "/workspace/.venv/lib/python3.12/site-packages/selenium/webdriver/chromium/webdriver.py", line 66, in __init__
super().__init__(command_executor=executor, options=options)
File "/workspace/.venv/lib/python3.12/site-packages/selenium/webdriver/remote/webdriver.py", line 212, in __init__
self.start_session(capabilities)
File "/workspace/.venv/lib/python3.12/site-packages/selenium/webdriver/remote/webdriver.py", line 299, in start_session
response = self.execute(Command.NEW_SESSION, caps)["value"]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/.venv/lib/python3.12/site-packages/selenium/webdriver/remote/webdriver.py", line 354, in execute
self.error_handler.check_response(response)
File "/workspace/.venv/lib/python3.12/site-packages/selenium/webdriver/remote/errorhandler.py", line 229, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.SessionNotCreatedException: Message: session not created: Chrome failed to start: exited normally.
(session not created: DevToolsActivePort file doesn't exist)
(The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
You overwrite the chrome_options variable just before passing it to webdriver.Chrome(), so none of the options you configured are actually applied. In particular, --disable-dev-shm-usage is missing, and that is the option that solves this issue.
Just remove the second chrome_options = Options() line, the one right before the driver initialization.
As a side note, consider using --headless=new instead of --headless. It gives functionality closer to regular Chrome, and --headless will be deprecated in future versions.
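A minimal sketch of the corrected function, assuming the rest of the script stays as posted. The names CHROME_ARGUMENTS and build_download_prefs are hypothetical helpers introduced here for illustration; the flags and preferences are the ones from the question, with --headless=new swapped in. The key point is that the Options object is created once and never reassigned before it reaches webdriver.Chrome().

```python
import os

# Flags from the question, with --headless=new in place of --headless.
CHROME_ARGUMENTS = [
    "--headless=new",
    "--no-sandbox",
    "--disable-dev-shm-usage",
    "--remote-debugging-port=9222",
]


def build_download_prefs(download_dir):
    """Return the Chrome download preferences used in the question."""
    return {
        "download.default_directory": download_dir,
        "download.prompt_for_download": False,
        "download.directory_upgrade": True,
        "safebrowsing.enabled": True,
    }


def init_driver(local_download_path):
    # Imported here so the helpers above can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    os.makedirs(local_download_path, exist_ok=True)

    # Create the options object once and do not reassign it afterwards.
    chrome_options = Options()
    for argument in CHROME_ARGUMENTS:
        chrome_options.add_argument(argument)
    chrome_options.add_experimental_option(
        "prefs", build_download_prefs(local_download_path)
    )

    driver = webdriver.Chrome(service=Service(), options=chrome_options)

    # Allow downloads into the same directory via the DevTools protocol.
    driver.execute_cdp_cmd(
        "Page.setDownloadBehavior",
        {"behavior": "allow", "downloadPath": local_download_path},
    )
    return driver
```

Calling init_driver("downloaded_files") as in the question's main block should then start Chrome with all four flags applied.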
Edit
The image you are using turns off Selenium Manager, so you get this warning. You can turn it back on by adding ENV SE_OFFLINE=false to the Dockerfile.
The driver initialization sometimes hangs and raises TimeoutException: Message: timeout: Timed out receiving message from renderer: 600.000. This is probably due to too many JS commands. Add these options:
chrome_options.add_argument('--dns-prefetch-disable')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--enable-cdp-events')
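If the option list keeps growing, one way to keep it tidy is a small helper that appends only the flags that are not already set. This is just a sketch: EXTRA_ARGUMENTS and add_missing_arguments are hypothetical names, and the helper relies only on the arguments property and add_argument() method of Selenium's Options class.

```python
# The three extra flags suggested above.
EXTRA_ARGUMENTS = [
    "--dns-prefetch-disable",
    "--disable-gpu",
    "--enable-cdp-events",
]


def add_missing_arguments(options, arguments=EXTRA_ARGUMENTS):
    """Append each flag to `options` unless it is already present."""
    for argument in arguments:
        if argument not in options.arguments:
            options.add_argument(argument)
    return options
```

Call add_missing_arguments(chrome_options) after building chrome_options and before passing it to webdriver.Chrome().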