amazon-web-services selenium-webdriver web-scraping aws-lambda

AWS Lambda web scraping through a Docker image


I'm learning AWS Lambda and I'm trying to implement a web scraping program. I created my Lambda function from a container image that I built with Docker. My project folder has three files:

  1. Dockerfile
FROM public.ecr.aws/lambda/python:3.13

# Copy requirements.txt
COPY requirements.txt ${LAMBDA_TASK_ROOT}

# Install the specified packages
RUN pip install -r requirements.txt


#Install linux chrome and chromedriver
RUN dnf install -y unzip && \
    curl -Lo "/tmp/chromedriver-linux64.zip" "https://storage.googleapis.com/chrome-for-testing-public/132.0.6834.159/linux64/chromedriver-linux64.zip" && \
    curl -Lo "/tmp/chrome-linux64.zip" "https://storage.googleapis.com/chrome-for-testing-public/132.0.6834.159/linux64/chrome-linux64.zip" && \
    unzip /tmp/chromedriver-linux64.zip -d /opt/ && \
    unzip /tmp/chrome-linux64.zip -d /opt/

RUN dnf install -y atk cups-libs gtk3 libXcomposite alsa-lib \
    libXcursor libXdamage libXext libXi libXrandr libXScrnSaver \
    libXtst pango at-spi2-atk libXt xorg-x11-server-Xvfb \
    xorg-x11-xauth dbus-glib dbus-glib-devel nss mesa-libgbm


# Copy function code
COPY lambda_function.py ${LAMBDA_TASK_ROOT}

# Set the CMD to your handler (could also be done as a parameter override outside of the Dockerfile)
CMD [ "lambda_function.handler" ]
  2. lambda_function.py
import sys
import json
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options


def handler(event,context):
    print("Hello welt")

    PATH = "/opt/chrome-linux64/chrome"

    chrome_options = Options()
    chrome_options.add_argument('--headless')  # Headless mode
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    chrome_options.add_argument("--single-process")
    chrome_options.binary_location = "opt/chromedriver-linux64/chromedriver"

    cService = webdriver.ChromeService(executable_path= PATH)

    chrome = webdriver.Chrome(service=cService, options=chrome_options)
    chrome.get("https://cloudbytes.dev/")
    description = chrome.find_element(By.NAME, "description").get_attribute("content")
    print(description)


    return {
        "statusCode": 200,
        "body": json.dumps(
            {
                "message": description,
            }
        ),
    }
  3. requirements.txt
selenium==4.27.1
boto3

I pushed my image to an ECR repository following the push command instructions, with no problems. I then created the AWS Lambda function from that container image in ECR and configured it with 512 MB of memory and a 1.5 min timeout.

When I run a test, after a few seconds I get the following error (full logs provided):

START RequestId: 85935e3f-c6c9-40f3-ae83-67dec9372e80 Version: $LATEST
Hello welt
LAMBDA_WARNING: Unhandled exception. The most likely cause is an issue in the function code. However, in rare cases, a Lambda runtime update can cause unexpected function behavior. For functions using managed runtimes, runtime updates can be triggered by a function change, or can be applied automatically. To determine if the runtime has been updated, check the runtime version in the INIT_START log entry. If this error correlates with a change in the runtime version, you may be able to mitigate this error by temporarily rolling back to the previous runtime version. For more information, see https://docs.aws.amazon.com/lambda/latest/dg/runtimes-update.html
[ERROR] WebDriverException: Message: Service /opt/chrome-linux64/chrome unexpectedly exited. Status code was: -5
{
  "errorMessage": "Message: Service /opt/chrome-linux64/chrome unexpectedly exited. Status code was: -5\n",
  "errorType": "WebDriverException",
  "requestId": "308ce479-715f-4f46-b2f2-4cd1300c4b6f",
  "stackTrace": [
    "  File \"/var/task/lambda_function.py\", line 23, in handler\n    chrome = webdriver.Chrome(service=cService, options=chrome_options)\n",
    "  File \"/var/lang/lib/python3.13/site-packages/selenium/webdriver/chrome/webdriver.py\", line 45, in __init__\n    super().__init__(\n",
    "  File \"/var/lang/lib/python3.13/site-packages/selenium/webdriver/chromium/webdriver.py\", line 55, in __init__\n    self.service.start()\n",
    "  File \"/var/lang/lib/python3.13/site-packages/selenium/webdriver/common/service.py\", line 108, in start\n    self.assert_process_still_running()\n",
    "  File \"/var/lang/lib/python3.13/site-packages/selenium/webdriver/common/service.py\", line 121, in assert_process_still_running\n    raise WebDriverException(f\"Service {self._path} unexpectedly exited. Status code was: {return_code}\")\n"
  ]
}
END RequestId: 85935e3f-c6c9-40f3-ae83-67dec9372e80
REPORT RequestId: 85935e3f-c6c9-40f3-ae83-67dec9372e80    Duration: 4452.94 ms    Billed Duration: 4790 ms    Memory Size: 512 MB Max Memory Used: 222 MB Init Duration: 336.97 ms

I don't understand where it is going wrong. Any help is welcome.


Solution

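  • The driver path and the browser binary path are swapped. The service's executable_path must point to chromedriver, and binary_location must point to the Chrome browser. In the question the service is given the Chrome browser itself, which is why the error reads "Service /opt/chrome-linux64/chrome unexpectedly exited". The binary_location value is also missing its leading slash, so it is a relative path. The driver path should be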

    PATH = "/opt/chromedriver-linux64/chromedriver"
    

    and the binary location path should be

    chrome_options.binary_location = "/opt/chrome-linux64/chrome"
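
Putting both corrections together, the driver setup might look like the sketch below. It assumes the install paths produced by the Dockerfile in the question. The names check_binaries and build_driver are hypothetical helpers, not part of Selenium, and the pre-flight check is an optional addition that uses only the standard library.

```python
import os

# Paths as unpacked by the Dockerfile in the question (assumed unchanged).
CHROMEDRIVER_PATH = "/opt/chromedriver-linux64/chromedriver"
CHROME_BINARY_PATH = "/opt/chrome-linux64/chrome"


def check_binaries(driver_path=CHROMEDRIVER_PATH, browser_path=CHROME_BINARY_PATH):
    """Hypothetical pre-flight check: return a list of problems, empty if none."""
    problems = []
    for label, path in (("chromedriver", driver_path), ("chrome", browser_path)):
        if not os.path.isabs(path):
            problems.append(f"{label}: path is not absolute: {path}")
        elif not os.path.isfile(path):
            problems.append(f"{label}: file not found: {path}")
        elif not os.access(path, os.X_OK):
            problems.append(f"{label}: file is not executable: {path}")
    return problems


def build_driver():
    """Hypothetical helper: start Chrome with the two paths in the right places."""
    # Imported here so the path check above can run without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--single-process")
    # The browser binary goes into the options ...
    options.binary_location = CHROME_BINARY_PATH
    # ... and the driver executable goes into the service.
    service = Service(executable_path=CHROMEDRIVER_PATH)
    return webdriver.Chrome(service=service, options=options)
```

Inside the handler, one could call check_binaries() first and raise an error if it returns anything, then replace the original setup lines with chrome = build_driver().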