amazon-emr, emr-serverless

How to run a Python project (package) on AWS EMR serverless?


I have a Python project with several modules, classes, and a dependencies file (requirements.txt). I want to package it into one file with all the dependencies and give the file path to AWS EMR Serverless, which will run it.

The problem is that I don't understand how to package a Python project with all its dependencies, which file EMR can consume, etc. All the examples I have found use a single Python file.

In simple words, what should I do if my Python project is not a single file but is more complex?


Solution

  • There are a few ways to do this with EMR Serverless. Regardless of which one you choose, you will need to provide a main entrypoint Python script to the EMR Serverless StartJobRun command.

    Let's assume you've got a job structure like this, where main.py is your entrypoint that creates a Spark session and runs your jobs, and job1 and job2 are your local modules.

    ├── jobs
    │   ├── job1.py
    │   └── job2.py
    ├── main.py
    └── requirements.txt
    
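    For reference, a minimal main.py for this layout could look something like the sketch below. The run(spark) functions are hypothetical placeholders for whatever your modules actually expose, and the jobs directory should contain an __init__.py so it imports cleanly as a package.

    from pyspark.sql import SparkSession

    from jobs import job1, job2


    def main():
        # One Spark session for the whole run; EMR Serverless supplies the cluster configuration.
        spark = SparkSession.builder.appName("my-spark-job").getOrCreate()

        # Hypothetical entrypoints exposed by your local modules.
        job1.run(spark)
        job2.run(spark)

        spark.stop()


    if __name__ == "__main__":
        main()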

    Option 1. Use --py-files with your zipped local modules and --archives with a packaged virtual environment for your external dependencies

    zip -r job_files.zip jobs
    

    Note: The packaged virtual environment has to be built on a similar OS and Python version as EMR Serverless, so I prefer using a multi-stage Dockerfile with custom outputs.

    FROM --platform=linux/amd64 amazonlinux:2 AS base
    
    RUN yum install -y python3
    
    # Build the virtual environment on the same OS and architecture as EMR Serverless
    ENV VIRTUAL_ENV=/opt/venv
    RUN python3 -m venv $VIRTUAL_ENV
    ENV PATH="$VIRTUAL_ENV/bin:$PATH"
    
    COPY requirements.txt .
    
    RUN python3 -m pip install --upgrade pip && \
        python3 -m pip install venv-pack==0.2.0 && \
        python3 -m pip install -r requirements.txt
    
    # Package the virtual environment into a tarball that Spark unpacks via spark.archives
    RUN mkdir /output && venv-pack -o /output/pyspark_deps.tar.gz
    
    # Export only the tarball to the local filesystem via BuildKit's custom output
    FROM scratch AS export
    COPY --from=base /output/pyspark_deps.tar.gz /
    

    If you run DOCKER_BUILDKIT=1 docker build --output . ., you should now have a pyspark_deps.tar.gz file on your local system. Upload main.py, job_files.zip, and pyspark_deps.tar.gz to S3, then start your job:

    aws emr-serverless start-job-run \
        --application-id $APPLICATION_ID \
        --execution-role-arn $JOB_ROLE_ARN \
        --job-driver '{
            "sparkSubmit": {
                "entryPoint": "s3://<YOUR_BUCKET>/main.py",
                "sparkSubmitParameters": "--py-files s3://<YOUR_BUCKET>/job_files.zip --conf spark.archives=s3://<YOUR_BUCKET>/pyspark_deps.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
            }
        }'
    
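    If you would rather submit the job from Python than from the CLI, the same request can be made with boto3 (a sketch using the same placeholders; the EMR Serverless client's start_job_run mirrors the CLI's start-job-run):

    import boto3

    client = boto3.client("emr-serverless")

    # Same job driver as the CLI example above, expressed through the boto3 API.
    response = client.start_job_run(
        applicationId="<APPLICATION_ID>",
        executionRoleArn="<JOB_ROLE_ARN>",
        jobDriver={
            "sparkSubmit": {
                "entryPoint": "s3://<YOUR_BUCKET>/main.py",
                "sparkSubmitParameters": (
                    "--py-files s3://<YOUR_BUCKET>/job_files.zip "
                    "--conf spark.archives=s3://<YOUR_BUCKET>/pyspark_deps.tar.gz#environment "
                    "--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python "
                    "--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python "
                    "--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
                ),
            }
        },
    )
    print(response["jobRunId"])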

    Option 2. Package your local modules as a Python library and use --archives with a packaged virtual environment

    This is probably the most reliable way, but it requires you to use setuptools. You can use a simple pyproject.toml file along with your existing requirements.txt:

    [build-system]
    requires = ["setuptools"]
    build-backend = "setuptools.build_meta"
    
    [project]
    name = "mysparkjobs"
    version = "0.0.1"
    dynamic = ["dependencies"]
    
    [tool.setuptools.dynamic]
    dependencies = {file = ["requirements.txt"]}
    

    You can then use a multi-stage Dockerfile and custom build outputs to package your modules and dependencies into a virtual environment.

    Note: This requires Docker BuildKit to be enabled.

    FROM --platform=linux/amd64 amazonlinux:2 AS base
    
    RUN yum install -y python3
    
    ENV VIRTUAL_ENV=/opt/venv
    RUN python3 -m venv $VIRTUAL_ENV
    ENV PATH="$VIRTUAL_ENV/bin:$PATH"
    
    WORKDIR /app
    COPY . .
    RUN python3 -m pip install --upgrade pip && \
        python3 -m pip install venv-pack==0.2.0 && \
        python3 -m pip install .
    
    RUN mkdir /output && venv-pack -o /output/pyspark_deps.tar.gz
    
    FROM scratch AS export
    COPY --from=base /output/pyspark_deps.tar.gz /
    

    Now you can run DOCKER_BUILDKIT=1 docker build --output . . and a pyspark_deps.tar.gz file will be generated with your local modules and all your dependencies. Upload this file and your main.py script to S3.
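    For example, with boto3 (a quick sketch; aws s3 cp from the CLI works just as well, and the bucket and prefix below are placeholders):

    import boto3

    s3 = boto3.client("s3")
    bucket = "<YOUR_BUCKET>"
    prefix = "code/pyspark/myjob"

    # upload_file(local_path, bucket, key)
    s3.upload_file("main.py", bucket, f"{prefix}/main.py")
    s3.upload_file("pyspark_deps.tar.gz", bucket, f"{prefix}/pyspark_deps.tar.gz")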

    Assuming you uploaded both files to s3://<YOUR_BUCKET>/code/pyspark/myjob/, run the EMR Serverless job like this (replacing APPLICATION_ID, JOB_ROLE_ARN, and YOUR_BUCKET with your own values):

    aws emr-serverless start-job-run \
        --application-id <APPLICATION_ID> \
        --execution-role-arn <JOB_ROLE_ARN> \
        --job-driver '{
            "sparkSubmit": {
                "entryPoint": "s3://<YOUR_BUCKET>/code/pyspark/myjob/main.py",
                "sparkSubmitParameters": "--conf spark.archives=s3://<YOUR_BUCKET>/code/pyspark/myjob/pyspark_deps.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
            }
        }'
    

    Note the additional sparkSubmitParameters that specify your dependencies and configure the driver and executor environment variables for the proper paths to python.
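    Once a job is submitted (with either option), you can poll its state until it finishes. A boto3 sketch, assuming the terminal state names SUCCESS/FAILED/CANCELLED used by EMR Serverless job runs:

    import time

    import boto3

    client = boto3.client("emr-serverless")

    # jobRunId comes from the start-job-run / start_job_run response.
    job_run_id = "<JOB_RUN_ID>"

    while True:
        state = client.get_job_run(
            applicationId="<APPLICATION_ID>", jobRunId=job_run_id
        )["jobRun"]["state"]
        print(state)
        if state in ("SUCCESS", "FAILED", "CANCELLED"):
            break
        time.sleep(30)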