I have a Python project with several modules, classes, and a dependencies file (a requirements.txt). I want to package it into one file with all the dependencies and give that file's path to AWS EMR Serverless, which will run it.
The problem is that I don't understand how to package a Python project with all its dependencies, which file EMR can consume, etc. All the examples I have found use a single Python file.
In simple words: what should I do if my Python project is not a single file but is more complex?
There are a few ways to do this with EMR Serverless. Regardless of which one you choose, you will need to provide a main entrypoint Python script to the EMR Serverless StartJobRun command.
Let's assume you've got a job structure like this, where main.py is your entrypoint that creates a Spark session and runs your jobs, and job1 and job2 are your local modules (a minimal sketch of main.py follows the tree):
├── jobs
│   ├── job1.py
│   └── job2.py
├── main.py
└── requirements.txt
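For reference, a minimal main.py might look something like the sketch below. The run(spark) functions are hypothetical placeholders for however your own modules expose their work, and an empty jobs/__init__.py is a safe way to make sure the package imports cleanly once it is zipped.
from pyspark.sql import SparkSession
# Local modules from the jobs/ directory; run(spark) is a hypothetical
# entry function; structure your own modules however you like.
from jobs import job1, job2
if __name__ == "__main__":
    spark = SparkSession.builder.appName("mysparkjobs").getOrCreate()
    job1.run(spark)
    job2.run(spark)
    spark.stop()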
Option 1: --py-files with your zipped local modules and --archives with a packaged virtual environment for your external dependencies
First, zip up your local modules:
zip -r job_files.zip jobs
Next, package a virtual environment with your external dependencies using venv-pack.
Note: This has to be done on a similar OS and Python version as EMR Serverless, so I prefer using a multi-stage Dockerfile with custom outputs.
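# Build on Amazon Linux 2 (x86_64) so the packed environment matches the OS and Python version that EMR Serverless runs on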
FROM --platform=linux/amd64 amazonlinux:2 AS base
RUN yum install -y python3
ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
COPY requirements.txt .
RUN python3 -m pip install --upgrade pip && \
python3 -m pip install venv-pack==0.2.0 && \
python3 -m pip install -r requirements.txt
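# Package the virtual environment into a relocatable archive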
RUN mkdir /output && venv-pack -o /output/pyspark_deps.tar.gz
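# Export stage: copy only the archive out, so BuildKit's custom output writes just pyspark_deps.tar.gz to your local directory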
FROM scratch AS export
COPY --from=base /output/pyspark_deps.tar.gz /
If you run DOCKER_BUILDKIT=1 docker build --output . ., you should now have a pyspark_deps.tar.gz file on your local system.
Upload main.py, job_files.zip, and pyspark_deps.tar.gz to a location on S3.
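If you prefer to script the upload rather than use the AWS CLI or console, a minimal boto3 sketch could look like the following; the bucket name and key prefix are placeholders.
import boto3
s3 = boto3.client("s3")
bucket = "YOUR_BUCKET"    # placeholder
prefix = "code/pyspark/"  # placeholder key prefix
# Upload the entrypoint, the zipped modules, and the packed virtual environment
for name in ["main.py", "job_files.zip", "pyspark_deps.tar.gz"]:
    s3.upload_file(name, bucket, prefix + name)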
Run your EMR Serverless job with a command like this (replacing APPLICATION_ID, JOB_ROLE_ARN, and YOUR_BUCKET):
aws emr-serverless start-job-run \
    --application-id $APPLICATION_ID \
    --execution-role-arn $JOB_ROLE_ARN \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://<YOUR_BUCKET>/main.py",
            "sparkSubmitParameters": "--py-files s3://<YOUR_BUCKET>/job_files.zip --conf spark.archives=s3://<YOUR_BUCKET>/pyspark_deps.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
        }
    }'
Option 2: --archives with a packaged virtual environment
This is probably the most reliable way, but it requires you to use setuptools. You can use a simple pyproject.toml file along with your existing requirements.txt:
[project]
name = "mysparkjobs"
version = "0.0.1"
dynamic = ["dependencies"]
[tool.setuptools.dynamic]
dependencies = {file = ["requirements.txt"]}
You can then use a multi-stage Dockerfile with custom build outputs to package your modules and dependencies into a virtual environment.
Note: This requires Docker BuildKit to be enabled.
FROM --platform=linux/amd64 amazonlinux:2 AS base
RUN yum install -y python3
ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
WORKDIR /app
COPY . .
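# Installing the project itself brings your local modules into the venv and pulls in the requirements.txt dependencies via pyproject.toml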
RUN python3 -m pip install --upgrade pip && \
python3 -m pip install venv-pack==0.2.0 && \
python3 -m pip install .
RUN mkdir /output && venv-pack -o /output/pyspark_deps.tar.gz
FROM scratch AS export
COPY --from=base /output/pyspark_deps.tar.gz /
Now you can run DOCKER_BUILDKIT=1 docker build --output . . and a pyspark_deps.tar.gz file will be generated with all your dependencies. Upload this file and your main.py script to S3.
Assuming you uploaded both files to s3://<YOUR_BUCKET>/code/pyspark/myjob/, run the EMR Serverless job like this (replacing APPLICATION_ID, JOB_ROLE_ARN, and YOUR_BUCKET):
aws emr-serverless start-job-run \
    --application-id <APPLICATION_ID> \
    --execution-role-arn <JOB_ROLE_ARN> \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://<YOUR_BUCKET>/code/pyspark/myjob/main.py",
            "sparkSubmitParameters": "--conf spark.archives=s3://<YOUR_BUCKET>/code/pyspark/myjob/pyspark_deps.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
        }
    }'
Note the additional sparkSubmitParameters that specify your dependencies and configure the driver and executor environment variables with the proper paths to python.
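If you'd rather start the job from Python than from the AWS CLI, a rough boto3 equivalent of the option 2 command is sketched below; the application ID, role ARN, and bucket are placeholders, and the sparkSubmitParameters string is the same one shown above.
import boto3
emr = boto3.client("emr-serverless")
spark_params = (
    "--conf spark.archives=s3://YOUR_BUCKET/code/pyspark/myjob/pyspark_deps.tar.gz#environment "
    "--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python "
    "--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python "
    "--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
)
response = emr.start_job_run(
    applicationId="APPLICATION_ID",    # placeholder
    executionRoleArn="JOB_ROLE_ARN",   # placeholder
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://YOUR_BUCKET/code/pyspark/myjob/main.py",
            "sparkSubmitParameters": spark_params,
        }
    },
)
print(response["jobRunId"])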