I'm trying to run some Spark jobs on EMR Serverless through the AWS CLI, using a virtual environment where I installed some libraries. I followed this guide; the same is here. But when I run the job I get this error:
Job execution failed, please check complete logs in configured logging destination. ExitCode: 1. Last few exceptions: Caused by: java.io.IOException: error=2, No such file or directory Exception in thread "main" java.io.IOException: Cannot run program "./environment/bin/python"
I also tried /home/hadoop/environment/bin/python as the path, but I get the same result.
My job configuration is:
--conf spark.archives=s3://mybucket/dependencies/myenv.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.emr-serverless.executorEnv.PYSPARK_PYTHON=./environment/bin/python
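For reference, I pass these through the AWS CLI roughly like this (the application ID, role ARN, and entry point below are placeholders, not my real values):

aws emr-serverless start-job-run \
    --application-id <application-id> \
    --execution-role-arn arn:aws:iam::<account-id>:role/<execution-role> \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://mybucket/scripts/main.py",
            "sparkSubmitParameters": "--conf spark.archives=s3://mybucket/dependencies/myenv.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.emr-serverless.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
        }
    }'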
If, inside the job, I run
os.listdir("./environment/bin/")
The result is
['python3.9', 'pip', 'pip3.9', 'rst2xetex.py', 'rstpep2html.py', 'f2py3', 'rst2latex.py', 'f2py', 'rst2odt.py', 'rst2html4.py', 'pip3', 'aws', 'python3', 'jp.py', 'rst2odt_prepstyles.py', 'pyrsa-encrypt', 'activate', 'rst2man.py', 'pyrsa-priv2pub', 'python', 'pyrsa-keygen', 'pyrsa-verify', 'rst2html.py', 'aws_completer', 'f2py3.9', 'venv-pack', 'rst2pseudoxml.py', 'aws_bash_completer', 'aws_zsh_completer.sh', 'aws.cmd', 'rst2s5.py', 'rst2xml.py', 'pyrsa-decrypt', 'rst2html5.py', 'Activate.ps1', '__pycache__', 'pyrsa-sign']
So the path should be correct. I also tried setting PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON inside the script:
os.environ['PYSPARK_PYTHON'] = "./environment/bin/python"
os.environ['PYSPARK_DRIVER_PYTHON'] = "./environment/bin/python"
But in this case the error occurs when I import the libraries that I installed in the virtualenv, so the job is still running with the standard Python.
Can you help me?
The problem is probably that you didn't use Amazon Linux 2 to create the venv. EMR Serverless workers run on Amazon Linux 2, so a venv packed on a different distro ships an interpreter linked against libraries that don't exist on the host; exec then fails with "error=2, No such file or directory" even though os.listdir shows the file is there. Your listing also shows python3.9, which suggests the venv wasn't built with the Amazon Linux 2 system Python. Using Amazon Linux 2 and Python 3.7.10 did it for me.
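If you want to double-check which interpreter Amazon Linux 2 gives you, a quick sanity check (assuming you have Docker available locally) is:

docker run --rm amazonlinux:2 bash -c "yum install -y -q python3 && python3 --version"
# should print a 3.7.x version, matching what ends up in the packed venv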
As detailed here, you can use a Dockerfile similar to the one below to generate such a venv. You'd be better off installing from a requirements.txt to make it more reusable, but this gives you the idea.
# Build the venv on Amazon Linux 2 so the packed interpreter matches the EMR Serverless host
FROM --platform=linux/amd64 amazonlinux:2 AS base
RUN yum install -y python3

# Create the venv and put it on PATH so pip installs into it
ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

RUN python3 -m pip install --upgrade pip && \
    python3 -m pip install \
    great_expectations==0.15.6 \
    venv-pack==0.2.0

# Pack the venv into a relocatable tarball
RUN mkdir /output && venv-pack -o /output/pyspark_ge.tar.gz

# Export just the tarball as the build output
FROM scratch AS export
COPY --from=base /output/pyspark_ge.tar.gz /
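Then build with BuildKit so the export stage drops the tarball into your working directory, and upload it to the S3 location your spark.archives conf points at (the bucket and key below are taken from your conf; adjust as needed):

DOCKER_BUILDKIT=1 docker build --output . .
aws s3 cp pyspark_ge.tar.gz s3://mybucket/dependencies/myenv.tar.gz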