I am trying to submit a PySpark job on an EMR cluster. The code for the job lives in a zipped package placed in S3:
/bin/spark-submit \
--py-files s3://my-dev/scripts/job-launchers/dev/pipeline.zip \
pipeline.job_1.job_1.py -h
This is the structure of pipeline.zip:
$ unzip -L pipeline.zip
Archive: pipeline.zip
extracting: pipeline/__init__.py
creating: pipeline/common/
inflating: pipeline/common/__init__.py
inflating: pipeline/common/error_message.py
creating: pipeline/job_1/
inflating: pipeline/job_1/__init__.py
inflating: pipeline/job_1/job_1.py
creating: pipeline/job_2/
inflating: pipeline/job_2/__init__.py
inflating: pipeline/job_2/job_2.py
The zipped package is then placed in
s3://my-dev/scripts/job-launchers/dev/
$ aws s3 ls s3://my-dev/scripts/job-launchers/dev/pipeline.zip
2024-10-11 17:54:28 13219 pipeline.zip
After submitting the job I get this error:
/usr/bin/python3: can't open file '/home/hadoop/pipeline.job_1.job_1.py': [Errno 2] No such file or directory
It seems the zip file is not being placed on the PYTHONPATH. Any pointers for troubleshooting would be a great help.
The --py-files option only adds pipeline.zip to the PYTHONPATH; spark-submit treats its first non-option argument as a path to the application file, not as a module name, which is why it looks for /home/hadoop/pipeline.job_1.job_1.py and fails. I think it's more common to keep the required dependencies in the zip file and the main Python file outside the zip, then call it like so:
spark-submit \
--deploy-mode cluster \
--master yarn \
--py-files s3://my-dev/scripts/job-launchers/dev/pipeline.zip \
s3://my-dev/scripts/job-launchers/dev/job.py
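As a rough sketch, job.py (the name and its run() entry point are assumptions here, not from your post) would just import from the packaged modules, since --py-files puts pipeline.zip on the PYTHONPATH of the driver and executors:

# job.py - thin entry point kept outside pipeline.zip, next to it in S3
from pyspark.sql import SparkSession

# Importable because pipeline.zip was shipped via --py-files;
# pipeline.common etc. can be imported the same way.
from pipeline.job_1 import job_1

def main():
    spark = SparkSession.builder.appName("pipeline-job-1").getOrCreate()
    # run() is a hypothetical entry function inside pipeline/job_1/job_1.py
    job_1.run(spark)
    spark.stop()

if __name__ == "__main__":
    main()

That way the main script is passed to spark-submit as a real file path (local or s3://), and the packaged code in the zip is only ever imported, never used as the application file.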