amazon-web-services, apache-spark, amazon-s3, pyspark, amazon-emr

spark-submit using --py-files option could not find path to modules


I am trying to submit a PySpark job to an EMR cluster. The code for the job lives in a zipped package stored in S3:

/bin/spark-submit \
  --py-files s3://my-dev/scripts/job-launchers/dev/pipeline.zip \
  pipeline.job_1.job_1.py -h 

This is the structure of pipeline.zip:

$ unzip -L pipeline.zip 
Archive:  pipeline.zip
 extracting: pipeline/__init__.py     
   creating: pipeline/common/
  inflating: pipeline/common/__init__.py  
  inflating: pipeline/common/error_message.py  
   creating: pipeline/job_1/
  inflating: pipeline/job_1/__init__.py  
  inflating: pipeline/job_1/job_1.py  
   creating: pipeline/job_2/
  inflating: pipeline/job_2/__init__.py  
  inflating: pipeline/job_2/job_2.py

The zipped package is then placed in

s3://my-dev/scripts/job-launchers/dev/

$ aws s3 ls s3://my-dev/scripts/job-launchers/dev/pipeline.zip
2024-10-11 17:54:28      13219 pipeline.zip

After submitting the job I get this error:

/usr/bin/python3: can't open file '/home/hadoop/pipeline.job_1.job_1.py': [Errno 2] No such file or directory

It seems the zip file is not being added to the PYTHONPATH. Any pointers for troubleshooting would be a great help.


Solution

  • I think it's more common to keep the required dependencies in the zip file and have the main Python file outside the zip, then call it like so (a sketch of such a job.py follows the command):

    spark-submit --deploy-mode cluster --master yarn --py-files s3://my-dev/scripts/job-launchers/dev/pipeline.zip s3://my-dev/scripts/job-launchers/dev/job.py
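
    As a minimal sketch, job.py could look like the following. It assumes that pipeline/job_1/job_1.py exposes a main(spark, args) entry point; that function name is an assumption, it is not shown in the question. The zip passed via --py-files is added to the PYTHONPATH of the driver and executors, so the pipeline package becomes importable:

    # job.py -- hypothetical driver script kept outside pipeline.zip
    import sys

    from pyspark.sql import SparkSession

    # "pipeline.job_1.job_1" matches the zip layout shown in the question;
    # it is importable because --py-files puts pipeline.zip on the PYTHONPATH.
    from pipeline.job_1 import job_1


    def main(args):
        spark = SparkSession.builder.appName("job_1").getOrCreate()
        try:
            # Delegate the actual work to the packaged module
            # (assumed entry point, adjust to your module's real API).
            job_1.main(spark, args)
        finally:
            spark.stop()


    if __name__ == "__main__":
        main(sys.argv[1:])

    The key point is that the application script passed to spark-submit must be a real file (local or on S3), not a dotted module path like pipeline.job_1.job_1.py; modules inside the zip are reached via imports from that script.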