python amazon-web-services apache-spark emr

How to bootstrap installation of Python modules on Amazon EMR?


I want to do something really basic: fire up a Spark cluster through the EMR console and run a Spark script that depends on a Python package (for example, Arrow). What is the most straightforward way of doing this?


Solution

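  • The most straightforward way is to create a bash script containing your installation commands, copy it to S3, and set a bootstrap action from the console that points to your script. EMR runs bootstrap actions on every node of the cluster, so the modules end up installed wherever your Spark code runs.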

    Here's an example I'm using in production:

    s3://mybucket/bootstrap/install_python_modules.sh

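    #!/bin/bash -xe
    # -x traces each command as it runs; -e aborts on the first failure.
    
    # Python modules that are neither in the standard library nor preinstalled
    # on the Amazon Machine Image:
    sudo pip install -U \
      awscli            \
      boto              \
      ciso8601          \
      ujson             \
      workalendar
    
    # psycopg2 is installed from the yum repository rather than from pip.
    sudo yum install -y python-psycopg2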
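
If you would rather script the upload and the cluster launch than click through the console, the same steps can be sketched with the AWS CLI. This is only a minimal sketch and not part of the setup described above: the cluster name, release label, instance type and count, and the bootstrap action name are placeholder assumptions, and only the S3 path comes from the example. It prints the commands by default and runs them only when RUN=1 is set.

```shell
#!/bin/bash
set -euo pipefail

# Placeholder values (assumptions): replace with your own bucket and script name.
BUCKET="mybucket"
SCRIPT="install_python_modules.sh"
SCRIPT_URI="s3://${BUCKET}/bootstrap/${SCRIPT}"

# Print a command for review, or execute it when RUN=1 is set.
run() {
  if [[ "${RUN:-0}" == "1" ]]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

# Step 1: copy the bootstrap script to S3.
run aws s3 cp "${SCRIPT}" "${SCRIPT_URI}"

# Step 2: launch a Spark cluster that runs the script as a bootstrap action.
# Name, release label, and instance settings are illustrative assumptions.
run aws emr create-cluster \
  --name "spark-with-python-modules" \
  --release-label "emr-5.36.0" \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --bootstrap-actions "Path=${SCRIPT_URI},Name=InstallPythonModules"
```

Run it once without RUN=1 to check the printed commands, then again with RUN=1 once the bucket, release label, and instance settings match your account.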