docker, apache-spark, pyspark, amazon-emr, apache-zeppelin

Use pyspark shell or Zeppelin with Docker for EMR


I'm using Docker as the YARN container runtime on EMR. To submit a step to the cluster, I do this:

spark-submit \
--deploy-mode cluster \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE_NAME \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE_NAME \
recipe.py

If I want to open a pyspark shell or use Zeppelin with Docker as the YARN container runtime, how do I do it? If I set the same configuration options for the pyspark shell, it doesn't seem to be able to find the libraries installed in the Docker image.

PYSPARK_PYTHON=ipython \
PYSPARK_DRIVER_PYTHON=ipython \
pyspark \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE_NAME \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE_NAME

Solution

    1. It is not possible to use Docker with the pyspark shell: the Docker runtime requires --deploy-mode cluster, while the pyspark shell can only run in client mode (see the first sketch after this list).

    2. The Docker runtime can be used in Zeppelin by adding the following configurations to the Spark interpreter (a spark-defaults alternative is sketched after this list):

      spark.submit.deployMode=cluster
      spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker
      spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=[image path]
      spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker
      spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=[image path]
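
To illustrate point 1: forcing cluster mode onto the shell is rejected by spark-submit itself before anything is launched (the exact wording may vary by Spark version):

pyspark --deploy-mode cluster
# Error: Cluster deploy mode is not applicable to Spark shells.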
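
For point 2, a rough alternative to editing the interpreter settings in the Zeppelin UI is to put the same properties in spark-defaults.conf on the master node, so that Zeppelin's Spark interpreter inherits them when it launches. This is a sketch only; [image path] stands in for your actual image URI:

# /etc/spark/conf/spark-defaults.conf (sketch; substitute your own image URI)
spark.submit.deployMode                                      cluster
spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE                docker
spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE        [image path]
spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE          docker
spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE  [image path]

Note that setting spark.submit.deployMode=cluster globally also affects other spark-submit invocations on that node, so the interpreter-level configuration is the more targeted option.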