amazon-web-services, docker, apache-spark, amazon-emr, spark-submit

EMR Spark deploy mode when using Docker


I am deploying a Spark job on AWS EMR and packaging all my dependencies using Docker. My spark-submit command, built in Python, looks like this:

    ...
    cmd = (
            f"spark-submit --deploy-mode {deploy_mode} "
            f"--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker "
            f"--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE={docker_image} "
            f"--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG={config} "
            f"--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro "
            f"--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker "
            f"--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE={docker_image} "
            f"--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG={config} "
            f"--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro "
            f"{path}"
        )
    ...

It worked as expected when my deploy_mode is cluster, but I don't see any of my Docker dependencies when deploy_mode is client. Can anyone explain why this happens, and is it normal?


Solution

  • The Docker containers are managed by YARN on EMR: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-docker.html

    In client mode, your Spark driver doesn't run in a Docker container because that process is not managed by YARN; it is executed directly on the node that runs the spark-submit command. In cluster mode, the driver is managed by YARN and is therefore executed inside a Docker container.
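    A minimal sketch of how you might build the command while making this behavior explicit. It assumes the same placeholder names as the question (`deploy_mode`, `docker_image`, `config`, `path`) and simply warns when client mode is chosen, since the driver will then run outside the Docker image:

    ```python
    import warnings

    def build_spark_submit(deploy_mode, docker_image, config, path):
        """Assemble the spark-submit command with YARN Docker runtime confs."""
        if deploy_mode == "client":
            # In client mode the driver is not managed by YARN, so it runs
            # on the submitting node, outside the Docker image.
            warnings.warn(
                "client mode: driver will run outside the Docker container"
            )
        docker_confs = {
            "YARN_CONTAINER_RUNTIME_TYPE": "docker",
            "YARN_CONTAINER_RUNTIME_DOCKER_IMAGE": docker_image,
            "YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG": config,
            "YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS": "/etc/passwd:/etc/passwd:ro",
        }
        parts = [f"spark-submit --deploy-mode {deploy_mode}"]
        # Apply the same Docker settings to both executors and the
        # application master, as in the question.
        for prefix in ("spark.executorEnv", "spark.yarn.appMasterEnv"):
            for key, value in docker_confs.items():
                parts.append(f"--conf {prefix}.{key}={value}")
        parts.append(path)
        return " ".join(parts)
    ```

    Note that even in client mode the executors are still YARN-managed, so the executor-side confs still take effect; it is only the driver process that escapes the container.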