Does anyone know where I can find a Docker image for GCP's Dataproc? I've worked with Dataproc clusters and found them to be quite good, but I'd like to develop locally and only move my compute to the cloud when I'm ready to handle a large job. I've found some Docker images that work with PySpark, but I would love to get something that works as smoothly as GCP Dataproc.
You can find the base images in the cloud-dataproc Container Registry (gcr.io/cloud-dataproc); these images are built on top of the Compute Engine image OS that Dataproc clusters use. From there you can use the docker pull command to get a Dataproc base image locally.
You can use the base image under the spark folder. Pull the required base image and run a Spark job on it with the commands below. I experimented with the Dataproc 2.0 image, but other versions can be found in the same folder.
# Pulling the required image
docker pull gcr.io/cloud-dataproc/spark/dataproc_2.0:preview-0.3
# Sample PySpark job
sudo docker run -v /home/sample-spark-app:/home/sample-spark-app gcr.io/cloud-dataproc/spark/dataproc_2.0:preview-0.3 spark-submit --master local[4] /home/sample-spark-app/pi.py
# Sample Spark (Java API) job
sudo docker run -v /home/sample-spark-app:/home/sample-spark-app gcr.io/cloud-dataproc/spark/dataproc_2.0:preview-0.3 spark-submit --class "JavaSparkPi" --master local[4] /home/sample-spark-app/target/simple-project-1.0.jar
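For reference, the pi.py above is assumed to be the stock Spark Pi example (a Monte Carlo estimate of pi); if you want to sanity-check that logic locally before submitting to the container, here is a minimal plain-Python sketch of the same estimate, without Spark:

```python
# Plain-Python sketch of the Monte Carlo pi estimate that the standard
# Spark pi.py example parallelizes across executors. Sample random points
# in the unit square and count the fraction falling inside the quarter
# circle; that fraction approaches pi/4.
import random


def estimate_pi(num_samples: int, seed: int = 42) -> float:
    rng = random.Random(seed)  # fixed seed for a reproducible estimate
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples


if __name__ == "__main__":
    print(estimate_pi(100_000))  # roughly 3.14 for this sample size
```

The Spark version distributes the sampling loop over partitions and sums the counts with a reduce, but the arithmetic is the same, so this is a cheap way to check your job's expected output before moving to the cluster.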
If you want other features on top of the base image, look into the other Spark images under gcr.io/cloud-dataproc.