scala apache-spark jupyter-notebook google-cloud-dataproc apache-toree

Running Spark + Scala + Jupyter on Dataproc


I haven't yet managed to get Spark, Scala, and Jupyter to co-operate. Does anyone have a simple recipe? Which version of each component did you use?


Solution

  • Apache Toree is compatible with Dataproc's 1.0 image, which currently includes Spark 1.6.1. I tried it with the preview image, which includes the Spark 2.0 preview, without success. To install Toree on the Dataproc master node, run:

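    # Install pip for Python 3, then Jupyter for the current user
    sudo apt install python3-pip
    pip3 install --user jupyter
    # Point Toree at the Spark installation on the image
    export SPARK_HOME=/usr/lib/spark
    # --pre lets pip pick a pre-release build of Toree
    pip3 install --pre --user toree
    # pip --user places the jupyter executable in ~/.local/bin
    export PATH="$HOME/.local/bin:$PATH"
    # Register the Toree kernel with Jupyter
    jupyter toree install --user --spark_home="$SPARK_HOME"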
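
To confirm that the kernel was actually registered, something like the following can be run on the master afterwards. This is only a minimal sketch: `toree_kernel_installed` is a helper name made up for illustration, and matching on the substring `toree` is an assumption, since the exact kernel name depends on the Toree build you get.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of Jupyter or Toree):
# succeeds if `jupyter kernelspec list` mentions a kernel containing "toree".
# Assumption: the Toree kernel name contains the substring "toree".
toree_kernel_installed() {
  jupyter kernelspec list 2>/dev/null | grep -qi 'toree'
}

# Report the result without aborting if jupyter is missing from PATH.
if toree_kernel_installed; then
  echo "Toree kernel is registered with Jupyter"
else
  echo "Toree kernel not found; check PATH and rerun 'jupyter toree install'"
fi
```

If the kernel shows up, `jupyter notebook --no-browser` starts the server on the master. How you reach it from your own machine (for example, through an SSH tunnel) depends on your setup.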