apache-spark  google-cloud-platform  google-cloud-dataproc

How to do a clean installation of an upgraded version of Spark on Dataproc


I created a Dataproc cluster using the 2.1 image which comes with Spark 3.3.2. I am planning to perform a clean upgrade to Spark 3.5 and had a few queries:

  1. What are the recommended steps for upgrading Apache Spark from version 3.3 to 3.5?
    I am currently doing a wget of the tgz followed by changing the PATH and SPARK_HOME. While this allows me to run the specific 3.5 binary, I'm not sure how to use the 3.5 binary while performing a gcloud dataproc jobs submit.

  2. Are there any specific considerations or potential pitfalls I should be aware of during the upgrade process?
    This applies especially where Dataproc tooling is involved. For example, GCS connector availability is one thing that comes to mind. Is there anything else I may be missing out on with a custom install compared to the pre-packaged Spark?

  3. What configuration or dependency changes, other than SPARK_HOME, do I need to address while upgrading? Copying /etc/spark/conf is one thing that comes to mind. Is there anything else?
    How do I ensure that the Spark UI on CG (Component Gateway) points to the upgraded version?

  4. Are there any known issues or incompatibilities that I should watch out for when installing custom Spark on Dataproc?


Solution

  • Dataproc ships its own fork of Spark, which keeps the Spark API but has many internal changes. Replacing it with a user-installed Spark is not recommended, for the following reasons:

    1. The Dataproc fork contains many bug fixes, security patches, plugins (e.g., metrics listeners), features, and performance optimizations. For example, Enhanced Flexibility Mode (EFM) won't work with vanilla Spark.

    2. Spark on Dataproc is built to be compatible with the other components on the image, including Java, Scala, Hadoop/Hive, and the GCS and BigQuery connectors. The BigQuery connector in particular is Spark-version specific; at the time of writing, the latest release (0.34.0) doesn't support Spark 3.5 yet.

    3. Issues with a user-customized Spark won't be supported by the Dataproc team.

    If you are looking to use Spark 3.5, I'd suggest two options:

    1. Wait for the newer minor version of Dataproc which includes Spark 3.5;

    2. Set up your own Spark cluster on top of GCE or GKE without Dataproc.

    In general, Dataproc supports customization through init actions or custom images, but it should be limited to configs, dependencies, and plugins. Modifying a component itself is complicated and likely to run into many issues, so it is not recommended.

    Going back to your questions:

    q1. Spark binaries are installed under /usr/lib/spark and configs are under /etc/spark. You can simply replace the binaries in /usr/lib/spark. It's also okay to download your Spark binaries into another directory, update SPARK_HOME in /etc/environment, and add ${SPARK_HOME}/bin to PATH. Jobs submitted through gcloud dataproc jobs submit are fetched by the Dataproc Agent on the master node, which in turn uses spark-submit to submit the job to YARN, so you just need to make sure the spark-submit command resolves to your Spark.
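As a rough illustration of the second approach in q1, here is a minimal Python sketch that rewrites /etc/environment-style text so that SPARK_HOME points at a custom install and its bin directory comes first on PATH, and then checks which spark-submit would be picked up. The install path /opt/spark-3.5 and the function names are hypothetical, and the sketch assumes the file consists of plain KEY=value lines. It writes the literal directory into PATH rather than ${SPARK_HOME}/bin, on the assumption that /etc/environment is not subject to shell variable expansion.

```python
import shutil
from typing import Optional


def set_spark_home(env_text: str, spark_home: str) -> str:
    """Return KEY=value text with SPARK_HOME set and <spark_home>/bin first on PATH.

    If the input has no PATH line, PATH is left untouched (nothing is invented).
    """
    spark_bin = f"{spark_home}/bin"
    out = []
    seen_home = False
    for line in env_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "SPARK_HOME":
            # Replace any existing SPARK_HOME entry.
            out.append(f"SPARK_HOME={spark_home}")
            seen_home = True
        elif sep and key.strip() == "PATH":
            # Drop empty entries and any earlier copy of spark_bin, then prepend it.
            entries = [e for e in value.strip().strip('"').split(":")
                       if e and e != spark_bin]
            out.append('PATH="' + ":".join([spark_bin] + entries) + '"')
        else:
            out.append(line)  # comments and unrelated variables pass through
    if not seen_home:
        out.append(f"SPARK_HOME={spark_home}")
    return "\n".join(out) + "\n"


def resolve_spark_submit(path_value: Optional[str] = None) -> Optional[str]:
    """Return the spark-submit found on the given PATH (or the current PATH), if any."""
    return shutil.which("spark-submit", path=path_value)


if __name__ == "__main__":
    # Hypothetical file contents and install location, for illustration only.
    sample = 'PATH="/usr/local/bin:/usr/bin"\nSPARK_HOME=/usr/lib/spark\n'
    print(set_spark_home(sample, "/opt/spark-3.5"), end="")
    print("spark-submit resolves to:", resolve_spark_submit())
```

Running resolve_spark_submit() on the master node, in an environment comparable to the one the Dataproc Agent uses, is one way to confirm that spark-submit points at the intended install before submitting jobs through gcloud.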

    q3. The Spark UI is integrated with the YARN UI (it is linked from the YARN UI), so it should work automatically; no additional work is needed when customizing Spark.