apache-spark, google-cloud-platform, pyspark, google-cloud-dataproc

Hadoop fs configurations in Dataproc Spark code


I came across some Spark code that runs on GCP Dataproc, reading and writing data to GCS. The code sets the Spark configurations below.

spark_session.sparkContext._conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
spark_session.sparkContext._conf.set("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
spark_session.sparkContext._conf.set("fs.gs.auth.service.account.enable", "true")
spark_session.sparkContext._conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
spark_session.sparkContext._conf.set("fs.gs.project.id", "<val>")
spark_session.sparkContext._conf.set("fs.gs.auth.service.account.email", "<val>")
spark_session.sparkContext._conf.set("fs.gs.auth.service.account.private.key.id", "<val>"])
spark_session.sparkContext._conf.set("fs.gs.auth.service.account.private.key", "<val>")

Question:

  1. Why do we need to set the above Hadoop-related configurations? Can we not read data from Cloud Storage directly with spark.read(), as long as the service account attached to the Dataproc cluster has the required access?
  2. Why do we need to use spark_session.sparkContext._conf.set()? Can we not use spark_session.conf.set() instead?

Solution

    1. The GCS configs are optional. By default, on Dataproc clusters, the GCS connector automatically uses the VM's service account to authenticate to GCS, so a plain spark.read() against a gs:// path works without any of these properties (see the first sketch after this list). When the GCS auth properties are specified, the connector uses the user-specified service account instead. Note that fs.gs.auth.service.account.enable and some other auth properties are only available in GCS connector v2, see this doc. In v3, more auth types are supported, and a new property, fs.gs.auth.type, was introduced to explicitly specify the auth type, see this doc.

    2. Both can be used to configure Spark properties, but SparkContext has been around since the beginning, while SparkSession was introduced in Spark 2.0 as a replacement for the earlier SparkContext and SQLContext APIs. So the SparkSession-based spark_session.conf.set() is preferred (see the second sketch below). See this article.
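
For point 1, here is a minimal sketch of what this looks like on a Dataproc cluster with no GCS auth properties set at all. The bucket and paths are placeholders, and it assumes the cluster VM's service account already has access to the bucket:

from pyspark.sql import SparkSession

# On Dataproc the GCS connector ships preinstalled and authenticates with the
# cluster VM's service account, so gs:// paths work without extra configuration.
spark = SparkSession.builder.appName("gcs-default-auth").getOrCreate()

# Placeholder bucket/paths -- replace with real ones.
df = spark.read.option("header", "true").csv("gs://my-bucket/input/data.csv")
df.write.mode("overwrite").parquet("gs://my-bucket/output/data_parquet")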
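
For point 2, a sketch of the SparkSession-based style, with placeholder values: Hadoop-level properties such as the GCS ones are typically passed with the spark.hadoop. prefix when the session is built, while spark.conf.set() is used for runtime Spark SQL properties.

from pyspark.sql import SparkSession

# GCS connector (Hadoop-level) properties passed via the "spark.hadoop." prefix
# at session build time; "<val>" stands for real values.
spark = (
    SparkSession.builder
    .appName("gcs-session-config")
    .config("spark.hadoop.fs.gs.auth.service.account.enable", "true")
    .config("spark.hadoop.fs.gs.project.id", "<val>")
    .getOrCreate()
)

# Runtime Spark SQL properties can be changed later through the session itself.
spark.conf.set("spark.sql.shuffle.partitions", "64")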