I came across some Spark code that runs on GCP Dataproc and reads and writes data to GCS. The code sets the following Spark configurations:
spark_session.sparkContext._conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
spark_session.sparkContext._conf.set("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
spark_session.sparkContext._conf.set("fs.gs.auth.service.account.enable", "true")
spark_session.sparkContext._conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
spark_session.sparkContext._conf.set("fs.gs.project.id", "<val>")
spark_session.sparkContext._conf.set("fs.gs.auth.service.account.email", "<val>")
spark_session.sparkContext._conf.set("fs.gs.auth.service.account.private.key.id", "<val>"])
spark_session.sparkContext._conf.set("fs.gs.auth.service.account.private.key", "<val>")
Questions: Are these GCS configurations actually required, and is it better to set such properties through the SparkContext or the SparkSession?
The GCS configs are optional. By default, on Dataproc clusters, the GCS connector automatically authenticates to GCS with the VM's service account. When the GCS auth properties are specified, the connector uses the user-specified service account instead. Note that fs.gs.auth.service.account.enable and some of the other auth properties are only available in GCS connector v2 (see this doc). In v3, more auth types are supported, and a new property fs.gs.auth.type was introduced to explicitly specify the auth type (see this doc).
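As a minimal sketch of the default behavior (assuming a Dataproc cluster with the pre-installed connector and placeholder bucket paths), no fs.gs.* properties are needed at all:

from pyspark.sql import SparkSession

# On Dataproc the GCS connector is pre-installed and pre-configured,
# and it authenticates with the cluster VM's service account by default.
spark = SparkSession.builder.appName("gcs-defaults").getOrCreate()

# Reads and writes to GCS work without any fs.gs.* auth settings.
df = spark.read.parquet("gs://some-bucket/input/")              # placeholder path
df.write.mode("overwrite").parquet("gs://some-bucket/output/")  # placeholder path

If a different service account is needed, the auth properties can also be passed as Hadoop properties (e.g. with the spark.hadoop. prefix) at cluster creation or job submission rather than being mutated on a running context.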
Both can be used to configure Spark properties. SparkContext has been available since the earliest Spark releases, while SparkSession was introduced in Spark 2.0 as the unified entry point that wraps the SparkContext and replaces the older SQLContext and HiveContext APIs, so SparkSession is preferred. See this article.
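A minimal sketch of the SparkSession-based equivalent of the snippet above, assuming GCS connector v2 property names; the values are placeholders, and the Hadoop filesystem options are passed with the spark.hadoop. prefix at session build time so they end up in the Hadoop configuration:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gcs-explicit-service-account")
    # Filesystem implementation classes (already set by default on Dataproc).
    .config("spark.hadoop.fs.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .config("spark.hadoop.fs.AbstractFileSystem.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    # Explicit service-account credentials (placeholder values).
    .config("spark.hadoop.fs.gs.project.id", "<project-id>")
    .config("spark.hadoop.fs.gs.auth.service.account.enable", "true")
    .config("spark.hadoop.fs.gs.auth.service.account.email", "<sa-email>")
    .config("spark.hadoop.fs.gs.auth.service.account.private.key.id", "<key-id>")
    .config("spark.hadoop.fs.gs.auth.service.account.private.key", "<private-key>")
    .getOrCreate()
)

Setting the options on the builder also avoids the private _conf attribute and applies them before the underlying SparkContext is created, so they are picked up when the Hadoop configuration is built.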