I'm new to Dataproc and am having trouble running a job that reads from a PostgreSQL database (on a Compute Engine VM) and writes the data to BigQuery.
I created a cluster with the following configuration:
gcloud dataproc clusters create NAME-CLUSTER \
--enable-component-gateway \
--bucket STAGING-BUCKET \
--region southamerica-east1 \
--subnet default \
--public-ip-address \
--master-machine-type e2-standard-2 \
--master-boot-disk-size 100 \
--num-workers 2 \
--worker-machine-type e2-standard-2 \
--worker-boot-disk-size 200 \
--image-version 2.2-ubuntu22 \
--properties dataproc:conda.packages=google-cloud-secret-manager==2.24.0,spark:spark.jars=https://jdbc.postgresql.org/download/postgresql-42.7.7.jar \
--scopes 'https://www.googleapis.com/auth/cloud-platform' \
--initialization-actions 'gs://goog-dataproc-initialization-actions-southamerica-east1/connectors/connectors.sh' \
--metadata spark-bigquery-connector-version=0.42.2 \
--project PROJECT_ID
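The google-cloud-secret-manager package in --properties is installed so the job can fetch secrets (e.g. the database password) at runtime; for context, a minimal example of that pattern, where the project and secret IDs are placeholders:

from google.cloud import secretmanager

def get_secret(project_id: str, secret_id: str, version: str = "latest") -> str:
    """Fetch a secret payload (e.g. the PostgreSQL password) from Secret Manager."""
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/{version}"
    response = client.access_secret_version(name=name)
    return response.payload.data.decode("UTF-8")

# Example call with placeholder IDs:
# db_password = get_secret("PROJECT_ID", "postgres-db-password")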
Here are the two commands I used to submit the job:
#1
gcloud dataproc jobs submit pyspark \
--cluster NAME-CLUSTER \
gs://BUCKET/initial_load.py \
--region southamerica-east1 \
--files=gs://BUCKET/config.json \
--jars=gs://BUCKET/jars/spark-3.5-bigquery-0.42.2.jar \
--project PROJECT_ID
#2
gcloud dataproc jobs submit pyspark \
--cluster NAME-CLUSTER \
gs://BUCKET/initial_load.py \
--region southamerica-east1 \
--files=gs://BUCKET/config.json \
--jars=gs://BUCKET/jars/spark-bigquery-with-dependencies_2.13-0.42.2.jar \
--project PROJECT_ID
Both returned the following error:
25/09/15 19:27:54 INFO GoogleHadoopOutputStream: hflush(): No-op due to rate limit (RateLimiter[stableRate=0.2qps]): readers will *not* yet see flushed data for gs://dataproc-temp-sa-east1-454845955091-uyoz8e6n/27d1a5a7-6e7e-4247-83e2-5a59bc27a244/spark-job-history/application_1757961849781_0003.inprogress [CONTEXT ratelimit_period="1 MINUTES" ]
Starting to read the data from PostgreSQL...
Traceback (most recent call last):
File "/tmp/with-dependencies/initial_load.py", line 74, in <module>
).load()
^^^^^^
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 314, in load
File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 179, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o79.load.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: com.google.cloud.spark.bigquery.BigQueryRelationProvider not a subtype
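For reference, the failing part of initial_load.py follows the usual JDBC-read-then-BigQuery-write pattern; a simplified sketch of it (the host, database, table, and credential values below are placeholders, not the real ones):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("initial_load").getOrCreate()

print("Starting to read the data from PostgreSQL...")

# Read the source table from PostgreSQL over JDBC (the driver JAR comes from spark.jars).
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://POSTGRES_VM_IP:5432/DB_NAME")
    .option("dbtable", "public.SOURCE_TABLE")
    .option("user", "DB_USER")
    .option("password", "DB_PASSWORD")
    .option("driver", "org.postgresql.Driver")
    .load()  # the traceback points at this .load() call
)

# Write to BigQuery with the Spark BigQuery connector (indirect write via a GCS bucket).
(
    df.write.format("bigquery")
    .option("table", "PROJECT_ID.DATASET.TARGET_TABLE")
    .option("temporaryGcsBucket", "STAGING-BUCKET")
    .mode("overwrite")
    .save()
)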
I'm not sure if it's an error in my code or the cluster configuration. Could someone please help me?
Thank you.
I have created and recreated the cluster with updated BigQuery connector versions and changed the job submissions to see if I got different results, but neither attempt worked.
For the error you got, it seems the BigQuery connector is being loaded twice. As stated here, the Spark BigQuery connector is already pre-installed on Dataproc 2.1 and later images, and it is also added automatically when you set the metadata flag:
(--metadata spark-bigquery-connector-version=0.42.X)
You can try removing the connector JAR you are loading manually (--jars=gs://BUCKET/jars/spark-3.5-bigquery-0.42.2.jar), since the BigQuery connector you need is already provided by your cluster configuration. This should resolve the error you are encountering. (This is also my answer on the Google forum.)
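For example, your second submit command with the extra connector JAR dropped would look like this (same bucket and paths as in your question):

gcloud dataproc jobs submit pyspark \
--cluster NAME-CLUSTER \
gs://BUCKET/initial_load.py \
--region southamerica-east1 \
--files=gs://BUCKET/config.json \
--project PROJECT_ID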