apache-spark google-cloud-platform pyspark google-bigquery dataproc

DATAPROC - com.google.cloud.spark.bigquery.BigQueryRelationProvider not a subtype


I'm new to Dataproc and am having trouble running a job that reads from a PostgreSQL database (on a Compute Engine VM) and writes the data to BigQuery.

I created a cluster with the following configuration:

gcloud dataproc clusters create NAME-CLUSTER \
    --enable-component-gateway \
    --bucket STAGING-BUCKET \
    --region southamerica-east1 \
    --subnet default \
    --public-ip-address \
    --master-machine-type e2-standard-2 \
    --master-boot-disk-size 100 \
    --num-workers 2 \
    --worker-machine-type e2-standard-2 \
    --worker-boot-disk-size 200 \
    --image-version 2.2-ubuntu22 \
    --properties dataproc:conda.packages=google-cloud-secret-manager==2.24.0,spark:spark.jars=https://jdbc.postgresql.org/download/postgresql-42.7.7.jar \
    --scopes 'https://www.googleapis.com/auth/cloud-platform' \
    --initialization-actions 'gs://goog-dataproc-initialization-actions-southamerica-east1/connectors/connectors.sh' \
    --metadata spark-bigquery-connector-version=0.42.2 \
    --project PROJECT_ID

These are the two commands I used to submit the job:

#1
gcloud dataproc jobs submit pyspark \
    --cluster NAME-CLUSTER \
    gs://BUCKET/initial_load.py \
    --region southamerica-east1 \
    --files=gs://BUCKET/config.json \
    --jars=gs://BUCKET/jars/spark-3.5-bigquery-0.42.2.jar \
    --project PROJECT_ID 
    
#2
gcloud dataproc jobs submit pyspark \
    --cluster NAME-CLUSTER \
    gs://BUCKET/initial_load.py \
    --region southamerica-east1 \
    --files=gs://BUCKET/config.json \
    --jars=gs://BUCKET/jars/spark-bigquery-with-dependencies_2.13-0.42.2.jar \
    --project PROJECT_ID 

Both returned the following error:

25/09/15 19:27:54 INFO GoogleHadoopOutputStream: hflush(): No-op due to rate limit (RateLimiter[stableRate=0.2qps]): readers will *not* yet see flushed data for gs://dataproc-temp-sa-east1-454845955091-uyoz8e6n/27d1a5a7-6e7e-4247-83e2-5a59bc27a244/spark-job-history/application_1757961849781_0003.inprogress [CONTEXT ratelimit_period="1 MINUTES" ]
Starting to read data from PostgreSQL...
Traceback (most recent call last):
  File "/tmp/with-dependencies/initial_load.py", line 74, in <module>
    ).load()
      ^^^^^^
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 314, in load
  File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 179, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o79.load.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: com.google.cloud.spark.bigquery.BigQueryRelationProvider not a subtype
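
The .load() that fails is the JDBC read near the end of initial_load.py. For context, a simplified sketch of that part of the script (connection details and table names are placeholders, not my real values):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("initial_load").getOrCreate()

# JDBC read from the PostgreSQL VM. This is where the error is raised:
# Spark's data source lookup scans every registered DataSourceRegister
# and trips over the duplicated BigQuery connector on the classpath.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://VM_INTERNAL_IP:5432/DB_NAME")
    .option("dbtable", "public.SOURCE_TABLE")
    .option("user", "DB_USER")
    .option("password", "DB_PASSWORD")
    .option("driver", "org.postgresql.Driver")
    .load()
)

# Write the result to BigQuery using the Spark BigQuery connector.
(
    df.write.format("bigquery")
    .option("writeMethod", "direct")
    .mode("overwrite")
    .save("PROJECT_ID.DATASET.TARGET_TABLE")
)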

I'm not sure if it's an error in my code or the cluster configuration. Could someone please help me?

Thank you.

I have recreated the cluster with updated BigQuery connector versions and changed the job submission commands to see if I would get different results, but it didn't work in either case.


Solution

  • For the error you got, it seems the BigQuery connector is being loaded twice. As stated in the Dataproc documentation, the Spark BigQuery connector is already pre-installed on Dataproc 2.1 and later images, and a specific version is also installed by the connectors initialization action when you pass the metadata flag:

    (--metadata spark-bigquery-connector-version=0.42.X)

    Try removing the JAR you are loading manually (--jars=gs://BUCKET/jars/spark-3.5-bigquery-0.42.2.jar), since the BigQuery connector you need is already available in your cluster configuration. This should resolve the error you are encountering. (This is also my answer in the Google forum.)
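
    For reference, a resubmission without the manual connector jar would look roughly like this (same placeholder names as in your commands). The PostgreSQL JDBC driver is still available to the job because it was added cluster-wide through spark:spark.jars at cluster creation:

    gcloud dataproc jobs submit pyspark \
        --cluster NAME-CLUSTER \
        gs://BUCKET/initial_load.py \
        --region southamerica-east1 \
        --files=gs://BUCKET/config.json \
        --project PROJECT_ID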