I am trying to create a Dataproc cluster that points to a remote Hive metastore, so that I can access the Hive tables from this cluster. I am using the command below to create a Dataproc 2.0 cluster connected to the remote Hive metastore:
Create Cluster Command:
gcloud dataproc clusters create wl1-cluster-1 \
--region us-east1 \
--subnet projects/shared-vpc-admin/regions/us-east1/subnetworks/dev-us-east1-01 \
--enable-component-gateway \
--no-address \
--scopes 'https://www.googleapis.com/auth/cloud-platform' \
--master-machine-type n1-standard-32 \
--master-boot-disk-size 1000 \
--num-workers 10 \
--worker-machine-type n1-standard-32 \
--worker-boot-disk-size 1000 \
--image-version 2.0-debian10 \
--properties 'spark:spark.sql.hive.metastore.version=2.3.0,spark:spark.sql.hive.metastore.jars=maven,hive:hive.metastore.schema.verification=false,hive:javax.jdo.option.ConnectionURL=jdbc:mysql://test-mysql.gcp-dev.glb.us.mycompany.net:3306/metastore,hive:javax.jdo.option.ConnectionUserName=metastore,hive:javax.jdo.option.ConnectionPassword=XXXX' \
--project sample_gcp_project
But cluster creation fails during the activate-component-hive-metastore step.
Error Message:
<13>Sep 10 05:47:06 google-dataproc-startup[1734]: <13>Sep 10 05:47:06 activate-component-hive-metastore[3064]: nc: connect to wl1-cluster-1-m port 9083 (tcp) failed: Connection refused
<13>Sep 10 05:47:07 google-dataproc-startup[1734]: <13>Sep 10 05:47:07 activate-component-hive-metastore[3064]: nc: connect to wl1-cluster-1-m port 9083 (tcp) failed: Connection refused
<13>Sep 10 05:47:07 google-dataproc-startup[1734]: <13>Sep 10 05:47:07 activate-component-hive-metastore[3064]: 'nc -v -z -w 1 wl1-cluster-1-m 9083' attempt 23 failed! Sleeping 1s.
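This log means the startup script gave up waiting for anything to listen on port 9083 on the master, i.e. the metastore service itself never came up (often because it could not reach or validate the remote MySQL database). One way to narrow this down is to SSH to the master node and rerun the same check by hand, then look at the service logs; the `nc` invocation is copied from the log above, and `hive-metastore` is the systemd unit name used on Dataproc images:

```shell
# Repeat the startup script's health check: is anything listening on 9083?
nc -v -z -w 1 wl1-cluster-1-m 9083

# Check whether the metastore service started, and why it may have failed.
systemctl status hive-metastore
sudo journalctl -u hive-metastore --no-pager | tail -n 50
```

If the journal shows JDBC connection errors or schema verification failures, the problem is between the metastore service and your MySQL database, not between HiveServer2 and the metastore.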
Few questions:
By default, a Hive Metastore service runs on the (first) master node of a Dataproc cluster, listening on port 9083, and HiveServer2 is configured to use it. In /etc/hive/conf/hive-site.xml, you will find:
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://<master-hostname>:9083</value>
</property>
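A quick way to see which metastore URI a node is actually configured with is to pull that value out of the file on the master node (the sed expression is just one way to strip the XML tags):

```shell
# Print the configured metastore URI from hive-site.xml.
grep -A 1 '<name>hive.metastore.uris</name>' /etc/hive/conf/hive-site.xml \
  | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
```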
If you need the HiveServer2 of your cluster to use the Metastore of another cluster, set --properties hive:hive.metastore.uris=thrift://<another-cluster-master-hostname>:9083 at cluster creation time.
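As a concrete sketch (cluster names here are illustrative, reusing wl1-cluster-1-m from the question as the master of the existing cluster):

```shell
# Hypothetical example: create a second cluster whose HiveServer2 uses the
# metastore already running on wl1-cluster-1's master node.
gcloud dataproc clusters create wl1-cluster-2 \
  --region us-east1 \
  --properties 'hive:hive.metastore.uris=thrift://wl1-cluster-1-m:9083' \
  --project sample_gcp_project
```

Note this replaces the javax.jdo.option.* properties entirely: the new cluster talks Thrift to the existing metastore service, which is the only component that talks to the database.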
Or, to use the managed Dataproc Metastore service, attach it when creating the cluster, for example:
gcloud dataproc clusters create example-cluster \
--dataproc-metastore=projects/PROJECT_ID/locations/LOCATION/services/example-service \
--region=LOCATION
The clusters/Metastore need to be in the same VPC network so they can reach each other over internal IP (or hostname). Spark in the cluster is automatically configured to use the Hive config hive.metastore.uris to find the Metastore, so you don't need any additional configuration in your Spark code.
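Once the cluster is up, a simple sanity check from the master node is to list databases through Spark's SQL CLI; if this returns the databases from the remote metastore, Spark is wired up correctly (the databases shown depend on what your metastore contains):

```shell
# Spark picks up hive.metastore.uris automatically; listing databases
# confirms it can reach the (remote) metastore.
spark-sql -e 'SHOW DATABASES;'
```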