postgresql, google-cloud-platform, pyspark, jdbc, dataproc

How to enable outside connections before submitting a PySpark job to Dataproc


I have a PySpark file which will be submitted to Dataproc.

try:
    print("Start writing")
    # Write the DataFrame (df, created earlier in the script) to PostgreSQL over JDBC
    url = "jdbc:postgresql://some-ip:5432/postgres"
    properties = {
        "driver": "org.postgresql.Driver",
        "user": "postgres",
        "password": "root"
    }
    df.write.jdbc(url=url, table="result", mode="overwrite", properties=properties)

except Exception as e:
    print(e)
    sc.stop()

I use the postgresql-42.6.0.jar JDBC driver and my database is PostgreSQL 14.

Here is the error.

An error occurred while calling o86.jdbc.
: org.postgresql.util.PSQLException: The connection attempt failed.
        at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:331)
        at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49)
        at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:247)
        at org.postgresql.Driver.makeConnection(Driver.java:434)
        at org.postgresql.Driver.connect(Driver.java:291)
        at org.apache.spark.sql.execution.datasources.jdbc.connection.BasicConnectionProvider.getConnection(BasicConnectionProvider.scala:49)
...
Caused by: java.net.SocketTimeoutException: connect timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
...

Here is how I submit my job through Google Cloud Shell:

gcloud beta dataproc jobs submit pyspark gs://taro-de-intern/pyspark_postgresql.py \
  --cluster my-cluster \
  --jars gs://my-bucket/postgresql-42.6.0.jar

I suspected it had something to do with the driver, so I downgraded my jar to version 42.4.2, but it didn't work and yielded the same error.

I even tried changing the write to the generic format:

df.write \
    .format("jdbc") \
    .option("driver", "org.postgresql.Driver") \
    .option("url", "jdbc:postgresql://some-ip:5432/postgres") \
    .option("dbtable", "schema.result") \
    .option("user", "postgres") \
    .option("password", "root") \
    .save()

which also yielded the same error.
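The `SocketTimeoutException` in the stack trace points to a network-level problem rather than a driver problem. One quick way to confirm that, independent of Spark, is a plain TCP reachability check (a hypothetical helper; `some-ip` is the same placeholder used above):

```python
import socket

def can_reach(host, port, timeout=5):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures alike.
        return False

# Placeholder host from the question; run this from the Dataproc cluster
# (e.g. in an SSH session on the master node) to test connectivity.
print(can_reach("some-ip", 5432))
```

If this returns `False` from the cluster but `True` from a machine that is allowed to connect, the database's firewall or authorized-networks configuration is the culprit.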


Solution

  • I already sorted it out, so here is the solution. If you are using any cloud database (a SQL instance on GCP, AWS, or Azure):

    Don't forget to allow outside connections.

    Here is where you can enable outside connections on a GCP Cloud SQL instance:

    1. Go to Edit in your SQL instance overview.

    2. Go to Connections.

    3. Add a network (there is no allow-all network when you first open this page) by entering a name for the connection (the name doesn't matter) and your IP address.
      For more information, see the Microsoft documentation on subnet masks.
      Note: this is only for the sake of example, so don't allow all connections (0.0.0.0/0) in real production.

    4. Scroll down to the bottom and click Save.
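The console steps above can also be scripted. A minimal sketch with `gcloud` (the instance name `my-instance` and the IP address are placeholders; note that `--authorized-networks` replaces the whole list, so include any networks you already have):

```shell
# Authorize a single client IP (/32) to connect to the Cloud SQL instance.
# WARNING: this flag overwrites the existing authorized-networks list.
gcloud sql instances patch my-instance \
  --authorized-networks=203.0.113.5/32
```

After the patch completes, re-running the Dataproc job should get past the connection timeout, assuming the IP you authorized matches the one your cluster egresses from.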