[postgresql] [docker] [scala] [containers]

Scala Spark is not able to connect to a Postgres table or the Postgres hive_metastore


I am trying to get Scala Spark to connect to a Postgres database, without success. Without Docker (that is, with everything installed locally) it works as expected. I have PostgreSQL-42.7.5.jar in Spark's jars folder. I am able to connect to the database via pgAdmin 4 and browse the tables, so it is not a connection issue with the database itself. Yet once everything is put into a Docker image, it no longer works.

Spark is running under Docker; so is the Jupyter notebook.

Postgres is running locally under macOS.

To reproduce the issue, start the Docker image here: https://github.com/JosephMRally/spark-dp-101/tree/loadRealEstateData
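For reference, the failing read looks roughly like this (a minimal sketch; the table name and credentials are assumptions, not taken from the repo):

    // Hypothetical reproduction of the failing JDBC read; runs inside the Spark container
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/hive_metastore") // localhost here is the container, not macOS
      .option("dbtable", "real_estate")   // assumed table name
      .option("user", "postgres")         // assumed credentials
      .option("password", "postgres")
      .option("driver", "org.postgresql.Driver")
      .load()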

The connection attempt fails with the following exception:

org.postgresql.util.PSQLException: Connection to localhost:5432 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.
  at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:352)
  at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
  at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:273)
  at org.postgresql.Driver.makeConnection(Driver.java:446)
  at org.postgresql.Driver.connect(Driver.java:298)
  at org.apache.spark.sql.execution.datasources.jdbc.connection.BasicConnectionProvider.getConnection(BasicConnectionProvider.scala:49)
  at org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProviderBase.create(ConnectionProvider.scala:102)
  at org.apache.spark.sql.jdbc.JdbcDialect.$anonfun$createConnectionFactory$1(JdbcDialects.scala:161)
  at org.apache.spark.sql.jdbc.JdbcDialect.$anonfun$createConnectionFactory$1$adapted(JdbcDialects.scala:157)
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.getQueryOutputSchema(JDBCRDD.scala:63)
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:58)
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:241)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:37)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:346)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:229)
  at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:211)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:172)
  ... 49 elided
Caused by: java.net.ConnectException: Connection refused (Connection refused)
  at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
  at

Solution

  • tl;dr: change the JDBC URL to this:

    .option("url","jdbc:postgresql://host.docker.internal:5432/hive_metastore")
    

    You said: Spark is running under Docker, and so is the Jupyter notebook, while Postgres is running locally under macOS.

    That's the problem. From the stack trace you can see that the Postgres host Spark tries to connect to is localhost:5432.

    Note that Spark and Postgres are not on the same network: localhost inside the Spark Docker container refers to the container itself, not to the host machine that is running Postgres.

    From inside the container you can reach services on the host machine by using the special hostname host.docker.internal (available out of the box on Docker Desktop for macOS).

    see: https://docs.docker.com/desktop/features/networking/#i-want-to-connect-from-a-container-to-a-service-on-the-host
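    Putting it together, the corrected read looks like this (a minimal sketch; the table name and credentials are assumptions carried over from the question):

        // Same JDBC read, but pointing at the macOS host instead of the container's localhost
        val df = spark.read
          .format("jdbc")
          .option("url", "jdbc:postgresql://host.docker.internal:5432/hive_metastore")
          .option("dbtable", "real_estate")   // assumed table name
          .option("user", "postgres")         // assumed credentials
          .option("password", "postgres")
          .option("driver", "org.postgresql.Driver")
          .load()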