I am trying to get Scala Spark to connect to a Postgres database, without success. Without Docker (that is, with everything installed locally) it works as expected. I have postgresql-42.7.5.jar in Spark's jars folder. I can connect to the database via pgAdmin 4 and browse the tables, so it is not a connection issue with the database itself. Yet once everything is put into a Docker image, it no longer works.
Spark is running under Docker; so is the Jupyter notebook.
Postgres is running locally under macOS.
To reproduce the issue, start the Docker image here: https://github.com/JosephMRally/spark-dp-101/tree/loadRealEstateData
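For reference, the read is a plain Spark JDBC load along the lines below. This is a minimal sketch; the table name and credentials are placeholders, not the actual project values.

import org.apache.spark.sql.SparkSession

// In the notebook a SparkSession already exists as `spark`;
// it is built here only to keep the example self-contained.
val spark = SparkSession.builder()
  .appName("pg-read")
  .getOrCreate()

// Works when everything runs locally; throws the exception below
// once Spark runs inside Docker.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/hive_metastore")
  .option("driver", "org.postgresql.Driver")
  .option("dbtable", "real_estate") // placeholder table name
  .option("user", "postgres")       // placeholder credentials
  .option("password", "postgres")
  .load()

This fails with: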
org.postgresql.util.PSQLException: Connection to localhost:5432 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:352)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:273)
at org.postgresql.Driver.makeConnection(Driver.java:446)
at org.postgresql.Driver.connect(Driver.java:298)
at org.apache.spark.sql.execution.datasources.jdbc.connection.BasicConnectionProvider.getConnection(BasicConnectionProvider.scala:49)
at org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProviderBase.create(ConnectionProvider.scala:102)
at org.apache.spark.sql.jdbc.JdbcDialect.$anonfun$createConnectionFactory$1(JdbcDialects.scala:161)
at org.apache.spark.sql.jdbc.JdbcDialect.$anonfun$createConnectionFactory$1$adapted(JdbcDialects.scala:157)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.getQueryOutputSchema(JDBCRDD.scala:63)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:58)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:241)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:37)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:346)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:229)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:211)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:172)
... 49 elided
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
at
tl;dr: change this line:
.option("url","jdbc:postgresql://host.docker.internal:5432/hive_metastore")
"Spark is running under Docker; so is the Jupyter notebook. Postgres is running locally under macOS."
That's the problem. From the stack trace you can see that the Postgres host Spark tries to connect to is localhost:5432.
Note that Spark and Postgres are not on the same network: localhost inside the Spark Docker container is not the same localhost as the host machine that is running Postgres. You can reach the host machine's network from inside the container by using the special hostname host.docker.internal.
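Putting it together, the read from inside the container looks roughly like this; it is a sketch, with the table name and credentials as placeholders:

// Same read as before, but pointing at the Docker host instead of
// the container's own localhost.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host.docker.internal:5432/hive_metastore")
  .option("driver", "org.postgresql.Driver")
  .option("dbtable", "real_estate") // placeholder table name
  .option("user", "postgres")       // placeholder credentials
  .option("password", "postgres")
  .load()

df.show(5) // quick check that the connection now succeeds

One caveat: host.docker.internal is resolved automatically by Docker Desktop on macOS and Windows. On plain Linux Docker you would have to add it yourself, e.g. with --add-host=host.docker.internal:host-gateway.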