I'm currently using GCP and Dataproc, and I'm new to Apache Spark, PySpark, and Debian VMs. I'm trying to replicate inside a Dataproc cluster (Debian VM) a Spark job that runs perfectly on my local machine (Windows 10, VS Code, Spark 3.3.1): ingestion from SQL Server into a Spark DataFrame via the JDBC driver.
When I try it inside this Debian VM, the spark.read call works correctly, but dataFrame.show() does not.
Debian VM Configuration: Debian 10 with Hadoop 3.2 and Spark 3.1.3.
JDBC driver: mssql-jdbc-11.2.1.jre8.jar
Java version: openjdk version "1.8.0_352". To get spark.read to work at all, I had to delete the java.security file at this path on the Debian VM: $JAVA_HOME/jre/lib/security/java.security
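(For context, the security providers the JVM actually registers can be inspected from PySpark through py4j. A minimal sketch, assuming an active spark session; it reflects only the node the driver runs on, not the workers:)
# List the JVM security providers visible to the PySpark driver via py4j.
# On a stock Dataproc node Conscrypt shows up here; after deleting
# java.security it no longer does. Driver-side only.
providers = spark._jvm.java.security.Security.getProviders()
for p in providers:
    print(p.getName(), p.getVersion())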
PySpark launch
pyspark --jars gs://bucket/mssql-jdbc-11.2.1.jre8.jar
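(If the session is created in a script instead of the pyspark shell, the same jar can presumably be passed via the spark.jars config before the session starts; a minimal sketch:)
# Equivalent to `pyspark --jars ...` when building the session in a script.
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("mssql-ingestion") \
    .config("spark.jars", "gs://bucket/mssql-jdbc-11.2.1.jre8.jar") \
    .getOrCreate()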
Parameters
# no trailing ";" on server_name, to avoid a double separator in the url
server_name = "jdbc:sqlserver://sqlhost:1433"
database_name = "dbname"
url = server_name + ";databaseName=" + database_name + ";encrypt=false;"
query = "SELECT * FROM dbo.tablename"
username = "user"
password = "pass"
spark.read
# `spark` is the SparkSession object provided by the pyspark shell;
# the "query" option takes the query variable defined above
dataFrame = spark.read \
    .format("jdbc") \
    .option("url", url) \
    .option("user", username) \
    .option("password", password) \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .option("query", query) \
    .load()
Results in the Debian VM:
dataFrame.show(5)
22/11/17 13:08:16 WARN org.apache.spark.sql.catalyst.util.package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
22/11/17 13:09:57 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (cluster-name.internal executor 2): com.microsoft.sqlserver.jdbc.SQLServerException: The driver could not establish a secure connection to SQL Server by using Secure Sockets Layer (SSL) encryption. Error: "Connection reset ClientConnectionId:".
at com.microsoft.sqlserver.jdbc.SQLServerConnection.terminate(SQLServerConnection.java:3806)
at com.microsoft.sqlserver.jdbc.TDSChannel.enableSSL(IOBuffer.java:1906)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.connectHelper(SQLServerConnection.java:3329)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.login(SQLServerConnection.java:2950)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.connectInternal(SQLServerConnection.java:2790)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.connect(SQLServerConnection.java:1663)
at com.microsoft.sqlserver.jdbc.SQLServerDriver.connect(SQLServerDriver.java:1064)
at org.apache.spark.sql.execution.datasources.jdbc.connection.BasicConnectionProvider.getConnection(BasicConnectionProvider.scala:49)
at org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider$.create(ConnectionProvider.scala:77)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$createConnectionFactory$1(JdbcUtils.scala:62)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:272)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:505)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:508)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.IOException: Connection reset ClientConnectionId:700149ae-3483-4315-8c2e-de1bc11ce6b3
at com.microsoft.sqlserver.jdbc.TDSChannel$SSLHandshakeInputStream.readInternal(IOBuffer.java:974)
at com.microsoft.sqlserver.jdbc.TDSChannel$SSLHandshakeInputStream.read(IOBuffer.java:961)
at com.microsoft.sqlserver.jdbc.TDSChannel$ProxyInputStream.readInternal(IOBuffer.java:1207)
at com.microsoft.sqlserver.jdbc.TDSChannel$ProxyInputStream.read(IOBuffer.java:1194)
at org.conscrypt.ConscryptEngineSocket$SSLInputStream.readFromSocket(ConscryptEngineSocket.java:920)
at org.conscrypt.ConscryptEngineSocket$SSLInputStream.processDataFromSocket(ConscryptEngineSocket.java:884)
at org.conscrypt.ConscryptEngineSocket$SSLInputStream.access$100(ConscryptEngineSocket.java:706)
at org.conscrypt.ConscryptEngineSocket.doHandshake(ConscryptEngineSocket.java:230)
at org.conscrypt.ConscryptEngineSocket.startHandshake(ConscryptEngineSocket.java:209)
at com.microsoft.sqlserver.jdbc.TDSChannel.enableSSL(IOBuffer.java:1795)
... 25 more
But, as I said, this happens not only with .show() but also with .write(); .printSchema() works fine.
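To make the pattern concrete (dataFrame as defined above; the output path below is just a placeholder):
dataFrame.printSchema()  # works: only needs the schema
dataFrame.show(5)        # fails with the SSL exception above
dataFrame.write.mode("overwrite").parquet("gs://bucket/tmp/out")  # fails the same way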
As for spark.read itself: it works fine after deleting java.security; before that, it raised the same SSL exception.
Why is the SSL exception showing up again? Any clues?
It seems that this is caused by the MS SQL JDBC driver jar not being compatible with the Conscrypt library that Dataproc uses to improve performance. This would also explain why spark.read works while .show() fails: the schema is resolved over a single JDBC connection opened by the driver (on the node where you deleted java.security, which unregisters Conscrypt there), whereas .show() and .write() launch tasks on the executors, and each worker opens its own JDBC connection with Conscrypt still active; the org.conscrypt.ConscryptEngineSocket frames in your stack trace point the same way. I would advise filing a bug with the MS SQL JDBC connector owners to fix this compatibility issue.
As a workaround, you can disable Conscrypt when creating a Dataproc cluster via the dataproc:dataproc.conscrypt.provider.enable=false
property, but this may have a negative impact on performance:
gcloud dataproc clusters create ${CLUSTER_NAME} \
    ... \
    --properties=dataproc:dataproc.conscrypt.provider.enable=false \
    ...
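Once the cluster is recreated with Conscrypt disabled, the original job should work unchanged. For reference, a minimal end-to-end sketch (connection details are the placeholders from the question):
from pyspark.sql import SparkSession

# Assumes a cluster created with dataproc:dataproc.conscrypt.provider.enable=false
# and the driver jar available, e.g. via --jars gs://bucket/mssql-jdbc-11.2.1.jre8.jar
spark = SparkSession.builder.appName("mssql-ingestion").getOrCreate()

url = "jdbc:sqlserver://sqlhost:1433;databaseName=dbname;encrypt=false;"

dataFrame = spark.read \
    .format("jdbc") \
    .option("url", url) \
    .option("user", "user") \
    .option("password", "pass") \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .option("query", "SELECT * FROM dbo.tablename") \
    .load()

dataFrame.printSchema()  # driver-side: resolves the schema only
dataFrame.show(5)        # executor-side: should now connect without the SSL error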