apache-spark, pyspark, google-cloud-dataproc, mssql-jdbc, google-conscrypt

dataframe.show() does not work in PySpark inside a Debian VM (Dataproc)


I'm currently using GCP and Dataproc, and I'm new to Apache Spark, PySpark, and Debian VMs. I'm trying to replicate, inside a Dataproc cluster (Debian VM), a Spark job that runs perfectly on my local machine (Windows 10, VS Code, Spark 3.3.1): ingestion from SQL Server into a Spark DataFrame via the JDBC driver.

When I try it inside this Debian VM, spark.read works correctly but dataframe.show() does not.

Debian VM Configuration: Debian 10 with Hadoop 3.2 and Spark 3.1.3.

JDBC driver: mssql-jdbc-11.2.1.jre8.jar from here

Java version: openjdk version "1.8.0_352". To get spark.read to run correctly, I had to delete the java.security file at this path in the Debian VM: $JAVA_HOME/jre/lib/security/java.security
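
To double-check which java.security file the driver JVM actually reads, here is a minimal sketch; it assumes an active SparkSession named spark and uses py4j to read the JVM's own java.home property:

java_home = spark._jvm.java.lang.System.getProperty("java.home")
# On JDK 8, java.home usually points at $JAVA_HOME/jre, so this resolves
# to the same java.security path mentioned above.
print(java_home + "/lib/security/java.security")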

PySpark launch

pyspark --jars gs://bucket/mssql-jdbc-11.2.1.jre8.jar
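
For reference, the jar can also be attached when building the session programmatically; a minimal sketch (the app name is an arbitrary placeholder):

from pyspark.sql import SparkSession

# Attach the JDBC driver jar at session creation instead of on the
# pyspark command line; the gs:// path mirrors the one above.
spark = (
    SparkSession.builder
    .appName("mssql-ingest")
    .config("spark.jars", "gs://bucket/mssql-jdbc-11.2.1.jre8.jar")
    .getOrCreate()
)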

Parameters

server_name = "jdbc:sqlserver://sqlhost:1433"
database_name = "dbname"
url = server_name + ";databaseName=" + database_name + ";encrypt=false;"
query = "SELECT * FROM dbo.tablename"
username = "user"
password = "pass"
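
Before involving Spark at all, a driver-side connection smoke test can rule out basic networking problems; a minimal sketch, assuming an active SparkSession named spark and that the driver jar is visible to the driver JVM via py4j:

# Open a raw JDBC connection from the driver JVM only (no executors involved).
jvm = spark._jvm
conn = jvm.java.sql.DriverManager.getConnection(url, username, password)
print(conn.isValid(5))  # True if the login handshake succeeded
conn.close()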

spark.read

dataFrame = spark.read \
    .format("jdbc") \
    .option("url", url) \
    .option("user", username) \
    .option("password", password) \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .option("query", query) \
    .load()

Results in the Debian VM:

dataFrame.show(5)
22/11/17 13:08:16 WARN org.apache.spark.sql.catalyst.util.package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
22/11/17 13:09:57 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (cluster-name.internal executor 2): com.microsoft.sqlserver.jdbc.SQLServerException: The driver could not establish a secure connection to SQL Server by using Secure Sockets Layer (SSL) encryption. Error: "Connection reset ClientConnectionId:".
        at com.microsoft.sqlserver.jdbc.SQLServerConnection.terminate(SQLServerConnection.java:3806)
        at com.microsoft.sqlserver.jdbc.TDSChannel.enableSSL(IOBuffer.java:1906)
        at com.microsoft.sqlserver.jdbc.SQLServerConnection.connectHelper(SQLServerConnection.java:3329)
        at com.microsoft.sqlserver.jdbc.SQLServerConnection.login(SQLServerConnection.java:2950)
        at com.microsoft.sqlserver.jdbc.SQLServerConnection.connectInternal(SQLServerConnection.java:2790)
        at com.microsoft.sqlserver.jdbc.SQLServerConnection.connect(SQLServerConnection.java:1663)
        at com.microsoft.sqlserver.jdbc.SQLServerDriver.connect(SQLServerDriver.java:1064)
        at org.apache.spark.sql.execution.datasources.jdbc.connection.BasicConnectionProvider.getConnection(BasicConnectionProvider.scala:49)
        at org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider$.create(ConnectionProvider.scala:77)
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$createConnectionFactory$1(JdbcUtils.scala:62)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:272)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:505)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:508)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.IOException: Connection reset ClientConnectionId:700149ae-3483-4315-8c2e-de1bc11ce6b3
        at com.microsoft.sqlserver.jdbc.TDSChannel$SSLHandshakeInputStream.readInternal(IOBuffer.java:974)
        at com.microsoft.sqlserver.jdbc.TDSChannel$SSLHandshakeInputStream.read(IOBuffer.java:961)
        at com.microsoft.sqlserver.jdbc.TDSChannel$ProxyInputStream.readInternal(IOBuffer.java:1207)
        at com.microsoft.sqlserver.jdbc.TDSChannel$ProxyInputStream.read(IOBuffer.java:1194)
        at org.conscrypt.ConscryptEngineSocket$SSLInputStream.readFromSocket(ConscryptEngineSocket.java:920)
        at org.conscrypt.ConscryptEngineSocket$SSLInputStream.processDataFromSocket(ConscryptEngineSocket.java:884)
        at org.conscrypt.ConscryptEngineSocket$SSLInputStream.access$100(ConscryptEngineSocket.java:706)
        at org.conscrypt.ConscryptEngineSocket.doHandshake(ConscryptEngineSocket.java:230)
        at org.conscrypt.ConscryptEngineSocket.startHandshake(ConscryptEngineSocket.java:209)
        at com.microsoft.sqlserver.jdbc.TDSChannel.enableSSL(IOBuffer.java:1795)
        ... 25 more

But, like I said, this happens only with .show() (and also with .write()).

.printSchema() works fine.

spark.read itself works fine after deleting java.security; before that, I got the same SSL exception.

Why is the SSL exception showing up again? Any clues?


Solution

  • It seems that this is caused by the MS SQL JDBC connector jar not being compatible with the Conscrypt library that Dataproc uses to improve performance. I would advise filing a bug with the MS SQL JDBC connector owners to fix this compatibility issue. This also explains the .show() vs .read() difference: load() only opens a JDBC connection from the driver to resolve the schema, while actions like .show() and .write() run tasks on executors, each of which opens its own connection and still goes through Conscrypt, as the executor stack trace shows.

    As a workaround you can disable Conscrypt when creating a Dataproc cluster using the dataproc:dataproc.conscrypt.provider.enable=false property, but it may have a negative impact on performance (a quick way to check the active security providers is sketched after the command below):

    gcloud dataproc clusters create ${CLUSTER_NAME} \
        . . . \
        --properties=dataproc:dataproc.conscrypt.provider.enable=false \
        . . .
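
    To confirm which security providers the driver JVM ends up with, here is a minimal sketch (an assumption-based check using py4j through an active SparkSession named spark; with Conscrypt disabled, it should no longer appear at the top of the list):

    # List the registered Java security providers, highest priority first.
    # Note: this inspects the driver JVM only; executors run in separate JVMs.
    providers = spark._jvm.java.security.Security.getProviders()
    for p in providers:
        print(p.getName())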