I am trying to write dataframe data into a table in Azure SQL from Databricks using PySpark. The table dbo.test already exists in the database, and I am able to read it before executing the write operation below.
testDf.write.format("com.microsoft.sqlserver.jdbc.spark").mode("overwrite")\
.option("truncate", "false")\
.option("url", azure_sql_url).option("dbtable", 'dbo.test')\
.option("databaseName", database_name)\
.option("user", username) \
.option("password", password) \
.option("encrypt", "true")\
.option("hostNameInCertificate", "*.database.windows.net")\
.option("bulkCopyBatchSize", 10000).option("bulkCopyTableLock", "true")\
.option("bulkCopyTimeout", "6000000")\
.save()
After executing this command the following error is returned:
java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.schemaString(Lorg/apache/spark/sql/Dataset;Ljava/lang/String;Lscala/Option;)Ljava/lang/String;
Surprisingly, the dbo.test table gets deleted.
Can someone help me understand why this is happening? The same code works fine in another environment.
java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.schemaString(Lorg/apache/spark/sql/Dataset;Ljava/lang/String;Lscala/Option;)Ljava/lang/String;
The cause of the error is a version mismatch between Spark SQL and the Spark connector for SQL Server (spark-mssql-connector). Also, mode("overwrite") by default drops the table if it already exists and re-creates it. That is why dbo.test disappears: the drop happens first, and the connector then fails with the NoSuchMethodError before the table is re-created.
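As an aside: if you want overwrite semantics without risking the table definition, you can set truncate to true so the existing table is truncated rather than dropped. A minimal sketch, assuming the connector honors Spark's standard truncate option (it reuses the variables from your question):
# With truncate=true, mode("overwrite") truncates the existing table
# instead of dropping and re-creating it, so dbo.test keeps its schema
# even if the write itself later fails.
testDf.write.format("com.microsoft.sqlserver.jdbc.spark").mode("overwrite")\
.option("truncate", "true")\
.option("url", azure_sql_url).option("dbtable", 'dbo.test')\
.option("user", username)\
.option("password", password)\
.save()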
Here are some supported versions of the Spark connector for SQL Server with their respective Spark versions (from the Microsoft docs referenced below):
Spark 2.4.x : com.microsoft.azure:spark-mssql-connector:1.0.2
Spark 3.0.x : com.microsoft.azure:spark-mssql-connector_2.12:1.1.0
Spark 3.1.x : com.microsoft.azure:spark-mssql-connector_2.12:1.2.0
Resolution:
If you are using a Databricks runtime version greater than 10, you need to downgrade to Databricks Runtime 9.1 LTS or lower, and use the respective connector (newer runtimes ship Spark versions these connector releases do not support).
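To pick the matching connector, first check which Spark version your cluster runs; in a Databricks notebook:
# Databricks notebooks expose a ready-made SparkSession named `spark`;
# this prints e.g. '3.0.1' on Databricks Runtime 7.3 LTS.
print(spark.version)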
I downgraded the Databricks runtime version to 7.3 LTS (which runs Spark 3.0.1):
Then I installed the appropriate library for Spark 3.0.x, which is com.microsoft.azure:spark-mssql-connector_2.12:1.1.0.
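(In Databricks you can install it from the cluster's Libraries tab: Install new > Maven, then paste the coordinate above.)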
Then I tried your code and it works fine:
df_name.write.format("com.microsoft.sqlserver.jdbc.spark").mode("overwrite")\
.option("truncate", "false")\
.option("url", "Azure_sql_url").option("dbtable", 'dbo.test')\
.option("databaseName", "databasename")\
.option("user", "username") \
.option("password", "password") \
.option("encrypt", "true")\
.option("hostNameInCertificate", "*.database.windows.net")\
.option("bulkCopyBatchSize", 10000).option("bulkCopyTableLock", "true")\
.option("bulkCopyTimeout", "6000000")\
.save()
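To confirm the write landed, you can read the table back with the same connector and count the rows; a minimal sketch (URL and credentials are placeholders, as above):
# Read dbo.test back and count rows to verify the overwrite succeeded.
result_df = spark.read.format("com.microsoft.sqlserver.jdbc.spark")\
.option("url", "Azure_sql_url")\
.option("dbtable", 'dbo.test')\
.option("databaseName", "databasename")\
.option("user", "username")\
.option("password", "password")\
.load()
print(result_df.count())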
Alternatively, you can use Spark's built-in JDBC source directly with df_name.write.format("jdbc").
Sample code:
# The bulkCopy* options belong to the spark-mssql-connector and have no
# effect with the built-in "jdbc" source; use Spark's generic batchsize
# option to control the insert batch size instead.
df_name.write.format("jdbc").mode("overwrite")\
.option("truncate", "false")\
.option("url", "Azure_sql_url").option("dbtable", 'dbo.test')\
.option("databaseName", "databasename")\
.option("user", "username") \
.option("password", "password") \
.option("encrypt", "true")\
.option("hostNameInCertificate", "*.database.windows.net")\
.option("batchsize", 10000)\
.save()
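One trade-off to be aware of: the built-in JDBC source issues regular batched INSERT statements, whereas the spark-mssql-connector uses SQL Server bulk-copy APIs, so the JDBC route is slower on large writes but does not depend on a version-matched connector.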
You can also refer to a similar issue here on GitHub or this similar SO thread.
Reference: https://learn.microsoft.com/en-us/sql/connect/spark/connector?view=sql-server-ver16