sql-server, pyspark, apache-spark-sql, azure-sql-database, spark-jdbc

Table gets deleted when trying to overwrite the data in it from Databricks Spark


I am trying to write dataframe data into a table in Azure SQL from Databricks using PySpark. The table dbo.test already exists in the database, and I am able to read from it before I execute the write operation below.

testDf.write.format("com.microsoft.sqlserver.jdbc.spark").mode("overwrite")\
        .option("truncate", "false")\
        .option("url", azure_sql_url).option("dbtable", 'dbo.test')\
        .option("databaseName", database_name)\
        .option("user", username) \
        .option("password", password) \
        .option("encrypt", "true")\
        .option("hostNameInCertificate", "*.database.windows.net")\
        .option("bulkCopyBatchSize", 10000).option("bulkCopyTableLock", "true")\
        .option("bulkCopyTimeout", "6000000")\
        .save()

After executing this command, the following error is returned:

java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.schemaString(Lorg/apache/spark/sql/Dataset;Ljava/lang/String;Lscala/Option;)Ljava/lang/String;

Surprisingly, the dbo.test table gets deleted.

Can someone help me understand why this is happening? The same code works fine in another environment.


Solution

  • java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.schemaString(Lorg/apache/spark/sql/Dataset;Ljava/lang/String;Lscala/Option;)Ljava/lang/String;

    The cause of the error is a version mismatch between Spark SQL and the Apache Spark connector for SQL Server (com.microsoft.sqlserver.jdbc.spark). Also, mode("overwrite") with truncate set to false drops the existing table and re-creates it by default. That is why dbo.test disappears: the connector drops the table first, then hits the NoSuchMethodError while building the new schema, so it never gets to re-create and load the table.
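    As a side note, if you want overwrite to preserve the existing table definition rather than drop and re-create it, you can set the JDBC writer's truncate option to true. A minimal sketch, reusing the connection variables from the question:

    # With truncate=true, mode("overwrite") issues TRUNCATE TABLE instead of
    # DROP TABLE, so the table definition (indexes, constraints) survives
    # even if the subsequent write fails partway.
    testDf.write.format("jdbc").mode("overwrite") \
        .option("truncate", "true") \
        .option("url", azure_sql_url) \
        .option("dbtable", "dbo.test") \
        .option("user", username) \
        .option("password", password) \
        .save()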

    Here are some supported versions of the Spark connector with their respective Spark versions, per the Microsoft documentation referenced below:

    Connector (Maven coordinate)                           Spark version
    com.microsoft.azure:spark-mssql-connector:1.0.2        2.4.x
    com.microsoft.azure:spark-mssql-connector_2.12:1.1.0   3.0.x
    com.microsoft.azure:spark-mssql-connector_2.12:1.2.0   3.1.x

    Resolution:

    If you are using a Databricks Runtime version above 10, downgrade to Databricks Runtime 9.1 LTS or lower and use the connector version that matches that runtime's Spark version.
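    To confirm which Spark version (and therefore which connector build) your cluster runs, a minimal check in a notebook cell (spark is the session object Databricks predefines):

    # e.g. prints '3.1.2' on Databricks Runtime 9.1 LTS, which pairs with
    # com.microsoft.azure:spark-mssql-connector_2.12:1.2.0
    print(spark.version)

    Then run the write through the connector as before: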

    df_name.write.format("com.microsoft.sqlserver.jdbc.spark").mode("overwrite")\
            .option("truncate", "false")\
            .option("url", "Azure_sql_url").option("dbtable", 'dbo.test')\
            .option("databaseName", "databasename")\
            .option("user", "username") \
            .option("password", "password") \
            .option("encrypt", "true")\
            .option("hostNameInCertificate", "*.database.windows.net")\
            .option("bulkCopyBatchSize", 10000).option("bulkCopyTableLock", "true")\
            .option("bulkCopyTimeout", "6000000")\
            .save()
    


    Alternatively, you can write through Spark's built-in JDBC data source with df_name.write.format("jdbc"), which does not depend on the Microsoft connector at all.

    Sample code:

    df_name.write.format("jdbc").mode("overwrite")\
            .option("truncate", "false")\
            .option("url", "Azure_sql_url").option("dbtable", 'dbo.test')\
            .option("databaseName", "databasename")\
            .option("user", "username") \
            .option("password", "password") \
            .option("encrypt", "true")\
            .option("hostNameInCertificate", "*.database.windows.net")\
            .option("bulkCopyBatchSize", 10000).option("bulkCopyTableLock", "true")\
            .option("bulkCopyTimeout", "6000000")\
            .save()
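
    To verify the overwrite succeeded, you can read the table back through the same JDBC source. A minimal sketch using the same placeholder connection values:

    check_df = spark.read.format("jdbc") \
        .option("url", "Azure_sql_url") \
        .option("dbtable", "dbo.test") \
        .option("user", "username") \
        .option("password", "password") \
        .load()
    check_df.show(5)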
    

    You can also refer to a similar issue reported on GitHub, or to a similar Stack Overflow thread.

    Reference: https://learn.microsoft.com/en-us/sql/connect/spark/connector?view=sql-server-ver16