sql-serverapache-sparkpysparkdatabricks

Spark sending LIMIT to SQL Server on display function


Update: Both my version and the version from Fred Alisson are working fine on a brand new cluster. It was a problem on my end with dialect configuration.

I was trying to query a SQL Server with spark but I cannot get it to display the resulting dataframe as spark is sending a literal "LIMIT" to the SQL Server resulting in a SQLServerException.

The following query runs absolutly fine

df = spark.read.format('sqlserver') \
        .option('host', sql_server_host) \
        .option('port', sql_server_port) \
        .option('database', sql_server_database) \
        .option('user', sql_server_user) \
        .option('password', sql_server_password) \
        .option("dbtable", 'dbo.TestTable')
        .load()

but when running

df.display() or display(df)

I get the following error:

SQLServerException: Incorrect syntax near '10001'.

Upon futher inspection I can see that spark is sending the following query to the sql server which obviouly fails because SQL Server does not support LIMIT.

SELECT TestColumn FROM dbo.TestTable LIMIT 10001 

Is there anything wrong with my configuration and is there any workaround?


Solution

  • By default Spark uses "LIMIT" (SQL Server is "TOP"). What you can try:

    1 - Read using JDBC connection

    2 - Specify the table using a custom query

    Code example for the first option:

    df = spark.read.format('jdbc') \
    .option('url', f'jdbc:sqlserver://{sql_server_host}:{sql_server_port};databaseName={sql_server_database}') \
    .option('dbtable', 'dbo.TestTable') \
    .option('user', sql_server_user) \
    .option('password', sql_server_password) \
    .option('driver', 'com.microsoft.sqlserver.jdbc.SQLServerDriver') \
    .load()
    

    Code example for the second option:

    query = "(SELECT * FROM dbo.TestTable) AS TestTable"
    
    df = spark.read.format('jdbc') \
    .option('url', f'jdbc:sqlserver://{sql_server_host}:{sql_server_port};databaseName={sql_server_database}') \
    .option('query', query) \
    .option('user', sql_server_user) \
    .option('password', sql_server_password) \
    .option('driver', 'com.microsoft.sqlserver.jdbc.SQLServerDriver') \
    .load()
    

    Try these ways and let me know if you need more help!