I'm trying to connect to an Azure SQL Database from an Azure Synapse workspace notebook using PySpark, and I'd like to use Active Directory integrated authentication. Here is what I've tried:
jdbc_df = spark.read \
.format("com.microsoft.sqlserver.jdbc.spark") \
.option("url", "jdbc:sqlserver://my_server_name.database.windows.net:1433") \
.option("database","my_db_name") \
.option("dbtable", "my_table_or_query") \
.option("authentication", "ActiveDirectoryIntegrated") \
.option("encrypt", "true") \
.option("hostNameInCertificate", "*.database.windows.net") \
.load()
I've also tried the same approach with a different syntax:
jdbcUrl = "jdbc:sqlserver://my_server_name.database.windows.net:1433;database=my_db_name;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;Authentication=ActiveDirectoryIntegrated"
pushdown_query = "(SELECT col1 FROM my_table_name) AS tbl"
connectionProperties = {
    "driver" : "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
display(df)
In both cases I get the error:
IllegalArgumentException: KrbException: Cannot locate default realm
What am I doing wrong?
Finally, I have found the solution! First of all, there has to be a working linked service to the Azure SQL database in your Synapse Analytics workspace that uses the authentication type "System Assigned Managed Identity". Then you can reference it in your PySpark notebook. And don't be confused that the method getConnectionString is used to get the access token: despite its name, it really returns a token, not a connection string.
jdbcUrl = "jdbc:sqlserver://my_server_name.database.windows.net:1433;database=my_db_name;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30"
token=TokenLibrary.getConnectionString("AzureSQLLinkedServiceName")
pushdown_query = "(SELECT col1 FROM my_table_name) as tbl"
connectionProperties = {
    "driver" : "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "accessToken" : token
}
}
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
display(df)
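As a side note, the long semicolon-delimited JDBC URL above is easy to mistype. A small helper that assembles it from keyword arguments keeps the settings readable; this is just a plain-Python sketch with no Azure dependencies, and the server and database names are placeholders:

```python
def build_sqlserver_jdbc_url(server, database, port=1433, **settings):
    """Assemble a semicolon-delimited SQL Server JDBC URL from parts."""
    base = f"jdbc:sqlserver://{server}:{port};database={database}"
    # Dicts preserve insertion order (Python 3.7+), so settings
    # appear in the URL in the order they were passed.
    extras = ";".join(f"{key}={value}" for key, value in settings.items())
    return f"{base};{extras}" if extras else base

jdbcUrl = build_sqlserver_jdbc_url(
    "my_server_name.database.windows.net",
    "my_db_name",
    encrypt="true",
    trustServerCertificate="false",
    hostNameInCertificate="*.database.windows.net",
    loginTimeout=30,
)
```

The resulting string is identical to the hand-written one, so it can be passed to spark.read.jdbc unchanged.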