I have an Azure SPN that allows me to read data from ADLS Gen2 using a certificate (.pem) file. When I use the Azure SDK, I can easily create the following credential object
from azure.identity import CertificateCredential

azure_credential = CertificateCredential(
    tenant_id=tenant_id,
    client_id=client_id,
    certificate_data=client_certificate_bytes,
)
and then use it to access ADLS Gen2
DataLakeServiceClient(
    account_url=ACCOUNT_URL,
    credential=azure_credential,
).get_file_system_client(blob_container_name)
I need to use the same SPN from Apache Spark, but I cannot find a way to achieve this. Judging by the authentication methods listed in the ABFS documentation (https://hadoop.apache.org/docs/stable/hadoop-azure/abfs.html#Authentication), it does not seem to be possible. My goal is to set the certificate in the Spark conf the same way I used to set the client secret, like below:
spark_session.conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
spark_session.conf.set("fs.adl.oauth2.client.id", client_id)
spark_session.conf.set("fs.adl.oauth2.credential", client_secret)
spark_session.conf.set("fs.adl.oauth2.refresh.url", f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")
According to the official Databricks documentation, there are several ways to connect to Azure Data Lake Storage (ADLS) Gen2: OAuth 2.0 with a Microsoft Entra ID (Azure AD) service principal and client secret, shared access signatures (SAS), and storage account access keys.
Note that configuring a certificate (.pem) directly in the Spark configuration, the way a client secret is configured, is currently not supported.
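If it helps as a point of reference, the certificate itself can still be exchanged for a storage-scoped bearer token with the same SDK credential shown in the question. The sketch below reuses the tenant_id, client_id and client_certificate_bytes variables from above; it only illustrates what the credential does outside Spark and is not a Spark integration:

from azure.identity import CertificateCredential

# Reuses tenant_id, client_id and client_certificate_bytes from the question.
credential = CertificateCredential(
    tenant_id=tenant_id,
    client_id=client_id,
    certificate_data=client_certificate_bytes,
)

# Request an OAuth 2.0 access token scoped to Azure Storage.
token = credential.get_token("https://storage.azure.com/.default")
print(token.expires_on)  # expiry as epoch seconds; token.token holds the bearer token string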