Tags: azure, apache-spark, databricks, azure-data-lake, azure-data-lake-gen2

Access ADLS Gen2 using pem/certificate from Apache Spark


I have an Azure SPN which allows me to read data from ADLS Gen2 using a certificate (.pem) file. When I use the Azure SDK, I can easily create the following credential object

from azure.identity import CertificateCredential

azure_credential = CertificateCredential(
    tenant_id=tenant_id,
    client_id=client_id,
    certificate_data=client_certificate_bytes,
)

and then use it to access ADLS Gen2

from azure.storage.filedatalake import DataLakeServiceClient

file_system_client = DataLakeServiceClient(
    account_url=ACCOUNT_URL,
    credential=azure_credential,
).get_file_system_client(blob_container_name)
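This works fine, for example when listing the paths in the container (the directory name below is only an illustration):

# Illustration only: list the paths under a directory in the container
for path in file_system_client.get_paths(path="some/directory"):
    print(path.name)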

I need to use the same SPN from Apache Spark, but I cannot find a way to achieve this. Judging by the authentication methods listed in the ABFS documentation, it doesn't seem to be possible (https://hadoop.apache.org/docs/stable/hadoop-azure/abfs.html#Authentication). My goal is to set the certificate in the Spark conf the same way I used to set the client secret, like below.

spark_session.conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
spark_session.conf.set("fs.adl.oauth2.client.id", client_id)
spark_session.conf.set("fs.adl.oauth2.credential", client_secret)
spark_session.conf.set("fs.adl.oauth2.refresh.url", f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

Solution

  • Connecting to ADLS Gen2 in Databricks: Configuration Options

    According to the official Databricks documentation, there are several ways to connect to Azure Data Lake Storage (ADLS) Gen2:

    1. OAuth 2.0 with an Azure Service Principal
    2. Shared Access Signatures (SAS)
    3. Account Keys

    Configuring a certificate (.pem) directly in the Spark configuration, the way a client secret is set, is currently not supported.
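
    If the certificate is the only credential you have, one possible workaround is to keep the certificate on the Azure SDK side: authenticate with CertificateCredential, request a user delegation key, mint a short-lived SAS for the container, and hand that SAS to Spark through the documented SAS settings. The sketch below is not an official pattern; it assumes the service principal has the RBAC permissions required to issue user delegation SAS tokens, that storage_account, blob_container_name and the file path are placeholders, and that your runtime ships an ABFS driver with FixedSASTokenProvider (current Databricks runtimes do).

    from datetime import datetime, timedelta, timezone

    from azure.identity import CertificateCredential
    from azure.storage.filedatalake import (
        DataLakeServiceClient,
        FileSystemSasPermissions,
        generate_file_system_sas,
    )

    # Authenticate with the certificate, exactly as in the question.
    azure_credential = CertificateCredential(
        tenant_id=tenant_id,
        client_id=client_id,
        certificate_data=client_certificate_bytes,
    )

    service_client = DataLakeServiceClient(account_url=ACCOUNT_URL, credential=azure_credential)

    # Request a user delegation key and mint a short-lived SAS for the container.
    now = datetime.now(timezone.utc)
    delegation_key = service_client.get_user_delegation_key(now, now + timedelta(hours=1))
    sas_token = generate_file_system_sas(
        account_name=storage_account,            # placeholder, e.g. the storage account name
        file_system_name=blob_container_name,
        credential=delegation_key,
        permission=FileSystemSasPermissions(read=True, list=True),
        expiry=now + timedelta(hours=1),
    )

    # Point Spark at the container using SAS auth (keys from the Databricks documentation).
    spark_session.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "SAS")
    spark_session.conf.set(f"fs.azure.sas.token.provider.type.{storage_account}.dfs.core.windows.net",
                           "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
    spark_session.conf.set(f"fs.azure.sas.fixed.token.{storage_account}.dfs.core.windows.net", sas_token)

    # Placeholder path, adjust to your layout.
    df = spark_session.read.parquet(
        f"abfss://{blob_container_name}@{storage_account}.dfs.core.windows.net/some/path"
    )

    Because the SAS is short-lived, a job would need to regenerate it (or rerun this setup) when it expires.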