azure, azure-blob-storage, databricks, databricks-unity-catalog, databricks-workflows

Databricks Managed Storage Access In API Job Runs


I am using Databricks for data processing and need to access external storage locations from job runs. I set up external location access in Databricks and can access the data from notebooks on an all-purpose cluster, but my jobs can't access the storage accounts using managed access.

I have tried configuring the job clusters with:

spark.conf.set("fs.azure.account.auth.type.StorageAcct.blob.core.windows.net", "ManagedIdentity") spark.conf.set("fs.azure.account.auth.type.StorageAcct.blob.core.windows.net", "ManagedIdentity")

as well as the following in the Spark configuration:

f"fs.azure.account.auth.type.storage.blob.core.windows.net": "ManagedIdentity"


Solution

  • Use the code below to configure the managed identity.

    spark.conf.set("fs.azure.account.auth.type", "OAuth")
    spark.conf.set("fs.azure.account.oauth.provider.type","org.apache.hadoop.fs.azurebfs.oauth2.MsiTokenProvider")
    spark.conf.set("fs.azure.account.oauth2.msi.tenant", "YourTenantId")
    spark.conf.set("fs.azure.account.oauth2.client.id", "YourClientId")
    
    
    
    df = spark.read.format("csv").option("header", "true").load("abfss://data@vjgsblob.dfs.core.windows.net/csvs/sample_orders.csv")
    
    display(df)
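
    These keys set the default authentication for every storage account the cluster touches. If you only want to change it for one account, the ABFS driver also accepts per-account keys suffixed with the account's dfs endpoint (note dfs.core.windows.net, not blob.core.windows.net, since abfss:// paths go through the ABFS driver). A sketch using the example account from above:

    # Per-account variant of the same settings, scoped to one storage account.
    # "vjgsblob" is the example account from the read above; IDs are placeholders.
    spark.conf.set("fs.azure.account.auth.type.vjgsblob.dfs.core.windows.net", "OAuth")
    spark.conf.set("fs.azure.account.oauth.provider.type.vjgsblob.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.MsiTokenProvider")
    spark.conf.set("fs.azure.account.oauth2.msi.tenant.vjgsblob.dfs.core.windows.net", "YourTenantId")
    spark.conf.set("fs.azure.account.oauth2.client.id.vjgsblob.dfs.core.windows.net", "YourClientId")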
    

    Your Databricks managed identity is in the workspace's managed resource group; the client ID and tenant ID come from there (a programmatic way to read them is sketched after these steps).

    Go to the managed resource group.

    Next, click on the managed identity.

    Then, under Properties, you will find the client ID and tenant ID.
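
    If you would rather look those values up programmatically, here is a minimal sketch using the azure-identity and azure-mgmt-msi packages; the subscription, resource group, and identity names are placeholders, not values from this setup.

    # Hedged sketch: fetch the managed identity's client and tenant IDs.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.msi import ManagedServiceIdentityClient

    msi_client = ManagedServiceIdentityClient(DefaultAzureCredential(), "<subscription-id>")
    identity = msi_client.user_assigned_identities.get(
        resource_group_name="<managed-resource-group>",
        resource_name="<managed-identity-name>",
    )
    print(identity.client_id, identity.tenant_id)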

    Also, make sure you have granted the Storage Blob Data Contributor role on the storage account to the Databricks managed identity.
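
    If you want to script that role assignment, a minimal sketch with azure-mgmt-authorization follows; the GUID is the built-in role definition ID for Storage Blob Data Contributor, while the subscription, scope, and principal IDs are placeholders.

    # Hedged sketch: assign Storage Blob Data Contributor to the managed identity.
    import uuid
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.authorization import AuthorizationManagementClient
    from azure.mgmt.authorization.models import RoleAssignmentCreateParameters

    subscription_id = "<subscription-id>"
    # Scope the assignment to the storage account.
    scope = (
        f"/subscriptions/{subscription_id}/resourceGroups/<resource-group>"
        "/providers/Microsoft.Storage/storageAccounts/<storage-account>"
    )
    # Built-in role definition ID for Storage Blob Data Contributor.
    role_definition_id = (
        f"/subscriptions/{subscription_id}/providers/Microsoft.Authorization"
        "/roleDefinitions/ba92f5b4-2d11-453d-a403-e96b0029c9fe"
    )

    auth_client = AuthorizationManagementClient(DefaultAzureCredential(), subscription_id)
    auth_client.role_assignments.create(
        scope=scope,
        role_assignment_name=str(uuid.uuid4()),  # assignment names are new GUIDs
        parameters=RoleAssignmentCreateParameters(
            role_definition_id=role_definition_id,
            principal_id="<managed-identity-principal-id>",  # object ID of the identity
            principal_type="ServicePrincipal",
        ),
    )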

    Running this in a workflow job on job compute then returns the expected output.