azure-databricks, delta-lake, sensitive-data

Access files in Azure Data Lake sensitive storage from Databricks


I am accessing files in the normal storage with the following method:

import os

input_path = "my_path"
file_name = "file.mp3"
path = os.path.join(input_path, file_name)
full_path = '/dbfs/' + path

with open(full_path, mode='rb') as f:  # 'b' is important -> binary
    fileContent = f.read()

I am not able to use the same method in sensitive storage.

I am aware that sensitive storage has another way to access data:

path_sensitive_storage = 'mypath_sensitive'

If I use Spark it works perfectly, but I am interested in using open rather than spark.read:

input_df = (spark.read
            .format("binaryFile")
            .load(full_path)
            )

Is there a way to do that?


Solution

  • Since you are using Azure Data Lake as the source, you need to mount the container in Databricks DBFS using the OAuth method. Once the container is mounted, you can use it.

    Use the code below to mount the container.

    configs = {"fs.azure.account.auth.type": "OAuth",
               "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
               "fs.azure.account.oauth2.client.id": "<application-id>",
               "fs.azure.account.oauth2.client.secret": "<client-secret>",
               "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
               "fs.azure.createRemoteFileSystemDuringInitialization": "true"}

    dbutils.fs.mount(
        source = "abfss://sample11@utrolicstorage11.dfs.core.windows.net/",
        mount_point = "/mnt/sampledata11",
        extra_configs = configs)
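
    Rather than pasting the service principal credentials inline, the same configs dictionary can be assembled from values held in a Databricks secret scope (via dbutils.secrets.get). Below is a minimal sketch of a helper that builds that dictionary; the scope and key names in the commented usage are hypothetical, and dbutils is only available on a cluster:

    ```python
    def build_oauth_configs(client_id: str, client_secret: str, tenant_id: str) -> dict:
        """Build the extra_configs dict for an OAuth (service principal) mount."""
        return {
            "fs.azure.account.auth.type": "OAuth",
            "fs.azure.account.oauth.provider.type":
                "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
            "fs.azure.account.oauth2.client.id": client_id,
            "fs.azure.account.oauth2.client.secret": client_secret,
            "fs.azure.account.oauth2.client.endpoint":
                f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
            "fs.azure.createRemoteFileSystemDuringInitialization": "true",
        }

    # On Databricks, the values would come from a secret scope (names are hypothetical):
    # configs = build_oauth_configs(
    #     dbutils.secrets.get("my-scope", "sp-client-id"),
    #     dbutils.secrets.get("my-scope", "sp-client-secret"),
    #     dbutils.secrets.get("my-scope", "tenant-id"),
    # )
    ```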
    

    Once mounted, you can use the code below to list the files in the mounted location.

    dbutils.fs.ls("/mnt/sampledata11/")
    

    And finally, use a with open statement to read the file:

    with open("/dbfs/mnt/sampledata11/movies.csv", mode='rb') as file: # b is important -> binary
        fileContent = file.read()
        print(fileContent)
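
    Under the /dbfs FUSE mount the file behaves like any local file, so this is the same plain-Python binary-read pattern. A runnable sketch of that pattern against a temporary file (standing in for the /dbfs/mnt/sampledata11/movies.csv path, which only exists on a cluster):

    ```python
    import os
    import tempfile

    def read_binary(path: str) -> bytes:
        """Read a file's raw bytes, as done above for paths under /dbfs/mnt/."""
        with open(path, mode='rb') as f:  # 'b' -> binary mode
            return f.read()

    # Demonstrate with a temporary file in place of the mounted CSV
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "movies.csv")
        with open(sample, "wb") as f:
            f.write(b"title,year\nAlien,1979\n")
        print(read_binary(sample))
    ```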
    

    Check the image below for the complete implementation and outputs.
