python azure azure-storage azure-databricks azure-files

Load file from Azure Files to Azure Databricks

Looking for a way using Azure files SDK to upload files to my azure databricks blob storage

I tried many things using function from this page

But nothing worked. I don't understand why

example:

file_service = FileService(account_name='MYSECRETNAME', account_key='mySECRETkey')

generator = file_service.list_directories_and_files('MYSECRETNAME/test') #listing file in folder /test, working well
for file_or_dir in generator:
    print(file_or_dir.name)

file_service.get_file_to_path('MYSECRETNAME','test/tables/input/referentials/','test.xlsx','/dbfs/FileStore/test6.xlsx')

with test.xlsx = name of file in my azure file

/dbfs/FileStore/test6.xlsx => path where to upload the file in my dbfs system

I have the error message:

Exception=The specified resource name contains invalid characters

Tried to change the name but doesn't seem to work

edit: I'm not even sure the function is doing what I want. What is the best way to load file from azure files?

Solution

Per my experience, I think the best way to load file from Azure Files is directly to read a file via its url with sas token.

For example, as the figures below, it's a file named test.xlsx in my test file share, that I viewed it using Azure Storage Explorer, then to generate its url with sas token.

Fig 1. Right click the file and then click the Get Shared Access Signature...

Fig 2. Must select the option Read permission for directly reading the file content.

Fig 3. Copy the url with sas token

Here is my sample code, you can run it with the sas token url of your file in your Azure Databricks.

import pandas as pd

url_sas_token = 'https://<my account name>.file.core.windows.net/test/test.xlsx?st=2020-01-27T10%3A16%3A12Z&se=2020-01-28T10%3A16%3A12Z&sp=rl&sv=2018-03-28&sr=f&sig=XXXXXXXXXXXXXXXXX'
# Directly read the file content from its url with sas token to get a pandas dataframe
pdf = pd.read_excel(url_sas_token )
# Then, to convert the pandas dataframe to a PySpark dataframe in Azure Databricks
df = spark.createDataFrame(pdf)

Alternatively, to use Azure File Storage SDK to generate the url with sas token for your file or to get the bytes of your file for reading, please refer to the offical document Develop for Azure Files with Python and my sample code below.

# Create a client of Azure File Service as same as yours
from azure.storage.file import FileService

account_name = '<your account name>'
account_key = '<your account key>'
share_name = 'test'
directory_name = None
file_name = 'test.xlsx'

file_service = FileService(account_name=account_name, account_key=account_key)

To generate the sas token url of a file

from azure.storage.file import FilePermissions
from datetime import datetime, timedelta
sas_token = file_service.generate_file_shared_access_signature(share_name, directory_name, file_name, permission=FilePermissions.READ, expiry=datetime.utcnow() + timedelta(hours=1))

url_sas_token = f"https://{account_name}.file.core.windows.net/{share_name}/{file_name}?{sas_token}"
import pandas as pd
pdf = pd.read_excel(url_sas_token)
df = spark.createDataFrame(pdf)

Or using get_file_to_stream function to read the file content

from io import BytesIO
import pandas as pd

stream = BytesIO()
file_service.get_file_to_stream(share_name, directory_name, file_name, stream)
pdf = pd.read_excel(stream)
df = spark.createDataFrame(pdf)