python, azure, azure-storage, azure-databricks, azure-files

Load file from Azure Files to Azure Databricks


I'm looking for a way to use the Azure Files SDK to upload files to my Azure Databricks blob storage.

I tried many things using functions from this page, but nothing worked and I don't understand why.

Example:

from azure.storage.file import FileService

file_service = FileService(account_name='MYSECRETNAME', account_key='mySECRETkey')

# List the files in the /test folder - this works well
generator = file_service.list_directories_and_files('MYSECRETNAME/test')
for file_or_dir in generator:
    print(file_or_dir.name)

# This call fails with the error below
file_service.get_file_to_path('MYSECRETNAME','test/tables/input/referentials/','test.xlsx','/dbfs/FileStore/test6.xlsx')

where test.xlsx is the name of the file in my Azure file share, and /dbfs/FileStore/test6.xlsx is the path where the file should be written in my DBFS.

I get the error message:

Exception=The specified resource name contains invalid characters

I tried changing the name, but that doesn't seem to work.

Edit: I'm not even sure this function does what I want. What is the best way to load a file from Azure Files?


Solution

  • In my experience, the best way to load a file from Azure Files is to read it directly via its URL with a SAS token.

    For example, as the figures below show, there is a file named test.xlsx in my test file share. I viewed it using Azure Storage Explorer, then generated its URL with a SAS token.

    Fig 1. Right-click the file, then click Get Shared Access Signature...

    Fig 2. Select the Read permission, which is required for directly reading the file content.

    Fig 3. Copy the URL with the SAS token.

    Here is my sample code; you can run it in Azure Databricks with the SAS token URL of your file.

    import pandas as pd

    url_sas_token = 'https://<my account name>.file.core.windows.net/test/test.xlsx?st=2020-01-27T10%3A16%3A12Z&se=2020-01-28T10%3A16%3A12Z&sp=rl&sv=2018-03-28&sr=f&sig=XXXXXXXXXXXXXXXXX'
    # Read the file content directly from its SAS token URL into a pandas dataframe
    pdf = pd.read_excel(url_sas_token)
    # Convert the pandas dataframe to a PySpark dataframe in Azure Databricks
    df = spark.createDataFrame(pdf)
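
    A quick note: pandas needs an Excel engine such as openpyxl installed on the cluster to parse .xlsx files. A minimal sketch, assuming openpyxl is available on your cluster and using a hypothetical table name test_xlsx:

    import pandas as pd

    # Explicitly pick the Excel engine; openpyxl parses .xlsx files
    pdf = pd.read_excel(url_sas_token, engine='openpyxl')
    df = spark.createDataFrame(pdf)
    # Optionally persist the dataframe as a table for later queries (hypothetical name)
    df.write.mode('overwrite').saveAsTable('test_xlsx')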
    

    Alternatively, to use the Azure File Storage SDK to generate the SAS token URL for your file, or to get the bytes of your file for reading, please refer to the official document Develop for Azure Files with Python and my sample code below.

    # Create an Azure File Service client, the same as yours
    from azure.storage.file import FileService

    account_name = '<your account name>'
    account_key = '<your account key>'
    share_name = 'test'
    directory_name = None
    file_name = 'test.xlsx'

    file_service = FileService(account_name=account_name, account_key=account_key)
    

    To generate the SAS token URL of a file:

    from azure.storage.file import FilePermissions
    from datetime import datetime, timedelta

    # Generate a read-only SAS token that expires in one hour
    sas_token = file_service.generate_file_shared_access_signature(
        share_name, directory_name, file_name,
        permission=FilePermissions.READ,
        expiry=datetime.utcnow() + timedelta(hours=1))

    url_sas_token = f"https://{account_name}.file.core.windows.net/{share_name}/{file_name}?{sas_token}"

    import pandas as pd
    pdf = pd.read_excel(url_sas_token)
    df = spark.createDataFrame(pdf)
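
    Note that if your file sits in a subdirectory of the share (i.e. directory_name is not None), the directory segment must also appear in the URL between the share name and the file name. A hedged variant, using a hypothetical nested directory:

    # Hypothetical example: the file lives in a nested directory of the share
    directory_name = 'tables/input/referentials'
    sas_token = file_service.generate_file_shared_access_signature(
        share_name, directory_name, file_name,
        permission=FilePermissions.READ,
        expiry=datetime.utcnow() + timedelta(hours=1))
    # The directory path goes between the share name and the file name
    url_sas_token = f"https://{account_name}.file.core.windows.net/{share_name}/{directory_name}/{file_name}?{sas_token}"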
    

    Or use the get_file_to_stream function to read the file content:

    from io import BytesIO
    import pandas as pd

    stream = BytesIO()
    file_service.get_file_to_stream(share_name, directory_name, file_name, stream)
    # Rewind the stream before handing it to pandas, otherwise it reads from the end
    stream.seek(0)
    pdf = pd.read_excel(stream)
    df = spark.createDataFrame(pdf)
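
    Finally, if you still want the file copied into DBFS as in your original attempt, get_file_to_path takes the share name, the directory path, and the file name as separate arguments. A sketch using the names from your question; my guess is that the trailing slash in your directory name (an empty path segment) triggers the "invalid characters" error:

    # Copy a file from Azure Files down to a DBFS path
    file_service.get_file_to_path(
        share_name='MYSECRETNAME',                        # share name only, no '/'
        directory_name='test/tables/input/referentials',  # nested directories, no trailing slash
        file_name='test.xlsx',
        file_path='/dbfs/FileStore/test6.xlsx')           # local DBFS target path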