azure, timeout, access-token, azure-synapse, azure-synapse-pipeline

Azure Synapse timeout - token expired


Backstory: I have 12 zip files in Gen2 storage, each around 300 MB. I am running a notebook within a pipeline job. It goes smoothly until extracting the 6th zip file using pd.read_csv(compression='zip'); around the 27 min 57 sec mark, the token expires.

So I opened the Synapse notebook and ran it interactively.

ClientAuthenticationError: Server failed to authenticate the request. Please refer to the information in the www-authenticate header.

ErrorCode: InvalidAuthenticationInfo. AuthenticationErrorDetail: Lifetime validation failed. The token is expired.

When I run the pipeline, it fails at around the same point.

I am using the bare minimum configuration, as my company doesn't have the budget for more nodes, etc. Microsoft suggests applying a retry-upon-failure. They also say Synapse can't handle token refreshes for non-user identities.
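
I assume the retry-upon-failure would look something like wrapping each file read and retrying when the auth error comes back (just an illustrative sketch; the paths and retry counts are made up):

    import time
    import pandas as pd
    from azure.core.exceptions import ClientAuthenticationError

    def read_with_retry(path, storage_options, retries=3, wait_seconds=30):
        # Retry the read if the storage token has expired mid-run
        for attempt in range(1, retries + 1):
            try:
                return pd.read_csv(path, compression='zip', storage_options=storage_options)
            except ClientAuthenticationError:
                if attempt == retries:
                    raise
                time.sleep(wait_seconds)  # wait, then try the read again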

Is there a workaround for this? Or are there screenshots that can assist me? Thanks!


Solution

  • In Synapse you have two options to authenticate: using a linked service and using storage options.

    1. Create a linked service to the ADLS Gen2 account and use it while reading the files.


    You can use a system-assigned or user-assigned managed identity for authentication.

    code:

    import pandas as pd

    # Read directly from ADLS Gen2; the linked service handles the authentication
    df = pd.read_csv(
        'abfs://<container_name>@<storage_acc_name>.dfs.core.windows.net/<path>/parse1_data_preview.csv',
        storage_options={'linked_service': '<linked_service_name>'}
    )
    df
    

    Output: (screenshot of the resulting dataframe)

    This doesn't expire; the linked service handles the authentication for you.
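
    Applied to your 12 zip files, a minimal sketch of the same approach (the container, folder and file names below are hypothetical placeholders) could look like this:

    import pandas as pd

    # Hypothetical layout - substitute your container, storage account, folder and linked service
    base = 'abfs://<container_name>@<storage_acc_name>.dfs.core.windows.net/<zip_folder>'
    zip_paths = [f'{base}/file_{i}.zip' for i in range(1, 13)]

    frames = []
    for path in zip_paths:
        # Each read is authenticated through the linked service, so no token expires mid-run
        frames.append(pd.read_csv(path, compression='zip',
                                  storage_options={'linked_service': '<linked_service_name>'}))

    df = pd.concat(frames, ignore_index=True)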

    2. With storage options, you can pass the below credentials.

    code:

    import pandas

    # Read the data file; pick one of the credential options below
    df = pandas.read_csv(
        'abfs://file_system_name@account_name.dfs.core.windows.net/file_path',
        storage_options={'account_key': 'account_key_value'}
    )

    ## or storage_options = {'sas_token': 'sas_token_value'}
    ## or storage_options = {'connection_string': 'connection_string_value'}
    ## or storage_options = {'tenant_id': 'tenant_id_value', 'client_id': 'client_id_value', 'client_secret': 'client_secret_value'}
    

    Here, I recommend using tenant_id, client_id, and client_secret (i.e., a service principal).
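
    Rather than hardcoding the client secret in the notebook, you can fetch it at runtime from Key Vault. A rough sketch, assuming you already have a Key Vault linked service; the names below are placeholders:

    import pandas
    from notebookutils import mssparkutils  # Synapse notebook utilities

    # Placeholder names - substitute your Key Vault, secret and linked service names
    client_secret = mssparkutils.credentials.getSecret(
        '<key_vault_name>', '<secret_name>', '<key_vault_linked_service>')

    df = pandas.read_csv(
        'abfs://file_system_name@account_name.dfs.core.windows.net/file_path',
        storage_options={'tenant_id': '<tenant_id_value>',
                         'client_id': '<client_id_value>',
                         'client_secret': client_secret}
    )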

    Refer to the below documentation for more information.

    Tutorial: Use Pandas to read/write ADLS data in serverless Apache Spark pool in Synapse Analytics - Azure Synapse Analytics | Microsoft Learn

    Tutorial: Use FSSPEC to read/write ADLS data in serverless Apache Spark pool in Synapse Analytics - Azure Synapse Analytics | Microsoft Learn

    If you still want to use tokens, you need to keep checking their expiry.

    Below is the logic; alter it according to your requirements.

    import time
    import pandas as pd
    from azure.identity import DefaultAzureCredential

    credential = DefaultAzureCredential()  # or whichever credential fits your identity
    token_expiry_buffer = 300  # refresh 5 minutes before expiry
    token_expiry_time = 0

    def get_token():
        global token_expiry_time
        # Request a fresh token for Azure Storage and record its expiry
        token = credential.get_token("https://storage.azure.com/.default")
        token_expiry_time = token.expires_on
        return token

    def is_token_expired():
        current_time = time.time()
        return current_time >= (token_expiry_time - token_expiry_buffer)

    paths = ["zip1path", "zip2path", "zip3path"]  # ... your 12 zip file paths

    for path in paths:
        if is_token_expired():
            # Refresh the token before reading the next file
            token = get_token()
        # Use the path variable (not a string literal) and wire the token into
        # whatever client or storage_options your read actually uses
        df = pd.read_csv(path)