azure, timeout, access-token, azure-synapse, azure-synapse-pipeline

Azure Synapse timeout - token expired


Backstory: I have 12 zip files in Gen2 storage, each around 300 MB. I am running a notebook within a pipeline job. It goes smoothly until extracting the 6th zip file using pd.read_csv(compression='zip'); around the 27 min 57 sec mark, the token expires.

So I opened the Synapse notebook and ran it interactively.

ClientAuthenticationError: Server failed to authenticate the request. Please refer to the information in the www-authenticate header.

ErrorCode: InvalidAuthenticationInfo. AuthenticationErrorDetail: Lifetime validation failed. The token is expired.

When I run the pipeline, it fails at around the same point.

I am using the bare minimum configuration, as my company doesn't have the budget for more nodes, etc. Microsoft suggests applying a retry-upon-failure. They also say Synapse can't handle token refreshes for non-user identities.
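
I assume the retry-upon-failure would look something like wrapping each file read and retrying when the auth error comes back (just an illustrative sketch; the paths and retry counts are made up):

    import time
    import pandas as pd
    from azure.core.exceptions import ClientAuthenticationError

    def read_with_retry(path, storage_options, retries=3, wait_seconds=30):
        # Retry the read if the storage token has expired mid-run
        for attempt in range(1, retries + 1):
            try:
                return pd.read_csv(path, compression='zip', storage_options=storage_options)
            except ClientAuthenticationError:
                if attempt == retries:
                    raise
                time.sleep(wait_seconds)  # wait, then try the read again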

Is there a workaround for this? Or are there screenshots that can assist me? Thanks!


Solution

  • In Synapse you have two options to authenticate: using a linked service and using storage options.

    1. Create a linked service to the ADLS Gen2 account and use it while reading the files.


    You can use a system-assigned or user-assigned managed identity for authentication.

    code:

    import pandas as pd

    # Read directly from ADLS Gen2; the linked service handles the authentication
    df = pd.read_csv(
        'abfs://<container_name>@<storage_acc_name>.dfs.core.windows.net/<path>/parse1_data_preview.csv',
        storage_options={'linked_service': '<linked_service_name>'}
    )
    df
    

    Output: (screenshot of the resulting dataframe)

    This doesn't expire; the linked service handles the authentication for you.
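
    Applied to your 12 zip files, a minimal sketch of the same approach (the container, folder and file names below are hypothetical placeholders) could look like this:

    import pandas as pd

    # Hypothetical layout - substitute your container, storage account, folder and linked service
    base = 'abfs://<container_name>@<storage_acc_name>.dfs.core.windows.net/<zip_folder>'
    zip_paths = [f'{base}/file_{i}.zip' for i in range(1, 13)]

    frames = []
    for path in zip_paths:
        # Each read is authenticated through the linked service, so no token expires mid-run
        frames.append(pd.read_csv(path, compression='zip',
                                  storage_options={'linked_service': '<linked_service_name>'}))

    df = pd.concat(frames, ignore_index=True)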

    2. With storage options, you can pass the below credentials.

    code:

    import pandas

    # Read the data file; pick one of the credential options below
    df = pandas.read_csv(
        'abfs://file_system_name@account_name.dfs.core.windows.net/file_path',
        storage_options={'account_key': 'account_key_value'}
    )

    ## or storage_options = {'sas_token': 'sas_token_value'}
    ## or storage_options = {'connection_string': 'connection_string_value'}
    ## or storage_options = {'tenant_id': 'tenant_id_value', 'client_id': 'client_id_value', 'client_secret': 'client_secret_value'}
    

    Here, I recommend using tenant_id, client_id, and client_secret (i.e., a service principal).
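
    Rather than hardcoding the client secret in the notebook, you can fetch it at runtime from Key Vault. A rough sketch, assuming you already have a Key Vault linked service; the names below are placeholders:

    import pandas
    from notebookutils import mssparkutils  # Synapse notebook utilities

    # Placeholder names - substitute your Key Vault, secret and linked service names
    client_secret = mssparkutils.credentials.getSecret(
        '<key_vault_name>', '<secret_name>', '<key_vault_linked_service>')

    df = pandas.read_csv(
        'abfs://file_system_name@account_name.dfs.core.windows.net/file_path',
        storage_options={'tenant_id': '<tenant_id_value>',
                         'client_id': '<client_id_value>',
                         'client_secret': client_secret}
    )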

    Refer to the below documentation for more information.

    Tutorial: Use Pandas to read/write ADLS data in serverless Apache Spark pool in Synapse Analytics - Azure Synapse Analytics | Microsoft Learn

    Tutorial: Use FSSPEC to read/write ADLS data in serverless Apache Spark pool in Synapse Analytics - Azure Synapse Analytics | Microsoft Learn

    If you still want to use tokens, you need to keep checking their expiry.

    Below is the logic; alter it according to your requirements.

    import time
    import pandas as pd
    from azure.identity import DefaultAzureCredential

    credential = DefaultAzureCredential()  # or whichever credential fits your identity
    token_expiry_buffer = 300  # refresh 5 minutes before expiry
    token_expiry_time = 0

    def get_token():
        global token_expiry_time
        # Request a fresh token for Azure Storage and record its expiry
        token = credential.get_token("https://storage.azure.com/.default")
        token_expiry_time = token.expires_on
        return token

    def is_token_expired():
        current_time = time.time()
        return current_time >= (token_expiry_time - token_expiry_buffer)

    paths = ["zip1path", "zip2path", "zip3path"]  # ... your 12 zip file paths

    for path in paths:
        if is_token_expired():
            # Refresh the token before reading the next file
            token = get_token()
        # Use the path variable (not a string literal) and wire the token into
        # whatever client or storage_options your read actually uses
        df = pd.read_csv(path)