Backstory: I have 12 zip files in Gen2 storage, each around 300 MB. I am running a notebook within a pipeline job. It goes smoothly until extracting the 6th zip file using pd.read_csv(compression='zip'); around the 27 min 57 sec mark, the token expires.
So I opened and ran the Synapse notebook interactively, and got:
ClientAuthenticationError: Server failed to authenticate the request. Please refer to the information in the www-authenticate header.
ErrorCode: InvalidAuthenticationInfo
AuthenticationErrorDetail: Lifetime validation failed. The token is expired.
When I run it as a pipeline, it dies at about the same position.
I am using the bare minimum configuration, as my company doesn't have the budget for more nodes, etc. Microsoft suggests applying retry-upon-failure, and also mentions Synapse's inability to handle token refreshes for non-user identities.
Is there a workaround for this? Or are there screencaps that could assist me? Thanks!
In Synapse you have the below options to authenticate:
using a linked service, and using storage options.
You can use a system-assigned or user-assigned managed identity for authentication.
code:
import pandas as pd

df = pd.read_csv(
    'abfs://<container_name>@<storage_acc_name>.dfs.core.windows.net/<path>/parse1_data_preview.csv',
    storage_options={'linked_service': '<linked_service_name>'}
)
df
This doesn't expire; the linked service handles the token refresh for you.
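Applied to your scenario, a minimal sketch reading all 12 zips through the linked service (the container, folder, file, and linked-service names are placeholders, and it assumes each zip contains a single CSV, which pandas' compression='zip' requires):
code:
import pandas as pd

# Placeholder names; substitute your container, account, folder, and linked service
base = 'abfs://<container_name>@<storage_acc_name>.dfs.core.windows.net/<zip_folder>'
zip_paths = [f'{base}/file{i}.zip' for i in range(1, 13)]

frames = []
for path in zip_paths:
    # compression='zip' decompresses each archive on the fly; every read
    # authenticates through the linked service, so nothing expires mid-run
    frames.append(pd.read_csv(path, compression='zip',
                              storage_options={'linked_service': '<linked_service_name>'}))
df = pd.concat(frames, ignore_index=True)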
code:
import pandas as pd

# read the data file
df = pd.read_csv(
    'abfs://file_system_name@account_name.dfs.core.windows.net/file_path',
    storage_options={'account_key': 'account_key_value'}
)
## or storage_options = {'sas_token': 'sas_token_value'}
## or storage_options = {'connection_string': 'connection_string_value'}
## or storage_options = {'tenant_id': 'tenant_id_value', 'client_id': 'client_id_value', 'client_secret': 'client_secret_value'}
Here, I recommend using tenant_id, client_id, and client_secret (service principal authentication).
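To avoid hard-coding the secret in the notebook, you can fetch it from an Azure Key Vault at run time; a minimal sketch, assuming mssparkutils is available in your Synapse notebook and that the vault and secret names (placeholders here) exist:
code:
import pandas as pd
from notebookutils import mssparkutils

# Placeholder Key Vault and secret names; the service principal secret is
# fetched at run time instead of being pasted into the notebook
client_secret = mssparkutils.credentials.getSecret('<keyvault_name>', '<secret_name>')

df = pd.read_csv(
    'abfs://file_system_name@account_name.dfs.core.windows.net/file_path',
    storage_options={
        'tenant_id': 'tenant_id_value',
        'client_id': 'client_id_value',
        'client_secret': client_secret,
    }
)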
Refer to the documentation for more information.
If you still want to use the tokens, you need to keep checking the expiry. Below is the logic; alter it according to your requirements.
code:
import time
import pandas as pd
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
token_expiry_buffer = 300  # refresh 5 minutes before the real expiry
token_expiry_time = 0

def get_token():
    global token_expiry_time
    token = credential.get_token("https://storage.azure.com/.default")
    token_expiry_time = token.expires_on
    return token

def is_token_expired():
    current_time = time.time()
    return current_time >= (token_expiry_time - token_expiry_buffer)

paths = ["zip1path", "zip2path", "zip3path"]  # ... the rest of your zip paths

for path in paths:
    if is_token_expired():  # note the (): a bare `is_token_expired` is always truthy
        # refresh the token before reading the next file
        get_token()
    # read the path variable, not the string "path"; passing the live credential
    # through storage_options lets adlfs authenticate with it
    df = pd.read_csv(path, compression='zip',
                     storage_options={'credential': credential})
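And since Microsoft's guidance was retry-upon-failure, you can also wrap the read itself so an expired token triggers a fresh attempt instead of killing the pipeline. A minimal sketch: read_csv_with_retry is a hypothetical helper (not a library function), the attempt count and backoff are arbitrary choices, and it assumes the auth failure surfaces as azure.core's ClientAuthenticationError in your environment:
code:
import time
import pandas as pd
from azure.core.exceptions import ClientAuthenticationError

def read_csv_with_retry(path, storage_options, max_attempts=3, backoff_seconds=10):
    # On an auth failure, wait and retry so the next attempt runs with a
    # freshly acquired token rather than the expired one
    for attempt in range(1, max_attempts + 1):
        try:
            return pd.read_csv(path, compression='zip',
                               storage_options=storage_options)
        except ClientAuthenticationError:
            if attempt == max_attempts:
                raise
            time.sleep(backoff_seconds)

# usage, e.g. with the linked-service option from above
df = read_csv_with_retry("zip1path", {'linked_service': '<linked_service_name>'})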