amazon-web-services amazon-athena amazon-sagemaker aws-data-wrangler

How to use an Athena VPC endpoint to query data from a SageMaker preprocessing job in isolated network mode


I wrote a SageMaker processing job that runs in isolated network mode. It runs an Athena SQL query and reads the result into a dataframe.

But it throws this error: "botocore.exceptions.NoCredentialsError: Unable to locate credentials"

Can anyone point me to how to use the VPC endpoint for Athena queries in Python? I looked at the documentation below, but I could not find anywhere to specify the endpoint URL: https://aws-sdk-pandas.readthedocs.io/en/stable/stubs/awswrangler.athena.read_sql_query.html

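import awswrangler as wr
import boto3

session = boto3.Session()  # assumed here; the original snippet did not show how `session` was created
recent_data_sql = "select * from db.dummy_table"
recent_data_df = wr.athena.read_sql_query(
    sql=recent_data_sql, database="default", ctas_approach=False, workgroup="wg",
    boto3_session=session,
)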

Solution

  • If you set EnableNetworkIsolation to true when you created the Processing Job, your script/container cannot make any network calls. The code above therefore cannot run in your script, because it tries to make a network call to the Athena endpoint. That said, I would have expected a timeout error rather than a credentials error. Make sure you are not bundling any credentials into your container: the Processing Job has an IAM execution role attached, and that role should have the Athena permissions your code needs.

    If you want to hit a VPC endpoint (PrivateLink) for Athena, you can launch the Processing Job with a VPC config that specifies subnets with a route to your Athena VPC endpoint. The Processing Job must also have EnableNetworkIsolation set to false.

    Note: After you create an interface VPC endpoint, if you enable private DNS hostnames for the endpoint, the default Athena endpoint (https://athena.Region.amazonaws.com) resolves to your VPC endpoint. https://docs.aws.amazon.com/athena/latest/ug/interface-vpc-endpoint.html
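
As a rough illustration of the VPC setup described above, here is a minimal sketch of the NetworkConfig fragment you would pass when creating the Processing Job through boto3's create_processing_job. The helper names (build_network_config, default_athena_endpoint) and the subnet and security group IDs are hypothetical placeholders, not values from the question.

```python
from typing import Any, Dict, List


def build_network_config(subnet_ids: List[str], security_group_ids: List[str]) -> Dict[str, Any]:
    """Build the NetworkConfig fragment of a CreateProcessingJob request.

    Network isolation is turned off so the container can reach the
    Athena VPC endpoint through the given subnets.
    """
    if not subnet_ids or not security_group_ids:
        raise ValueError("at least one subnet and one security group are required")
    return {
        "EnableNetworkIsolation": False,
        "VpcConfig": {
            "Subnets": list(subnet_ids),
            "SecurityGroupIds": list(security_group_ids),
        },
    }


def default_athena_endpoint(region: str) -> str:
    """Return the default regional Athena endpoint URL.

    With private DNS enabled on the interface endpoint, this hostname
    resolves to the VPC endpoint, so no custom URL is needed in code.
    """
    return f"https://athena.{region}.amazonaws.com"


if __name__ == "__main__":
    # Placeholder IDs; replace with subnets that have a route to the Athena VPC endpoint.
    config = build_network_config(["subnet-0123456789abcdef0"], ["sg-0123456789abcdef0"])
    print(config)
    print(default_athena_endpoint("us-east-1"))
```

The resulting dict would go in as NetworkConfig=config on the create_processing_job call. If private DNS is not enabled on the endpoint, one option is to pass the endpoint's own DNS name explicitly, for example with boto3.client("athena", endpoint_url=...). I believe awswrangler also exposes a wr.config.athena_endpoint_url setting for this, but check that against the version you have installed.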