amazon-web-services aws-lambda amazon-vpc sts-securitytokenservice vpc-endpoint

Lambda function fails intermittently: connection to sts.amazonaws.com times out


I have a Lambda function running in a VPC. I use it to query Elasticsearch, update data there, and delete obsolete data. To make these calls, the function has to assume a role, so it calls the STS AssumeRole API. Recently I have been seeing intermittent timeouts whenever it tries to fetch credentials. The code is:

final AWSSecurityTokenService stsClient = AWSSecurityTokenServiceClientBuilder.standard()
    .withCredentials(new EnvironmentVariableCredentialsProvider())
    .build();

final STSAssumeRoleSessionCredentialsProvider credentials =
    new STSAssumeRoleSessionCredentialsProvider.Builder(System.getenv(SIM_ROLE_KEY), SIM_SESSION_NAME)
        .withStsClient(stsClient)
        .build();

final String sessionToken = credentials.getCredentials().getSessionToken();

Exact error:

Unable to execute HTTP request: Connect to sts.amazonaws.com:443 [sts.amazonaws.com/209.54.180.124] failed: connect timed out: com.amazonaws.SdkClientException
com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to sts.amazonaws.com:443 [sts.amazonaws.com/209.54.180.124] failed: connect timed out

I want to know what could be causing these intermittent failures and how to fix them. I would also like to know whether intermittent timeouts are a common issue for STS calls.

Things I tried:

1) Instead of the global endpoint sts.amazonaws.com, I configured the regional endpoint sts.us-east-1.amazonaws.com, because the Lambda function runs in us-east-1. We still saw the same error.

2) The VPC did not have a VPC endpoint for STS, so I created one. Now the timeout error no longer occurs, but I am not sure this is the intended fix. If the missing endpoint were the cause, I would expect STS calls to time out every time. Without a VPC endpoint, how was the function able to connect to sts.amazonaws.com most of the time?

I can provide more information if needed.

More info: the Lambda function is attached to three subnets, two private and one public. The route tables for the subnets are:

VPCStack Private Route Table 1:
Destination    Target
10.0.0.0/16    local
0.0.0.0/0      nat-####1
pl-63a5400a    vpce-####3

VPCStack Private Route Table 2:
Destination    Target
10.0.0.0/16    local
0.0.0.0/0      nat-####2
pl-63a5400a    vpce-####4

VPCStack Public Route Table:
Destination    Target
10.0.0.0/16    local
0.0.0.0/0      igw-####5
pl-63a5400a    vpce-####

Thanks.


Solution

  • When you configure a Lambda function for VPC access, connect it to private subnets only.

    The intermittent connectivity to STS was caused by attaching the Lambda function to both private and public subnets:

    1. A Lambda function attached to a public subnet cannot reach the internet, because its network interface has only a private IP address and the internet gateway does not perform NAT for it.
    2. For the same reason, a Lambda function attached to a public subnet cannot reach AWS services unless you have configured a VPC endpoint for that service.

    Invocations that ran through one of your private subnets reached STS via the NAT gateway, while invocations that ran through the public subnet timed out, which is why the failures were intermittent. When you introduced the VPC endpoint, all traffic destined for STS was routed through it and no longer depended on a route via your NAT, so calls from every subnet now succeed.
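
To make the reasoning concrete, here is a toy model of the three subnets from the question. It does not call AWS at all. The class, method, and subnet names are made up for illustration, and it assumes the STS VPC endpoint is reachable from every subnet in the VPC.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of the routing described above; all names are hypothetical. */
final class StsReachabilityModel {

    /** Where a subnet's default route (0.0.0.0/0) points. */
    enum DefaultRoute { NAT_GATEWAY, INTERNET_GATEWAY }

    /** A subnet the Lambda function is attached to. */
    static final class Subnet {
        final String name;
        final DefaultRoute defaultRoute;

        Subnet(String name, DefaultRoute defaultRoute) {
            this.name = name;
            this.defaultRoute = defaultRoute;
        }
    }

    /**
     * A Lambda network interface in a VPC has a private IP address only.
     * A NAT gateway gives its traffic a public source address; an internet
     * gateway does not, so the connection attempt times out. A VPC endpoint
     * for STS (assumed reachable from every subnet) bypasses the default route.
     */
    static boolean canReachSts(Subnet subnet, boolean hasStsVpcEndpoint) {
        if (hasStsVpcEndpoint) {
            return true;
        }
        return subnet.defaultRoute == DefaultRoute.NAT_GATEWAY;
    }

    /** Counts the subnets from which STS is reachable. */
    static long countReachable(List<Subnet> subnets, boolean hasStsVpcEndpoint) {
        long reachable = 0;
        for (Subnet subnet : subnets) {
            if (canReachSts(subnet, hasStsVpcEndpoint)) {
                reachable++;
            }
        }
        return reachable;
    }

    /** The layout from the question: two private subnets and one public subnet. */
    static List<Subnet> questionSubnets() {
        return Arrays.asList(
            new Subnet("private-1", DefaultRoute.NAT_GATEWAY),
            new Subnet("private-2", DefaultRoute.NAT_GATEWAY),
            new Subnet("public", DefaultRoute.INTERNET_GATEWAY));
    }

    public static void main(String[] args) {
        List<Subnet> subnets = questionSubnets();
        for (boolean endpoint : new boolean[] {false, true}) {
            System.out.println("STS VPC endpoint present: " + endpoint);
            for (Subnet subnet : subnets) {
                String result = canReachSts(subnet, endpoint) ? "reachable" : "connect timed out";
                System.out.println("  " + subnet.name + " -> " + result);
            }
        }
    }
}
```

Without the endpoint, only the two NAT-routed subnets can reach STS, which matches the "works most of the time" behaviour you saw. Either removing the public subnet from the function's VPC configuration or keeping the VPC endpoint makes every subnet reachable in this model.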