
Connection to SQS from EC2 is lost from time to time


Once a month on average, my Java applications running on AWS EC2 machines lose their connection to AWS SQS:

Received an UnknownHostException when attempting to interact with a service. See cause for the exact endpoint that is failing to resolve. If this is happening on an endpoint that previously worked, there may be a network connectivity issue or your DNS cache could be storing endpoints for too long.

The root cause is: java.net.UnknownHostException: sqs.eu-west-3.amazonaws.com

I've checked my Java DNS cache configuration; it is the default one of the amazoncorretto:17-alpine Docker image.

Here is how I configure my SqsClient, using the AWS SDK v2:

SqsClient sqsClient = SqsClient.builder()
                    .region(Region.EU_WEST_3)
                    .credentialsProvider(InstanceProfileCredentialsProvider.create())
                    .build();

And here is how I consume messages:

ReceiveMessageRequest receiveMessageRequest = ReceiveMessageRequest.builder()
                    .queueUrl(queueUrl)
                    .maxNumberOfMessages(1)
                    .visibilityTimeout(30)
                    .build();

sqsClient.receiveMessage(receiveMessageRequest)
                .messages()
                .forEach(message -> /*some processing*/);

Since I use the default SqsClient configuration, I am using software.amazon.awssdk.core.retry.RetryMode.STANDARD, which retries twice with an exponential back-off starting at 100ms. All retries therefore happen within less than a second, which is shorter than my networkaddress.cache.negative.ttl setting.

  1. Should I simply increase the number of retries?
  2. I have multiple applications connecting to an eu-west-3 SQS queue, all configured in more or less the same way. Shouldn't they all throw the UnknownHostException at the same moment?

Solution

  • After posting this question last year, I implemented the solution we discussed in the comments and have not encountered the problem since. Here is an analysis.

    How do we connect to SQS?

    Typically, we poll an SQS queue through a URL such as https://sqs.eu-west-3.amazonaws.com/123456789/your-sqs-queue-name

    The thing to keep in mind is that the URL contains a host name, so we rely on DNS resolution to translate sqs.eu-west-3.amazonaws.com into something like 15.236.231.119

    How does Java handle DNS resolution?

    When a Java application asks for a DNS resolution, it caches the result through the InetAddress class:

    The InetAddress class has a cache to store successful as well as unsuccessful host name resolutions.

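    The TTLs of these two caches are configured in the file $JAVA_HOME/conf/security/java.security (Java 9 and later; $JAVA_HOME/jre/lib/security/java.security on Java 8) through two properties: networkaddress.cache.ttl for successful lookups and networkaddress.cache.negative.ttl for failed ones.

    The amazoncorretto:17-alpine Docker image keeps the default values, with failed lookups cached for 10s, and I believe the images for other Java versions do the same.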

    What happens then?

    For some reason, such as maintenance, the IP address behind sqs.eu-west-3.amazonaws.com might change from time to time.

    Since InetAddress still has the former address cached, we end up with an UnknownHostException. This failure is itself cached for 10s, so if we try to reach sqs.eu-west-3.amazonaws.com again within the next 10s, the exception is thrown immediately without a new DNS lookup.
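
To check what a given JVM is actually configured with, the two TTL properties can be read at runtime through java.security.Security. The following is a minimal sketch using only the JDK; the class name DnsCacheProbe and its helper methods are made up for illustration. A null property value means the property is unset and the JVM falls back to its built-in default.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.security.Security;
import java.util.Optional;

// Hypothetical helper, not part of the AWS SDK.
final class DnsCacheProbe {

    // Formats a raw property value; null means the property is not set.
    public static String describe(String value) {
        return value == null ? "unset (JVM default applies)" : value + " s";
    }

    // Reads a DNS cache TTL from the security properties (java.security file).
    public static String describeTtl(String propertyName) {
        return describe(Security.getProperty(propertyName));
    }

    // Resolves a host name; an empty Optional means the lookup failed
    // (either a fresh failure or one served from the negative cache).
    public static Optional<InetAddress> resolve(String host) {
        try {
            return Optional.of(InetAddress.getByName(host));
        } catch (UnknownHostException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println("networkaddress.cache.ttl          = "
                + describeTtl("networkaddress.cache.ttl"));
        System.out.println("networkaddress.cache.negative.ttl = "
                + describeTtl("networkaddress.cache.negative.ttl"));

        String host = args.length > 0 ? args[0] : "sqs.eu-west-3.amazonaws.com";
        String result = resolve(host)
                .map(InetAddress::getHostAddress)
                .orElse("UnknownHostException");
        System.out.println(host + " -> " + result);
    }
}
```

Running this inside the same container as the application shows the values the application really sees, which may differ from what the image documentation suggests if the java.security file or the properties have been overridden.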

    How does the AWS SDK handle this?

    Poorly.

    When the SQS client fails to contact the queue, it will retry by default twice, so three attempts in total.

    A default value of 2 for maximum retry attempts, making a total of 3 call attempts. This value can be overwritten through the max_attempts configuration parameter.

    The default retry back-off strategy, FullJitterBackoffStrategy, works as follows:

    1. Base delay is 100ms
    2. Exponentially increase the delay
    3. Randomly select a value between 0 and the exponential delay calculated

    Meaning if we are on the second retry:

    1. Base delay is 100ms
    2. Second retry: 100 * 2^2 = 400
    3. Back-off = rand(0, 400)

    So, in the best case (the longest possible back-offs), we wait 200 + 400 = 600ms in total. Since the failed DNS lookup is cached for 10s, every retry hits the cached failure and we end up throwing an error.
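
The arithmetic above can be reproduced with a few lines of plain Java. This is only a sketch of the calculation as described in this answer (delay bound = base * 2^retry, with retries numbered from 1), not the SDK's actual implementation, whose exponent convention may differ; the class and method names are made up for illustration.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper mirroring the full-jitter arithmetic of this answer.
final class FullJitterEstimate {

    // Upper bound of the delay before the given retry (1-based): base * 2^retry.
    public static long maxDelayMillis(long baseDelayMillis, int retry) {
        return baseDelayMillis * (1L << retry);
    }

    // One sampled full-jitter delay: uniform in [0, maxDelay].
    public static long sampleDelayMillis(long baseDelayMillis, int retry) {
        long bound = maxDelayMillis(baseDelayMillis, retry);
        return ThreadLocalRandom.current().nextLong(bound + 1);
    }

    // Longest possible total wait when every retry draws its maximum delay.
    public static long maxTotalWaitMillis(long baseDelayMillis, int retries) {
        long total = 0;
        for (int retry = 1; retry <= retries; retry++) {
            total += maxDelayMillis(baseDelayMillis, retry);
        }
        return total;
    }

    public static void main(String[] args) {
        long negativeCacheMillis = 10_000; // 10s negative DNS cache
        long total = maxTotalWaitMillis(100, 2); // 200 + 400
        System.out.println("Longest total back-off: " + total + " ms"); // prints 600 ms
        System.out.println("Outlasts negative cache: " + (total >= negativeCacheMillis)); // false
    }
}
```

Even with the most favourable random draws, two retries finish well inside the 10s window during which the failed lookup is cached.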

    What to do then?

    Increase the number of retries, so that we keep retrying for at least 10s

    Use the EqualJitterBackoffStrategy (we guarantee that each back-off is greater than the previous one)

    BackoffStrategy backoffStrategy = EqualJitterBackoffStrategy.builder()
            .baseDelay(Duration.ofMillis(100))
            .maxBackoffTime(Duration.ofSeconds(20))
            .build();
    
    SqsClientBuilder sqsClientBuilder = SqsClient.builder()
            .overrideConfiguration(configuration ->
                    configuration.retryPolicy(retryPolicy ->
                            retryPolicy.backoffStrategy(backoffStrategy)
                                    .numRetries(10)
                    )
            )
            .region(region);
    

    Bonus: what is EqualJitterBackoffStrategy?

    1. We have a base delay
    2. We compute the exponential delay
    3. We divide this delay by 2
    4. We compute a random delay between 0 and the divided delay + 1.

    Back-off = divided delay + random delay

    Example:

    1. Base delay = 100ms
    2. Say we are at retry = 3 : exponential delay = 100 * 2^3 = 800
    3. Divided delay: 800 / 2 = 400
    4. Random delay: rand(0, 400 + 1) = 256

    Back-off = 400 + 256 = 656ms.
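
To see why 10 retries with a 20s cap are enough to outlast the 10s negative cache, the steps above can be sketched in plain Java. This follows the arithmetic of this answer (exponential delay = base * 2^retry, retries numbered from 1) and is an assumption-laden illustration rather than the SDK's own code; the class and method names are hypothetical.

```java
import java.util.Random;

// Hypothetical helper mirroring the equal-jitter arithmetic of this answer.
final class EqualJitterEstimate {

    // Exponential delay base * 2^retry, capped at the maximum back-off time.
    public static long cappedExponentialMillis(long baseMillis, int retry, long maxMillis) {
        return Math.min(baseMillis * (1L << retry), maxMillis);
    }

    // Equal jitter: half of the delay is fixed, the rest is drawn from [0, half].
    public static long equalJitterMillis(long baseMillis, int retry, long maxMillis, Random random) {
        long half = cappedExponentialMillis(baseMillis, retry, maxMillis) / 2;
        long jitter = (long) (random.nextDouble() * (half + 1)); // 0..half inclusive
        return half + jitter;
    }

    // Shortest possible total wait: the random part is 0 on every retry.
    public static long minTotalWaitMillis(long baseMillis, int retries, long maxMillis) {
        long total = 0;
        for (int retry = 1; retry <= retries; retry++) {
            total += cappedExponentialMillis(baseMillis, retry, maxMillis) / 2;
        }
        return total;
    }

    public static void main(String[] args) {
        Random random = new Random();
        // Third retry with a 100ms base: always between 400 and 800 ms.
        System.out.println("Sample back-off at retry 3: "
                + equalJitterMillis(100, 3, 20_000, random) + " ms");
        // Settings from the client configuration above: 10 retries, 20s cap.
        System.out.println("Shortest total wait over 10 retries: "
                + minTotalWaitMillis(100, 10, 20_000) + " ms"); // prints 42700 ms
    }
}
```

Under these assumptions, even the unluckiest sequence of draws keeps retrying for far longer than 10s, so at least one attempt happens after the cached failure has expired.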