Once a month on average, my Java applications running on AWS EC2 machines lose their connection to AWS SQS with the following error:
Received an UnknownHostException when attempting to interact with a service. See cause for the exact endpoint that is failing to resolve. If this is happening on an endpoint that previously worked, there may be a network connectivity issue or your DNS cache could be storing endpoints for too long.
With the root cause:
java.net.UnknownHostException: sqs.eu-west-3.amazonaws.com
I've checked my Java DNS cache configuration, which is the default one of the amazoncorretto:17-alpine Docker image:
networkaddress.cache.ttl=30
networkaddress.cache.negative.ttl=10
Here is how I configure my SqsClient, using the AWS SDKv2:
SqsClient sqsClient = SqsClient.builder()
    .region(Region.EU_WEST_3)
    .credentialsProvider(InstanceProfileCredentialsProvider.create())
    .build();
And how I consume messages:
ReceiveMessageRequest receiveMessageRequest = ReceiveMessageRequest.builder()
    .queueUrl(queueUrl)
    .maxNumberOfMessages(1)
    .visibilityTimeout(30)
    .build();
sqsClient.receiveMessage(receiveMessageRequest)
    .messages()
    .forEach(message -> /*some processing*/);
As I use the default configuration of the SqsClient, I am using software.amazon.awssdk.core.retry.RetryMode.STANDARD, which retries twice with an exponential back-off starting at 100ms. This means that I retry for less than a second, which is below my networkaddress.cache.negative.ttl configuration.
eu-west-3 SQS queue, that are configured more or less in the same way. Shouldn't they all throw the UnknownHostException at the same moment?

After this post last year, I implemented the solution we talked about in the comments and I have not encountered the problem since. Here is an analysis.
Typically, we poll an SQS queue through a URL such as https://sqs.eu-west-3.amazonaws.com/123456789/your-sqs-queue-name
The thing to keep in mind is that the URL contains a host name, hence we must rely on DNS resolution to translate sqs.eu-west-3.amazonaws.com into something like 15.236.231.119
When a Java application asks for a DNS resolution, it caches the result through the InetAddress class:
The InetAddress class has a cache to store successful as well as unsuccessful host name resolutions.
The TTLs of these two caches are configured in the file $JAVA_HOME/conf/security/java.security (Java 9 and later; $JAVA_HOME/jre/lib/security/java.security on a Java 8 JDK) through these two properties:
- networkaddress.cache.ttl = 30s by default, as no security manager is set.
- networkaddress.cache.negative.ttl = 10s by default, keeping the "unsuccessful host name resolutions".
These are the default settings of the amazoncorretto:17-alpine Docker image and, I believe, of the images for other Java versions.
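To check which values a given JVM actually uses, the two properties can be read at runtime. This is a minimal sketch: DnsCacheSettings and readOrDefault are hypothetical names, while java.security.Security.getProperty is the standard JDK call. A property that is commented out in java.security comes back as null, in which case the JVM applies its built-in default.

```java
import java.security.Security;

// Hypothetical helper class, for illustration only.
final class DnsCacheSettings {

    private DnsCacheSettings() {}

    // Returns the value of a security property as configured in java.security
    // (or overridden via Security.setProperty), or the fallback text if unset.
    static String readOrDefault(String property, String fallback) {
        String value = Security.getProperty(property);
        return value != null ? value : fallback;
    }

    public static void main(String[] args) {
        String unset = "(not set, JVM built-in default applies)";
        System.out.println("networkaddress.cache.ttl = "
                + readOrDefault("networkaddress.cache.ttl", unset));
        System.out.println("networkaddress.cache.negative.ttl = "
                + readOrDefault("networkaddress.cache.negative.ttl", unset));
    }
}
```

Running this inside the container is a quick way to confirm the TTLs without opening the java.security file.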
For some reason, such as maintenance, the IP address behind sqs.eu-west-3.amazonaws.com might change from time to time.
Since InetAddress keeps the former address in its cache, we end up with an UnknownHostException. This failure is cached for 10s, so when we try to access sqs.eu-west-3.amazonaws.com again within the next 10s, the exception is thrown immediately without asking for a new DNS resolution.
The SQS client handles this poorly.
When the SQS client fails to contact the queue, it will retry by default twice, so three attempts in total. As the AWS documentation puts it:
A default value of 2 for maximum retry attempts, making a total of 3 call attempts. This value can be overwritten through the max_attempts configuration parameter.
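The default retry back-off strategy, FullJitterBackoffStrategy, works as follows: each back-off is a random delay between 0 and the exponential delay computed for that retry.
Meaning if we are on the second retry, with a base delay of 100ms, the exponential delay is 400ms, so the back-off is somewhere between 0 and 400ms.
So, best case scenario, we wait for 600ms in total (200ms, then 400ms). Since the unsuccessful DNS resolution is cached for 10s, every retry hits the cached failure and we end up throwing an error.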
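The 600ms figure can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions, not SDK code: FullJitterBound and its methods are hypothetical names, and the exponential delay for retry n is taken as base × 2^n, which matches the Javadoc example (base 100ms gives 400ms on the second retry). The exact indexing inside the SDK may differ.

```java
import java.time.Duration;

// Hypothetical helper, not part of the AWS SDK.
final class FullJitterBound {

    private FullJitterBound() {}

    // Exponential delay for a retry (1 = first retry): base * 2^retry, capped.
    // Assumption: indexing follows the Javadoc example (100 ms -> 400 ms on retry 2).
    static long exponentialDelayMillis(Duration base, Duration cap, int retry) {
        long uncapped = base.toMillis() << Math.min(retry, 30);
        return Math.min(uncapped, cap.toMillis());
    }

    // Full jitter draws each back-off between 0 and the exponential delay,
    // so the longest possible total wait is the sum of the exponential delays.
    static long maxTotalWaitMillis(Duration base, Duration cap, int retries) {
        long total = 0;
        for (int retry = 1; retry <= retries; retry++) {
            total += exponentialDelayMillis(base, cap, retry);
        }
        return total;
    }

    public static void main(String[] args) {
        long longest = maxTotalWaitMillis(Duration.ofMillis(100), Duration.ofSeconds(20), 2);
        System.out.println("Longest total back-off over 2 retries: " + longest + " ms");
    }
}
```

Whatever the random draws are, two retries cannot span the 10s during which the failed lookup stays cached.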
To fix this, I did two things:
- Increase the number of retries, so that we keep retrying for at least 10s.
- Use the EqualJitterBackoffStrategy, which guarantees that each back-off is at least as long as the previous one.
BackoffStrategy backoffStrategy = EqualJitterBackoffStrategy.builder()
    .baseDelay(Duration.ofMillis(100))
    .maxBackoffTime(Duration.ofSeconds(20))
    .build();
// Retry up to 10 times with the equal-jitter back-off defined above
SqsClientBuilder sqsClientBuilder = SqsClient.builder()
    .overrideConfiguration(configuration ->
        configuration.retryPolicy(retryPolicy ->
            retryPolicy.backoffStrategy(backoffStrategy)
                .numRetries(10)
        )
    )
    .region(region);
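With the EqualJitterBackoffStrategy, each back-off is computed as:
Back-off = divided delay + random delay
where the divided delay is half of the exponential delay, and the random delay is picked between 0 and that same half.
Example, for an exponential delay of 800ms:
Back-off = 400 + 256 = 656ms.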
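To check that this configuration outlasts the 10s negative cache, one can sum the guaranteed part of each back-off (the divided delay). This is a sketch under assumptions: EqualJitterBound and its methods are hypothetical names, and the exponential delay for retry n is again taken as base × 2^n capped at the maximum back-off, following the Javadoc example rather than the SDK source.

```java
import java.time.Duration;

// Hypothetical helper, not part of the AWS SDK.
final class EqualJitterBound {

    private EqualJitterBound() {}

    // Exponential delay for a retry (1 = first retry): base * 2^retry, capped.
    // Assumption: same indexing as the Javadoc example.
    static long exponentialDelayMillis(Duration base, Duration cap, int retry) {
        long uncapped = base.toMillis() << Math.min(retry, 30);
        return Math.min(uncapped, cap.toMillis());
    }

    // Equal jitter always waits at least half of the exponential delay,
    // so the shortest possible total wait is the sum of those halves.
    static long minTotalWaitMillis(Duration base, Duration cap, int retries) {
        long total = 0;
        for (int retry = 1; retry <= retries; retry++) {
            total += exponentialDelayMillis(base, cap, retry) / 2;
        }
        return total;
    }

    // True when even the shortest retry sequence outlasts the negative DNS cache.
    static boolean outlastsNegativeCache(long minTotalWaitMillis, Duration negativeTtl) {
        return minTotalWaitMillis >= negativeTtl.toMillis();
    }

    public static void main(String[] args) {
        long shortest = minTotalWaitMillis(Duration.ofMillis(100), Duration.ofSeconds(20), 10);
        System.out.println("Shortest total back-off over 10 retries: " + shortest + " ms");
        System.out.println("Outlasts a 10 s negative TTL: "
                + outlastsNegativeCache(shortest, Duration.ofSeconds(10)));
    }
}
```

Under these assumptions, even the unluckiest random draws keep the client retrying well past the 10s negative TTL, whereas two retries would not.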