java amazon-web-services ssl amazon-kms ssl-handshake

SSL handshake with KMS server takes up to 50 sec for a few requests even though the socket connection timeout is 2000 ms as per logs


We are using AwsCrypto (AWS Java SDK) for encryption and decryption. We follow the pattern described in this AWS doc, with the data-key cache enabled.

For a few requests, I intermittently see the TLS handshake with the KMS server taking a long time (up to 50 sec before the connection is retried, as per logs), while for other similar requests the TLS handshake completes within milliseconds.

As per the logs, the socket connection timeout is set to 2000 ms, but for some reason the timeout does not fire and the thread stays stuck waiting for the handshake response for more than 30 sec, ranging up to 50 sec.

This is a problem because the thread is blocked for no reason, and as our service scales it can become a bottleneck. We want to fix these latency spikes caused by KMS.

Related logs

*.*.awssdk.http.apache.internal.conn.SdkTlsSocketFactory: Connecting socket to kms.us-east-1.amazonaws.com/52.119.199.83:443 with timeout 2000

2023-12-09T14:26:04.993Z *.*.awssdk.http.apache.*.conn.SdkTlsSocketFactory: Starting handshake

2023-12-09T14:26:35.029Z *.*.awssdk.request: Retryable error detected. Will retry in 43ms. Request attempt number 2

As can be seen, after the handshake was initiated the connection did not break for 30 sec before the retry, even though the timeout for connecting the socket was 2 sec, as the first log line shows.

Is there some misconfiguration causing this, or is it some other issue?

Our service is an ECS-based service using AWS SDK 1.x.

PS: For those voting to close, kindly leave a comment explaining why this question should be closed. I would be happy to do that myself given an acceptable reason.


Solution

  • Based on a POC in which I tried the different types of timeouts associated with the KMS client, I learned the following:

    In our case we were not setting the socket timeout, so the client was using the default value of 50 sec. Once we set a custom socket timeout, the long handshake times went away.

    It is still not known why the handshake was taking that long, but as per my research transient network issues can cause it, so it is better to set a custom socket timeout so that the client fails fast and retries.
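
To illustrate why a 2000 ms connection timeout would not bound the handshake wait, here is a minimal JDK-only sketch. It assumes the handshake wait behaves like a blocking read on an already connected socket, which is bounded by the socket (read) timeout rather than the connect timeout. The class name `TimeoutDemo`, the method `timeBlockedRead`, and the 300 ms value are hypothetical and chosen only for illustration. A local server that accepts the TCP connection but never replies stands in for a slow peer. In AWS SDK 1.x the two settings correspond to `ClientConfiguration.setConnectionTimeout` and `ClientConfiguration.setSocketTimeout`.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class TimeoutDemo {

    /**
     * Connects to a local server that never sends data, then performs a blocking read.
     * Returns how long (in ms) the read blocked before giving up.
     */
    static long timeBlockedRead(int connectTimeoutMs, int readTimeoutMs) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // The OS completes the TCP handshake for this listening socket,
        // but we never accept() or write, so the client sees a silent peer.
        try (ServerSocket silentServer = new ServerSocket(0, 1, loopback);
             Socket client = new Socket()) {

            // Connect timeout: only bounds establishing the TCP connection.
            client.connect(
                new InetSocketAddress(loopback, silentServer.getLocalPort()),
                connectTimeoutMs);

            // Socket (read) timeout: bounds every blocking read after connecting.
            client.setSoTimeout(readTimeoutMs);

            long start = System.nanoTime();
            try {
                client.getInputStream().read(); // blocks: the peer never responds
            } catch (SocketTimeoutException expected) {
                // The read timeout fired, not the connect timeout.
            }
            return (System.nanoTime() - start) / 1_000_000;
        }
    }

    public static void main(String[] args) throws IOException {
        long elapsedMs = timeBlockedRead(2000, 300);
        System.out.println("Blocked read gave up after ~" + elapsedMs + " ms");
    }
}
```

In this sketch the read gives up after roughly the read timeout, regardless of the 2000 ms connect timeout. That matches the behaviour described above: once a shorter socket timeout is set, a stalled wait fails fast and the request can be retried.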