
How to implement zero-downtime key rotation


I have several micro-services running in AWS, some of which communicate with each other, some of them having external clients or being clients to external services.

To implement my services I need a number of secrets (RSA key pairs to sign/verify tokens, symmetric keys, API keys, etc.). I am using AWS Secrets Manager for this, and it works fine, but I'm now in the process of implementing proper support for key rotation and I have a few thoughts.

Let's say service A needs a key K for service B:

Is this the best approach or are there others to consider?

Then, in some situations I have a symmetric key J that is used within the same service, for example a key to encrypt some session with. So in one request to service C, a session is encrypted with key J1, then needs to be decrypted with J1 at a later stage. I have multiple instances of the C service.

The problem here is that if the same secret is used for both encryption and decryption, rotating it becomes messier: if the key is rotated to the value J2 and one instance has refreshed so that it encrypts with J2 while another instance still doesn't see J2, decryption on that second instance will fail.

I can see a few approaches here:

  1. Split into two secrets with separate rotation schemes and rotate one at a time, similar to the above. This adds overhead in terms of extra secrets to handle, with identical values (apart from them being rotated with some time in between)

  2. Let the decryption force a refresh of the secret upon failure:

    • Encryption always uses AWSCURRENT (J1 or J2 depending on if refreshed)
    • Decryption will try AWSCURRENT then AWSPREVIOUS, and if both fail (because another instance encrypted with J2 while [J1, J0] is stored locally) will request a manual refresh of the secret ([J2, J1] is now stored), and then try AWSCURRENT and AWSPREVIOUS again.
  3. Use three keys in the key window and always encrypt with the middle one, since it should always be in the window of all other instances (unless it was rotated several times, faster than the refresh interval). This adds complexity.
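Approach 2 can be sketched roughly as follows. This is only an illustration under assumptions: `RefreshingKeyWindow`, `fetch_window` and `try_decrypt` are hypothetical names, `fetch_window` stands in for whatever call returns the [AWSCURRENT, AWSPREVIOUS] values from Secrets Manager, and `try_decrypt` stands in for an authenticated decryption routine that returns None when the key does not match.

```python
from typing import Callable, List, Optional, Sequence


class DecryptionFailed(Exception):
    """Raised when no key in the window can decrypt, even after a refresh."""


class RefreshingKeyWindow:
    """Caches [AWSCURRENT, AWSPREVIOUS] and refreshes once on decrypt failure."""

    def __init__(self, fetch_window: Callable[[], Sequence[bytes]]):
        # fetch_window is a hypothetical callable returning the key values
        # ordered as [AWSCURRENT, AWSPREVIOUS].
        self._fetch_window = fetch_window
        self._window: List[bytes] = list(fetch_window())

    @property
    def current(self) -> bytes:
        """Key used for encryption: always the cached AWSCURRENT value."""
        return self._window[0]

    def refresh(self) -> None:
        self._window = list(self._fetch_window())

    def decrypt(
        self,
        blob: bytes,
        try_decrypt: Callable[[bytes, bytes], Optional[bytes]],
    ) -> bytes:
        # First pass uses the cached window; on failure, refresh and retry once.
        for attempt in (1, 2):
            for key in self._window:
                plaintext = try_decrypt(key, blob)
                if plaintext is not None:
                    return plaintext
            if attempt == 1:
                self.refresh()
        raise DecryptionFailed("no key in the refreshed window matched")
```

The forced refresh only happens on the failure path, so the normal case costs nothing extra. Data that was never valid triggers one refresh per attempt, so some rate limiting on `refresh` is probably worth adding.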

What other options are there? This seems like such a standard use-case but I still struggled to find the best approach.

EDIT ------------------

Based on JoeB's answer, the algorithm I've come up with so far is this: Let's say that initially the secret has the CURRENT value K1, and PENDING value null.

Normal operation

Key rotation

  1. Put a new value K2 for the PENDING stage
  2. wait T seconds -> All services now accept [AWSCURRENT=K1, AWSPENDING=K2]
  3. Add ROTATING to the K1 version + move AWSCURRENT to the K2 version + remove AWSPENDING label from K2 (there seems to be no atomic swapping of labels). Until T seconds have passed, some clients will use K2 and some K1, but all services accept both
  4. wait T seconds -> All services still accept [AWSCURRENT=K2, ROTATING=K1] and all clients use AWSCURRENT=K2
  5. Remove the ROTATING stage from K1. Note that K1 will still have the AWSPREVIOUS stage.
  6. After T seconds, all services will only accept [AWSCURRENT=K2], and K1 is effectively dead.
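To sanity-check the sequence, here is a small in-memory simulation of the stage labels. It is a sketch, not Secrets Manager code: `StagedSecret`, `rotate` and `wait` are hypothetical names, and it assumes that accepting services honour the values under AWSCURRENT, AWSPENDING and the custom ROTATING label, which is what the steps above imply.

```python
from typing import Callable, Dict, Set


class StagedSecret:
    """In-memory stand-in for one secret with stage labels mapped to values."""

    ACCEPTED_STAGES = ("AWSCURRENT", "AWSPENDING", "ROTATING")

    def __init__(self, current: str):
        self.labels: Dict[str, str] = {"AWSCURRENT": current}

    def signing_key(self) -> str:
        """What clients use once they have refreshed."""
        return self.labels["AWSCURRENT"]

    def accepted(self) -> Set[str]:
        """What accepting services honour once they have refreshed."""
        return {
            self.labels[stage]
            for stage in self.ACCEPTED_STAGES
            if stage in self.labels
        }


def rotate(secret: StagedSecret, new_value: str, wait: Callable[[], None]) -> None:
    # Step 1: stage the new value as pending, then let every instance refresh.
    secret.labels["AWSPENDING"] = new_value
    wait()
    # Step 3: label the old value ROTATING *before* moving AWSCURRENT,
    # so the old value is never left without an accepted label.
    old_value = secret.labels["AWSCURRENT"]
    secret.labels["ROTATING"] = old_value
    secret.labels["AWSCURRENT"] = new_value
    secret.labels["AWSPREVIOUS"] = old_value
    del secret.labels["AWSPENDING"]
    wait()
    # Step 5: retire the old value; it keeps only AWSPREVIOUS.
    del secret.labels["ROTATING"]
    wait()
```

Passing a `wait` callback instead of sleeping keeps the sketch testable. In a real setup each `wait` corresponds to the delay of T seconds between separately scheduled invocations.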

This should work both for separate secrets and for symmetric secrets used for both encryption and decryption.

Unfortunately I don't know how to use the built-in rotation mechanism for this since it requires several steps with delays in between. One idea is to invent some custom steps and have the setSecret step create a CloudWatch cron event that will invoke the function again after T seconds, calling it with steps swapPending and removePending. It would be awesome if SecretsManager could support this automatically, for example by supporting that the function returns a value indicating that the next step should be invoked after T seconds.


Solution

  • For your credential question, you do not have to keep both the current and previous credentials in the application as long as service B supports two active credentials. To do this you must ensure that a credential is not marked AWSCURRENT until it is ready. The application then always fetches and uses the AWSCURRENT credential. In the rotation Lambda you would take these steps:

    1. Store the new credential in Secrets Manager with the stage label AWSPENDING (if you pass a stage on create, the secret is not marked AWSCURRENT). Also use the idempotency token provided to the Lambda when you create the secret so you do not create duplicates on retry.
    2. Take the secret stored in Secrets Manager under the AWSPENDING stage and add it as a credential in service B.
    3. Verify that you can login to service B with the AWSPENDING credential.
    4. Change the stage of the AWSPENDING credential to AWSCURRENT.

    These are the same steps Secrets Manager takes when it creates a multi-user RDS rotation Lambda. Be sure to use the AWSPENDING label because Secrets Manager treats it specially. If service B does not support two active credentials or multiple users sharing data, there might not be a way to do this. See the Secrets Manager rotation docs on this.

    In addition, the Secrets Manager rotation engine is asynchronous and will retry after failures (which is why each Lambda step must be idempotent). There is an initial set of retries (on the order of 5) and then some daily retries thereafter. You can take advantage of this by failing the third step (testing the secret) with an exception until the propagation conditions are met. Alternatively, you can raise the Lambda timeout to 15 minutes and sleep an appropriate amount of time waiting for propagation to complete. The sleep method, though, has the disadvantage of tying up resources needlessly.

    Keep in mind that as soon as you remove the pending stage or move AWSCURRENT to the pending version, the rotation engine will stop. If service B accepts current and pending (or current, pending, and previous if you want to be extra safe), the four steps above will work if you add the delay you described. You can also look at the AWS Secrets Manager Sample Lambdas for examples of how the stages are manipulated for database rotations.
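The four steps map onto the createSecret, setSecret, testSecret and finishSecret steps that the rotation engine passes to the Lambda. A rough sketch, with assumptions stated up front: `client` is expected to be a boto3 Secrets Manager client created by the caller, `register_with_b` and `can_login_to_b` are hypothetical callables for whatever service B offers, and the secret value is a random string purely for illustration.

```python
import secrets


def make_rotation_handler(client, register_with_b, can_login_to_b):
    """Builds a rotation handler; all three arguments are supplied by the caller."""

    def pending_value(arn, token):
        response = client.get_secret_value(
            SecretId=arn, VersionId=token, VersionStage="AWSPENDING"
        )
        return response["SecretString"]

    def create_secret(arn, token):
        try:
            pending_value(arn, token)
            return  # A retry: the pending version already exists.
        except client.exceptions.ResourceNotFoundException:
            pass
        # The request token doubles as the version id, which makes this idempotent.
        client.put_secret_value(
            SecretId=arn,
            ClientRequestToken=token,
            SecretString=secrets.token_urlsafe(32),
            VersionStages=["AWSPENDING"],
        )

    def set_secret(arn, token):
        register_with_b(pending_value(arn, token))

    def test_secret(arn, token):
        # Raising here makes the rotation engine retry this step later.
        if not can_login_to_b(pending_value(arn, token)):
            raise RuntimeError("pending credential not usable yet")

    def finish_secret(arn, token):
        stages = client.describe_secret(SecretId=arn)["VersionIdsToStages"]
        current = next(
            (vid for vid, labels in stages.items() if "AWSCURRENT" in labels), None
        )
        if current == token:
            return  # Already finished on an earlier attempt.
        kwargs = {"SecretId": arn, "VersionStage": "AWSCURRENT", "MoveToVersionId": token}
        if current is not None:
            kwargs["RemoveFromVersionId"] = current
        client.update_secret_version_stage(**kwargs)

    steps = {
        "createSecret": create_secret,
        "setSecret": set_secret,
        "testSecret": test_secret,
        "finishSecret": finish_secret,
    }

    def handler(event, context):
        steps[event["Step"]](event["SecretId"], event["ClientRequestToken"])

    return handler
```

To wait for propagation as described above, `can_login_to_b` could return False until the delay has elapsed, so that the testSecret step fails and is retried.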

    For your encryption question, the best way I have seen to do this is to store an identifier of the encryption key with the encrypted data. So when you encrypt data D1 with key J1, you either store alongside it or pass to the downstream application something like the secret ARN and version (say V). If service A is sending encrypted data to service B in a message M(...), it would work as follows:

    1. A fetches the key J1 for stage AWSCURRENT (identified by ARN and version V1).
    2. A encrypts the data D1 as E1 using key J1 and sends it in message M1(ARN, V1, E1) to B.
    3. Later J1 is rotated to J2 and J2 is marked AWSCURRENT.
    4. A fetches the key J2 for stage AWSCURRENT (identified by ARN and V2).
    5. A encrypts the data D2 as E2 using key J2 and sends it in message M2(ARN, V2, E2) to B.
    6. B receives M1 and fetches the key (J1) by specifying ARN, V1 and decrypts E1 to get D1.
    7. B receives M2 and fetches the key (J2) by specifying ARN, V2 and decrypts E2 to get D2.

    Note that the keys can be cached by both A and B. If the encrypted data is to be stored long term, you will have to ensure that a key is not deleted until either the encrypted data no longer exists or it has been re-encrypted with the current key. You can also use multiple secrets (instead of versions) by passing different ARNs.
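The message flow above, including the caching, might look roughly like this. The names `Envelope`, `VersionedKeyCache`, `seal` and `open_envelope` are made up for this sketch, the cipher is passed in as plain callables, and `fetch_key` is assumed to wrap something like a boto3 `get_secret_value(SecretId=arn, VersionId=version_id)` call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Cipher = Callable[[bytes, bytes], bytes]  # (key, data) -> transformed data


@dataclass(frozen=True)
class Envelope:
    """The message M(ARN, V, E): key identifier plus encrypted payload."""

    secret_arn: str
    version_id: str
    ciphertext: bytes


class VersionedKeyCache:
    """Caches keys by (ARN, version id); a given version's value never changes."""

    def __init__(self, fetch_key: Callable[[str, str], bytes]):
        # fetch_key is a hypothetical callable that loads one specific version.
        self._fetch_key = fetch_key
        self._cache: Dict[Tuple[str, str], bytes] = {}

    def key_for(self, secret_arn: str, version_id: str) -> bytes:
        ident = (secret_arn, version_id)
        if ident not in self._cache:
            self._cache[ident] = self._fetch_key(secret_arn, version_id)
        return self._cache[ident]


def seal(
    secret_arn: str, version_id: str, key: bytes, plaintext: bytes, encrypt: Cipher
) -> Envelope:
    """Service A: encrypt and record which key version was used."""
    return Envelope(secret_arn, version_id, encrypt(key, plaintext))


def open_envelope(envelope: Envelope, keys: VersionedKeyCache, decrypt: Cipher) -> bytes:
    """Service B: look the key up by the identifier carried in the message."""
    key = keys.key_for(envelope.secret_arn, envelope.version_id)
    return decrypt(key, envelope.ciphertext)
```

Because B selects the key by version rather than by stage label, it does not matter whether B has noticed the rotation yet when M2 arrives.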

    Another alternative is to use KMS for encryption. Service A would send the encrypted KMS data key, instead of the key identifier, along with the encrypted payload. B can then decrypt the encrypted data key by calling KMS and use the resulting data key to decrypt the payload.