We are using an AWS stack that includes OpenSearch. We have several AWS accounts for different customers, with a largely identical OpenSearch setup in each: every AWS account has one PROD and one STAGING OpenSearch domain (PROD and STAGING are our own way of configuring and using the domains, not something AWS provides).
The problem is that, from time to time, our master users stop working. So far this has luckily only happened on the staging domains, but who knows? When it happens, I can no longer log in to the OpenSearch dashboard with the master user, and my app can no longer authenticate via the API either.
Our workaround is to "create (another?) master user" with a DIFFERENT password.
So far, it has only happened once on each account, and only in staging, but it is still a really uncomfortable prospect to imagine it happening on PROD, or more often than, say, once a month on a single product.
Do you know what might be happening here? I considered something like upgrades losing the master DB, or AWS blocking leaked passwords, although I don't see how a KeePass-generated password would leak, unless it REALLY LEAKED, which would mean we are in much bigger trouble and in which case I'd expect a message from AWS. My most probable guess is that the cluster's single instance has been replaced and the user DB has gone with it. That would explain why the PROD domains do not have this problem, but we would rather not keep too many resources on standby for our staging ENV...
Any other ideas?
Most likely your instance has been replaced, and the password was reset along with it. You can check this by going to the Instance health tab -> select an instance -> maximize a graph -> increase the duration to e.g. one week, and then looking for any gaps in the data.
Settings like the password are stored on this instance. I've noticed the same behaviour for internal users and for dashboard index patterns. When the instance is replaced, all of this is gone.
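If you'd rather not eyeball the graphs, the same gap check can be scripted. This is only a rough sketch, not something taken from the console: find_gaps and fetch_cpu_timestamps are names I made up, and it assumes the domain metrics live in the CloudWatch namespace AWS/ES with the dimensions DomainName and ClientId (your AWS account ID). A 15-minute period keeps a one-week query under the per-call datapoint limit of get_metric_statistics.

```python
from datetime import datetime, timedelta, timezone


def find_gaps(timestamps, period=timedelta(minutes=15), tolerance=1.5):
    """Return (earlier, later) pairs of consecutive datapoints that are
    further apart than tolerance * period, i.e. where data is missing."""
    ordered = sorted(timestamps)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier > period * tolerance:
            gaps.append((earlier, later))
    return gaps


def fetch_cpu_timestamps(domain_name, account_id, days=7, period_seconds=900):
    """Fetch the timestamps of CPUUtilization datapoints for one domain.

    Assumption: metrics are published under AWS/ES with the dimensions
    DomainName and ClientId. boto3 is imported lazily so that find_gaps
    can be used without it.
    """
    import boto3

    cloudwatch = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/ES",
        MetricName="CPUUtilization",
        Dimensions=[
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": account_id},
        ],
        StartTime=end - timedelta(days=days),
        EndTime=end,
        Period=period_seconds,
        Statistics=["Average"],
    )
    return [point["Timestamp"] for point in response["Datapoints"]]
```

Usage would be something like find_gaps(fetch_cpu_timestamps("my-staging-domain", "123456789012")), where both arguments are placeholders. Any pair it returns marks a window without data, which is worth comparing against the time the master user stopped working.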
You can prevent this in a couple of ways:
If you're using a T-family instance, check whether it has high CPU usage. A t3.small, for example, has a baseline utilization of only 20% per vCPU, and sustained usage above that baseline drains its CPU credits until you run out. Consider using a non-T instance type.
Use more powerful nodes
Add more nodes
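To get a feeling for how quickly a burstable instance can run dry, here is a back-of-the-envelope calculation. It is a sketch under assumptions: the figures are the published EC2 values for t3.small (2 vCPUs, 24 credits earned per hour, maximum balance of 576 credits), and I'm assuming the managed search instance behaves the same way, which I have not verified. The function names are my own.

```python
# Assumed EC2 t3.small figures; not confirmed for the managed search variant.
VCPUS = 2
CREDITS_EARNED_PER_HOUR = 24.0
MAX_CREDIT_BALANCE = 576.0


def credits_spent_per_hour(avg_cpu_percent, vcpus=VCPUS):
    """One CPU credit is one vCPU running at 100% for one minute."""
    return avg_cpu_percent * vcpus * 60 / 100


def hours_until_depleted(avg_cpu_percent, balance=MAX_CREDIT_BALANCE):
    """Hours until the credit balance reaches zero at a constant average
    CPU usage, or None if usage is at or below the baseline."""
    net_drain = credits_spent_per_hour(avg_cpu_percent) - CREDITS_EARNED_PER_HOUR
    if net_drain <= 0:
        return None  # credits are earned at least as fast as they are spent
    return balance / net_drain
```

With these numbers, 20% average usage is exactly the break-even point (24 credits spent, 24 earned), while a constant 50% would empty a full balance in 16 hours. A busy staging domain on a small T instance can therefore run out of credits well within a day.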