I want to replace an index with zero downtime, as described in the ES documentation.
I am doing so by:

1. Creating a new index my_index_v2 with the new data
2. Atomically switching the alias to the new index:

POST /_aliases
{
  "actions": [
    { "remove": { "index": "*", "alias": "my_index" }},
    { "add": { "index": "my_index_v2", "alias": "my_index" }}
  ]
}
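3. Deleting the old index (this is the DELETE step clarified in the edit below). As a sketch, assuming the outgoing index is named my_index_v1 (that name is not stated in the question):

DELETE /my_index_v1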
This works as expected, except that it randomly fails with a 404 response. The error message is:
{
  "error": {
    "root_cause": ... (same),
    "type": "index_not_found_exception",
    "reason": "no such index",
    "resource.type": "index_or_alias",
    "resource.id": "my_unrelated_index_v13",
    "index": "my_unrelated_index_v13"
  },
  "status": 404
}
The whole operation runs periodically, every few minutes. Similar operations may happen at the same time in the cluster, on other aliases/indices. The error happens randomly, once every several hours.
Is there a reason why these operations would interfere with each other? What is going on?
EDIT: clarified the DELETE step at the end.
This is difficult to reproduce in a local environment because it seems to only happen in highly concurrent scenarios. However, as pointed out by @Eirini Graonidou in the comments, this really looks like an ES bug, fixed in PR 23153.
From the pull request (emphasis mine):
This either leads to puzzling responses when a bad request is sent to Elasticsearch (if an index named "bad-request" does not exist then it produces an index not found exception and otherwise responds with the index settings for the index named "bad-request").
This does not explain what makes the request bad in the first place, but it definitely explains why the error message does not make sense.
More importantly: upgrading Elasticsearch solves this issue.
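To confirm which version a cluster is actually running before and after the upgrade, the root endpoint reports it (this is standard Elasticsearch behaviour, not something specific to this question):

GET /

The response includes a version.number field, which you can check against the release that contains the fix.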