elasticsearch, elasticsearch-2.4

Atomic alias swap fails with index_not_found_exception on a totally unrelated index


I want to replace an index with zero downtime, as described in the ES documentation.

I am doing so by:

POST /_aliases

{
    "actions": [
        { "remove": { "index": "*", "alias": "my_index" }},
        { "add":    { "index": "my_index_v2", "alias": "my_index" }}
    ]
}
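For anyone scripting this swap, the request body above can be built programmatically. The following is only a minimal sketch in Python: the helper name `build_alias_swap` and its parameters are hypothetical, and sending the body to `POST /_aliases` with an HTTP client is deliberately left out.

```python
import json


def build_alias_swap(alias, new_index, old_index_pattern="*"):
    """Build the body for POST /_aliases (hypothetical helper).

    Both actions go into one request, so the alias is removed from the
    old indices and added to the new index in a single call.
    """
    return {
        "actions": [
            # Remove the alias from every index matching the pattern.
            {"remove": {"index": old_index_pattern, "alias": alias}},
            # Point the alias at the new index.
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


body = build_alias_swap("my_index", "my_index_v2")
print(json.dumps(body, indent=4))
```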

This works as expected, except when it randomly fails with a 404 response. The error message is:

{
   "error": {
      "root_cause": ... (same)
      "type": "index_not_found_exception",
      "reason": "no such index",
      "resource.type": "index_or_alias",
      "resource.id": "my_unrelated_index_v13",
      "index": "my_unrelated_index_v13"
   },
   "status": 404
}
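When logging these failures, it can help to flag the cases where the missing index has nothing to do with the alias being swapped. The sketch below assumes the response has already been parsed into a Python dict; the function name `is_unrelated_missing_index` and the `expected_indices` argument are hypothetical, and the sample response omits the elided `root_cause` field.

```python
def is_unrelated_missing_index(response, expected_indices):
    """Return True for a 404 index_not_found_exception that names an
    index outside expected_indices (hypothetical helper)."""
    error = response.get("error")
    # Only object-style errors, as in the response above, are handled.
    if not isinstance(error, dict):
        return False
    if response.get("status") != 404:
        return False
    if error.get("type") != "index_not_found_exception":
        return False
    return error.get("index") not in expected_indices


# Sample response mirroring the error above (root_cause left out).
sample = {
    "error": {
        "type": "index_not_found_exception",
        "reason": "no such index",
        "resource.type": "index_or_alias",
        "resource.id": "my_unrelated_index_v13",
        "index": "my_unrelated_index_v13",
    },
    "status": 404,
}

unrelated = is_unrelated_missing_index(sample, {"my_index_v1", "my_index_v2"})
print(unrelated)
```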

The whole operation runs every few minutes. Similar operations may run at the same time in the cluster, on other aliases/indices. The error occurs randomly, once every several hours.

Is there a reason why these operations would interfere with each other? What is going on?

EDIT: clarified the DELETE step at the end.


Solution

  • This is difficult to reproduce in a local environment because it seems to happen only in highly concurrent scenarios. However, as pointed out by @Eirini Graonidou in the comments, this really looks like an ES bug, fixed in PR 23153.

    From the pull request (emphasis mine):

    This either leads to puzzling responses when a bad request is sent to Elasticsearch (if an index named "bad-request" does not exist then it produces an index not found exception and otherwise responds with the index settings for the index named "bad-request").

    This does not explain the "bad request" situation, but definitely explains why the error message does not make sense.

    More importantly: upgrading Elasticsearch solves this issue.