elasticsearch full-text-search token analyzer

Why does Elasticsearch analyze a document twice?


From what I've understood, when I index a document, say:

PUT <index>/_doc/1
{
   "title":"black white fox cat"
}

Elasticsearch analyzes this with the standard analyzer and turns the title into an array of tokens.

But then when I search for this document, let's say:

POST <index>/_search
{
  "query": {
    "match": {
      "title": "black"
    }
  }
}

It analyzes again with the same analyzer. Isn't that inefficient?


Solution

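  • It's not inefficient; it's a necessary step in producing the search results. Note that the document itself is analyzed only once, at index time; at search time only the query text is analyzed. Let me explain how the index and search processes work under the hood.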

    1. At index time, Elasticsearch tokenizes the text according to the field's data type and configured analyzer, and stores the resulting tokens in the inverted index.
    2. At search time, the search terms are tokenized again according to the query type (no analysis happens for the term family of queries), and the generated tokens are looked up in the inverted index created at index time (step 1).
    3. Token matching (comparing the tokens generated at query time against the index-time tokens in the inverted index) is what finds the matching documents and produces the search results. Normally this is an exact string match, with a few exceptions such as the prefix and wildcard queries, and because it is an exact string match it is a very fast and well-optimized process.
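
The three steps above can be sketched with a toy inverted index. This is only an illustration, not how Elasticsearch is implemented: `analyze` is a simplified stand-in for the standard analyzer (lowercase, split on non-alphanumerics), and `ToyIndex` is a hypothetical name invented for this example.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer (assumption): lowercase, then split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


class ToyIndex:
    def __init__(self):
        # Inverted index: token -> set of ids of documents containing it
        self.postings = defaultdict(set)

    def index(self, doc_id, text):
        # Step 1: analyze the document once, at index time
        for token in analyze(text):
            self.postings[token].add(doc_id)

    def match(self, query_text):
        # Step 2: analyze only the (short) query text
        hits = set()
        for token in analyze(query_text):
            # Step 3: exact token lookup; any matching token counts (OR)
            hits |= self.postings.get(token, set())
        return hits


idx = ToyIndex()
idx.index(1, "black white fox cat")
idx.index(2, "Brown dog")
print(idx.match("Black"))  # {1}
```

The stored documents are never re-analyzed during a search; the only per-query work is analyzing the query string and a dictionary lookup per token.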

    There are various special cases: when you use the keyword data type, the text is not analyzed, and when you use term-level queries, search-time analysis doesn't happen.
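
A rough way to picture these two special cases, as a toy model only: a text field stores analyzed tokens, a keyword field stores the whole value as a single token, and a term-style query looks up its input verbatim. The function names `match_query` and `term_query` are hypothetical helpers for this sketch, not Elasticsearch APIs.

```python
def analyze(text):
    """Toy analyzer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


# A text field is analyzed at index time; a keyword field is stored as-is.
text_field_tokens = set(analyze("Black White Fox"))
keyword_field_tokens = {"Black White Fox"}


def match_query(query, tokens):
    # Full-text style: the query is analyzed before lookup
    return any(t in tokens for t in analyze(query))


def term_query(query, tokens):
    # Term-level style: no search-time analysis, the input is looked up verbatim
    return query in tokens


print(match_query("BLACK", text_field_tokens))              # True
print(term_query("Black", text_field_tokens))               # False, index holds "black"
print(term_query("black", text_field_tokens))               # True
print(term_query("Black White Fox", keyword_field_tokens))  # True
print(term_query("black", keyword_field_tokens))            # False
```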

    Now, the important thing to note is that, by default, the same analyzer that was used at index time is also used at search time; otherwise the query would generate different tokens, which would not produce a match in step 3 described earlier.
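
To see why mismatched analyzers break matching, here is a minimal sketch with two made-up analyzers (both are assumptions for illustration, not Elasticsearch's built-in ones): one lowercases, the other preserves case.

```python
def lowercase_analyzer(text):
    # Hypothetical index-time analyzer: lowercase and split on whitespace
    return text.lower().split()


def case_preserving_analyzer(text):
    # Hypothetical mismatched analyzer: split on whitespace, keep case
    return text.split()


# Tokens produced at index time
indexed_tokens = set(lowercase_analyzer("Black White Fox Cat"))


def search(query, analyzer):
    # Return the query tokens that exactly match an indexed token
    return [t for t in analyzer(query) if t in indexed_tokens]


print(search("Black", lowercase_analyzer))        # ['black']
print(search("Black", case_preserving_analyzer))  # []
```

With the same analyzer on both sides, "Black" becomes "black" and matches; with the case-preserving one, the query token "Black" has no exact counterpart in the index.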