Tags: elasticsearch, elasticsearch-5, elasticsearch-dsl, elasticsearch-6

Elasticsearch: why does an exact match score lower than a partial match?


my question

I search for the word form, but the exact match, form, is not first in the results. Is there any way to solve this problem?

my search query

{
  "query": {
    "match": {
      "word": "form"
    }
  }
}

result

word             score
--------------------------
formulation      10.864353
formaldehyde     10.864353
formless         10.864353
formal           10.84412
formerly         10.84412
forma            10.84412
formation        10.574185
formula          10.574185
formulate        10.574185
format           10.574185
formally         10.574185
form             10.254687
former           10.254687
formidable       10.254687
formality        10.254687
formative        10.254687
ill-formed       10.054999
in form          10.035862
pro forma        9.492243

Analysis results from POST my_index/_analyze

At search time, the word form produces a single token: form.

In the index, the tokens for form are ["f", "fo", "for", "form"]; the tokens for formulation are ["f", "fo", ..., "formulatio", "formulation"].
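
To make the token lists above concrete, here is a minimal Python sketch of what an edge n-gram filter with min_gram 1 and max_gram 20 produces. The function name edge_ngrams is hypothetical, and the sketch ignores the other filters in the analyzer; it only illustrates why the single search token form is found in every word that starts with form.

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    """Return the prefixes of token with lengths min_gram..max_gram.

    Hypothetical helper that mimics an edge n-gram filter for one token.
    """
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


# Tokens stored in the index for three of the words from the result list.
index_tokens = {w: edge_ngrams(w) for w in ["form", "formulation", "formal"]}

# The search analyzer produces one token, which is compared to the stored ones.
query_token = "form"
matches = [w for w, toks in index_tokens.items() if query_token in toks]

print(index_tokens["form"])  # ['f', 'fo', 'for', 'form']
print(matches)               # ['form', 'formulation', 'formal']
```

All three words contain the token form exactly once, so from the index's point of view none of them is a "more exact" match than the others.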

my config

filter

        "edgengram_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }

analyzer

      "analyzer": {
        "abc_vocab_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "keyword_repeat",
            "lowercase",
            "asciifolding",
            "edgengram_filter",
            "unique"
          ]
        },
        "abc_vocab_search_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "keyword_repeat",
            "lowercase",
            "asciifolding",
            "unique"
          ]
        }
      }

mapping

        "word": {
          "type": "text",
          "analyzer": "abc_vocab_analyzer",
          "search_analyzer": "abc_vocab_search_analyzer"
        }
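
The asymmetry between the two analyzers can be sketched in Python as follows. This is a rough approximation under stated assumptions, not how Elasticsearch implements them: tokenize stands in for the standard tokenizer, fold for lowercase plus asciifolding, and keyword_repeat is left out on the assumption that the unique filter collapses the duplicates it produces. All function names are hypothetical.

```python
import re
import unicodedata


def tokenize(text):
    # Rough stand-in for the standard tokenizer: runs of letters and digits.
    return re.findall(r"[^\W_]+", text)


def fold(token):
    # Approximation of lowercase + asciifolding: lowercase, strip diacritics.
    decomposed = unicodedata.normalize("NFKD", token.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def edge_ngrams(token, min_gram=1, max_gram=20):
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]


def unique(tokens):
    # Keep the first occurrence of each token, preserving order.
    return list(dict.fromkeys(tokens))


def index_analyze(text):
    # Sketch of abc_vocab_analyzer: every token is expanded into its prefixes.
    grams = []
    for tok in tokenize(text):
        grams.extend(edge_ngrams(fold(tok)))
    return unique(grams)


def search_analyze(text):
    # Sketch of abc_vocab_search_analyzer: no n-gram expansion.
    return unique(fold(tok) for tok in tokenize(text))


query = search_analyze("form")
entries = ["form", "ill-formed", "in form", "pro forma"]
matched = [e for e in entries if any(q in index_analyze(e) for q in query)]
print(matched)
```

Under these assumptions every entry matches, including the multi-word ones, because each of them contains a token that starts with form.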

Solution

  • You get these results because you have applied an edge-ngram filter, and form is a prefix of the words similar to it. In the inverted index, the token form therefore also points to the documents that contain formulation, formal, and so on.

    Relevance is computed accordingly. You can refer to this link; I'd specifically suggest going through the sections Default Similarity and BM25. Although the current default similarity is BM25, that link will still help you understand how scoring works.

    You need to create a sibling field that you can query in a should clause. You could create a keyword sub-field and use a Term Query, but keyword fields are not analyzed, so you would need to be careful about case sensitivity.

    Instead, as mentioned by @Val, you can create a sibling text field that uses the standard analyzer.

    Mapping:

    {
      "word": {
        "type": "text",
        "analyzer": "abc_vocab_analyzer",
        "search_analyzer": "abc_vocab_search_analyzer",
        "fields": {
          "standard": {
            "type": "text"
          }
        }
      }
    }
    

    Query (note the should clause, which adds to the score of documents that also match on word.standard):

    POST <your_index_name>/_search
    {
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "word": "form"
              }
            }
          ],
          "should": [
            {
              "match": {
                "word.standard": "form"
              }
            }
          ]
        }
      }
    }
    

    Let me know if this helps!
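
The effect of the should clause can be sketched in Python. A bool query adds up the scores of its matching must and should clauses, so documents whose word.standard field contains form as a whole token get an extra contribution. The must scores below are taken from the question's result table; the should_bonus function and its weight are purely hypothetical stand-ins for the real score of the word.standard clause, which will be different on an actual index.

```python
import re

# Scores of the must clause, copied from the question's result table.
ngram_scores = {
    "formulation": 10.864353,
    "formal": 10.84412,
    "format": 10.574185,
    "form": 10.254687,
    "former": 10.254687,
    "in form": 10.035862,
    "pro forma": 9.492243,
}


def standard_tokens(text):
    # Rough stand-in for the standard analyzer: lowercase, split on non-alphanumerics.
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def should_bonus(word, query, weight=1.0):
    # Hypothetical score of the word.standard clause: nonzero only when a
    # whole token equals the query, and larger for shorter fields.
    tokens = standard_tokens(word)
    if query not in tokens:
        return 0.0
    return weight / len(tokens)


def rank(query):
    # must score + should score, highest first.
    combined = {w: s + should_bonus(w, query) for w, s in ngram_scores.items()}
    return sorted(combined, key=combined.get, reverse=True)


ranking = rank("form")
print(ranking[0])  # form
```

With these made-up numbers, form moves to the top and in form moves ahead of former. If the extra score on a real index is too small to lift the exact match to first place, raising the boost parameter of the match query in the should clause is one option to try.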