Tags: elasticsearch, match, n-gram, relevance

ngram matching gives same score to less relevant documents


I am searching for "Bob Smith" in my Elasticsearch index. The results "Bob Smith" and "Bobbi Smith" both come back in the response with the same score. I want "Bob Smith" to score higher so that it appears first in my result set. Why are the scores equivalent?

Here is my query:

{
    "query": {
        "query_string": {
            "query": "Bob Smith",
            "fields": [
                "text_field"
            ]
        }
    }
} 

Below are my index's settings. I am using the ngram token filter described here: https://qbox.io/blog/an-introduction-to-ngrams-in-elasticsearch

{
    "contacts_5test": {
        "aliases": {},
        "mappings": {
            "properties": {
                "text_field": {
                    "type": "text",
                    "term_vector": "yes",
                    "analyzer": "ngram_filter_analyzer"
                }
            }
        },
        "settings": {
            "index": {
                "number_of_shards": "1",
                "provided_name": "contacts_5test",
                "creation_date": "1588987227997",
                "analysis": {
                    "filter": {
                        "ngram_filter": {
                            "type": "nGram",
                            "min_gram": "4",
                            "max_gram": "4"
                        }
                    },
                    "analyzer": {
                        "ngram_filter_analyzer": {
                            "filter": [
                                "lowercase",
                                "ngram_filter"
                            ],
                            "type": "custom",
                            "tokenizer": "standard"
                        }
                    }
                },
                "number_of_replicas": "1",
                "uuid": "HqOXu9bNRwCHSeK39WWlxw",
                "version": {
                    "created": "7060199"
                }
            }
        }
    }
}

Here are the results from my query:

"hits": [
  {
    "_index": "contacts_5test",
    "_type": "_doc",
    "_id": "1",
    "_score": 0.69795835,
    "_source": {
      "text_field": "Bob Smith"
    }
  },
  {
    "_index": "contacts_5test",
    "_type": "_doc",
    "_id": "2",
    "_score": 0.69795835,
    "_source": {
      "text_field": "Bobbi Smith"
    }
  }
]

If I instead search for "Bobbi Smith", Elasticsearch returns both documents, but with a higher score for "Bobbi Smith". This makes more sense.


Solution

  • I was able to reproduce your issue. The cause is your ngram_filter, which never produces a token for bob: the standard tokenizer does emit a bob token, but because you set min_gram to 4, any token shorter than 4 characters is discarded by the filter.

    Even if you lower min_gram to 3 so that a bob token is created, both Bob and Bobbi would then produce the same bob token, so both documents would still receive the same score.

    When you search for Bobbi Smith, on the other hand, the exact token bobbi is present in only one document, which is why that document gets the higher score.

    Note: use the _analyze API and the explain API to inspect the tokens that are generated and how they are matched. This will help you verify the issue and the explanation above.
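    You can see the mismatch by running the analyzer by hand. A minimal sketch of the _analyze request, assuming the index and analyzer names from the settings above:

    GET /contacts_5test/_analyze
    {
        "analyzer": "ngram_filter_analyzer",
        "text": "Bob Smith"
    }

    With min_gram and max_gram both set to 4, this should return only the tokens smit and mith; bob yields no 4-grams at all. The same request with "text": "Bobbi Smith" should return bobb, obbi, smit, and mith. Since your query string "Bob Smith" is analyzed to just smit and mith, and both documents contain those two tokens, both documents match identically and score the same. You can confirm how each term contributes per document with GET /contacts_5test/_explain/1 (and /_explain/2), passing your query_string query in the request body.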