elasticsearch

Elasticsearch scoring across multiple indices


I have several indices holding similar but not identical data that comes from different sources, and I run the same search query across all of them.

The issue that I'm trying to solve is scoring.

Simplified index mapping - I have the same field/type for searching in all indices:

PUT /dl-delme1
{
  "settings": {
  },
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "text": { "type": "text" }
    }
  }
}

PUT /dl-delme2
{
  "settings": {
  },
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "text": { "type": "text" }
    }
  }
}

Documents - the first index contains only 2 documents and the second contains 9 documents:

//1st index
POST /dl-delme1/_doc/1
{ "text":"index me" }

POST /dl-delme1/_doc/2
{ "text":"search me" }

//2nd index
POST /dl-delme2/_doc/1
{ "text":"index search" }

POST /dl-delme2/_doc/2
{ "text":"index search hoho 2" }

POST /dl-delme2/_doc/3
{ "text":"index search hoho 3" }

POST /dl-delme2/_doc/4
{ "text":"index search hoho 4" }

POST /dl-delme2/_doc/5
{ "text":"index search hoho 5" }

POST /dl-delme2/_doc/6
{ "text":"index search hoho 6" }

POST /dl-delme2/_doc/7
{ "text":"index search hoho 7" }

POST /dl-delme2/_doc/8
{ "text":"index search hoho 8" }

POST /dl-delme2/_doc/9
{ "text":"index search hoho 9" }

Search query:

POST /dl-delme*/_search
{
  "query": {
    "bool": {
      "must": {
        "multi_match": {
          "analyzer": "whitespace",
          "auto_generate_synonyms_phrase_query": false,
          "fields": [ "text" ],
          "query": "search index",
          "type": "most_fields"
        }
      }
    }
  }
}

I expect the first hit to be a document from the second index, because every document there contains both words from the search phrase.

However, the documents from the first index jump to the top with a much higher score, even though each of them contains only one word from the search phrase.

Actual result:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 2,
    "successful": 2,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 11,
      "relation": "eq"
    },
    "max_score": 0.6931471,
    "hits": [{
        "_index": "dl-delme1",
        "_id": "1",
        "_score": 0.6931471,
        "_source": {
          "text": "index me"
        }
      }, {
        "_index": "dl-delme1",
        "_id": "2",
        "_score": 0.6931471,
        "_source": {
          "text": "search me"
        }
      }, {
        "_index": "dl-delme2",
        "_id": "1",
        "_score": 0.1270443,
        "_source": {
          "text": "index search"
        }
      }, {
        "_index": "dl-delme2",
        "_id": "2",
        "_score": 0.10017593,
        "_source": {
          "text": "index search hoho 2"
        }
      }
    ]
  }
}

I tried adding the parameter search_type=dfs_query_then_fetch, and that produces the expected result. However, the official Elastic documentation says:

dfs_query_then_fetch: Documents are scored using global term and document frequencies across all shards. This is usually slower but more accurate.


Question: Is there a way to score documents purely by the number of query words they contain, so that the score does not depend on the number of documents in the shard or index? Or is there another way to score documents consistently across multiple indices/shards without using dfs_query_then_fetch?


Solution

  • Elasticsearch uses the BM25 algorithm by default, which calculates the score from term frequency (tf) and inverse document frequency (idf). For more information, see the following article: https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables

    For each matching term, the score is calculated as 2.2 x tf x idf, and the per-term scores are summed. You can check this by adding the "explain": true parameter to the search request.

    The problem in your case comes from idf. By default, idf is calculated from the documents in each shard separately. In the first index each search term occurs in only 1 of 2 documents, so it looks rare and gets a high idf. In the second index both terms occur in all 9 documents, so their idf is close to zero. With many documents and a similar term distribution in both indices (say, more than a thousand documents each), the per-shard statistics would be much closer and this would most likely not have happened.
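The arithmetic above can be reproduced outside Elasticsearch. The following is a minimal Python sketch, not Elasticsearch code: it assumes the default BM25 parameters (k1 = 1.2, b = 0.75), plain whitespace tokenization, and exact document lengths. The helper names `stats` and `bm25` are made up for this illustration. With per-index statistics it gives roughly 0.6931 for "index me" and 0.1270 for "index search", matching the scores in the question. With statistics pooled over both indices, which is approximately what dfs_query_then_fetch does, the documents from the second index rank first.

```python
import math

K1, B = 1.2, 0.75  # assumed BM25 defaults


def stats(docs):
    """Collect document count, average length and document frequencies."""
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n_docs
    df = {}
    for tokens in tokenized:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    return n_docs, avgdl, df


def bm25(doc, query, n_docs, avgdl, df):
    """Sum (k1 + 1) * idf * tf over the query terms present in the document."""
    tokens = doc.split()
    score = 0.0
    for term in query.split():
        freq = tokens.count(term)
        if freq == 0:
            continue
        n = df[term]
        idf = math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
        tf = freq / (freq + K1 * (1 - B + B * len(tokens) / avgdl))
        score += (K1 + 1) * idf * tf
    return score


index1 = ["index me", "search me"]
index2 = ["index search"] + [f"index search hoho {i}" for i in range(2, 10)]
query = "search index"

local1 = stats(index1)                 # statistics of dl-delme1 only
local2 = stats(index2)                 # statistics of dl-delme2 only
global_stats = stats(index1 + index2)  # pooled statistics, as with DFS

print("per-index statistics:")
print(f"  index me            {bm25('index me', query, *local1):.4f}")
print(f"  index search        {bm25('index search', query, *local2):.4f}")
print(f"  index search hoho 2 {bm25('index search hoho 2', query, *local2):.4f}")

print("pooled statistics:")
for doc in ("index me", "index search", "index search hoho 2"):
    print(f"  {doc:<19} {bm25(doc, query, *global_stats):.4f}")
```

The only thing that changes between the two runs is the statistics passed to `bm25`, which is why the ranking flips without any change to the documents or the query.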

    Workaround

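    If you want a simpler scoring algorithm, you can change the similarity model from BM25 to boolean. Similarity is a static index setting, so the indices have to be closed before the change and reopened afterwards.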

    boolean: A simple boolean similarity, which is used when full-text ranking is not needed and the score should only be based on whether the query terms match or not. Boolean similarity gives terms a score equal to their query boost.

    POST /dl-delme*/_close
    
    PUT /dl-delme*/_settings
    {
      "index": {
        "similarity": {
          "default": {
            "type": "boolean"
          }
        }
      }
    }
    
    POST /dl-delme*/_open
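
To illustrate what the boolean similarity means for the example documents, here is a small Python sketch. It is an approximation rather than Elasticsearch code: it assumes every query term carries the default boost of 1.0 and that the scores of matching terms are summed, and `boolean_score` is a hypothetical helper name.

```python
def boolean_score(doc, query, boost=1.0):
    """Each query term found in the document contributes its boost."""
    tokens = set(doc.split())
    return sum(boost for term in query.split() if term in tokens)


docs = ["index me", "search me", "index search", "index search hoho 2"]
for doc in docs:
    print(doc, boolean_score(doc, "search index"))
```

Under these assumptions the documents from the first index score 1.0 and the documents from the second index score 2.0, regardless of how many documents each index holds.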
    
