
Using "unique" filter, Elasticsearch analyzes tokens incorrectly


I've been trying to use the **unique** token filter in my analyzer, but it continues to count duplicate tokens when scoring.

Analyzer:

{
    "settings": {
        "analysis": {
            "analyzer": {
                "tnved_analyzer": {
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",    
                        "stemmer",
                        "unique"
                    ]
                }
            }           
        }
    },
    "mappings": {
        "properties": {
            "NAME": {
                "type": "text",
                "analyzer": "tnved_analyzer"
            },
            "CODE": {
                "type": "keyword"
            }
        }
    }
}

Request:

{
  "query": {
    "match_phrase": {
      "NAME": "Pork fresh or chilled"
    }
  }
}

Response:

{
    "took": 0,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 14.432465,
        "hits": [
            {
                "_index": "tnved14_code",
                "_type": "_doc",
                "_id": "1oajS4cBEkrkvkWeGRXx",
                "_score": 14.432465,
                "_source": {
                    "CODE": "0203",
                    "NAME": "Pork fresh or chilled"
                }
            }
        ]
    }
}

The score for an exact match is 14.432465. I expect the same score for the request "Pork fresh or chilled Pork fresh or chilled", because after the unique filter its tokens should be the same as for the request "Pork fresh or chilled" above.

But I get a score twice as high: 28.864931

I need to get 14.432465. What's wrong?


Solution

  • Check your text with the analyze API to see how it is tokenized; that will help you understand how it is scored.

    GET index_name/_analyze
    {
      "text": "Pork fresh or chilled",
      "analyzer": "tnved_analyzer"
    }
    

    Also, the explain API will help you understand how the score is calculated.
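
As a rough sketch of how those two checks might look for this question: the first request runs the doubled phrase through the analyzer, so its tokens can be compared with those of the single phrase; the second asks the explain API for the score breakdown of the matching document. The index name `tnved14_code` and the document ID are taken from the response in the question, and this assumes the index was created with the settings shown there.

```
GET tnved14_code/_analyze
{
  "analyzer": "tnved_analyzer",
  "text": "Pork fresh or chilled Pork fresh or chilled"
}

GET tnved14_code/_explain/1oajS4cBEkrkvkWeGRXx
{
  "query": {
    "match_phrase": {
      "NAME": "Pork fresh or chilled Pork fresh or chilled"
    }
  }
}
```

Each token in the `_analyze` output carries a `position`, and `match_phrase` matches on positions, so comparing the positions produced for the two phrases is a reasonable first thing to look at.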