elasticsearch, elasticsearch-analyzers

Which analyzer to use for exact match priority?


How do I configure Elasticsearch to prefer a long phrase that contains an exact match of the query word over a short word that is merely similar?

For example, given the documents "long phrase with correct part" and "Paris" and the search query "part", the first phrase should be ranked first, because it contains an exact match of the word "part"; "Paris" is only similar to "part" by letters and should be ranked second.

The current analyzers are ngramAnalyzer, fullWordAnalyzer, and rusSnowAnalyzer.

UPDATE: I have the following configuration:

filter:
    ngramFilter:
        type: 'edge_ngram'
        min_gram: '1'
        max_gram: '40'
    rusSnowFilter:
        type: 'snowball'
        language: 'russian'
analyzer:
    ngramAnalyzer:
        type: 'custom'
        tokenizer: 'standard'
        filter:
            - 'lowercase'
            - 'ngramFilter'
            - 'unique'
        char_filter:
            - 'eCharFilter'
    fullWordAnalyzer:
        type: 'custom'
        tokenizer: 'standard'
        filter:
            - 'lowercase'
            - 'unique'
        char_filter:
            - 'eCharFilter'
    rusSnowAnalyzer:
        type: 'custom'
        tokenizer: 'standard'
        filter:
            - 'lowercase'
            - 'rusSnowFilter'
        char_filter:
            - 'eCharFilter'
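
For reference, expressed in plain Elasticsearch JSON, these analyzers could be wired into a single field as multi-fields, roughly like the sketch below. The index name my_index, the field name name, and the sub-field names are assumptions, and the analysis settings above (including the eCharFilter definition, which is not shown here) are presumed to already exist in the index settings:

PUT my_index/_mapping
{
  "properties": {
    "name": {
      "type": "text",
      "analyzer": "fullWordAnalyzer",
      "fields": {
        "ngram": {
          "type": "text",
          "analyzer": "ngramAnalyzer"
        },
        "stem": {
          "type": "text",
          "analyzer": "rusSnowAnalyzer"
        }
      }
    }
  }
}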

Solution

  • TL;DR

    I believe the ngram approach is the way to go.

    In your case, if you configure the ngram to be of length 3, your search query "part" produces the following tokens: par and art.

    But only the first document ("long phrase with correct part") is going to match both.

    "Paris" would yield par, ari, and ris, so it matches only one of them.

    To reproduce:

    PUT 79416374/
    {
      "settings": {
        "analysis": {
          "tokenizer": {
            "3gram": {
              "type": "ngram",
              "min_gram": 3,
              "max_gram": 3,
              "token_chars": [
                "letter",
                "digit"
              ]
            }
          },
          "analyzer": {
            "3gram":{
              "tokenizer": "3gram",
              "filter": [
                "lowercase",
                "asciifolding"
              ]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "data":{
            "type": "text",
            "fields": {
              "3gram": {
                "type": "text",
                "analyzer": "3gram"
              }
            }
          }
        }
      }
    }
    
    POST _bulk
    {"index":{"_index":"79416374"}}
    {"data": "long phrase with correct part"}
    {"index":{"_index":"79416374"}}
    {"data": "Paris"}
    
    GET 79416374/_search
    {
      "query": {
        "match": {
          "data.3gram": "part"
        }
      }
    }
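
    As a sanity check, the standard _analyze API can be used to inspect exactly which tokens each string produces with this analyzer:

    GET 79416374/_analyze
    {
      "analyzer": "3gram",
      "text": "part"
    }

    "part" yields par and art, while running the same request with "text": "Paris" yields par, ari, and ris, so the first document matches two tokens and ranks above "Paris".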