Tags: elasticsearch, tokenize, n-gram

Tokenize each word from any start_offset


I would like to tokenize the following text:

  "text": "king martin"

into

[k, ki, kin, king, i, in, ing, ng, g, m, ma, mar, mart, martin, ar, art, arti, artin, r, rt, rti, rtin, t, ti, tin, i, in, n]

But more specifically into:

 [kin, king, ing, mar, mart, martin, art, arti, artin, rti, rtin, tin]

Is there a way to get these tokens? I have tried the following tokenizer, but how do I tell it to start at any start_offset?

  "ngram_tokenizer": {
        "type": "edge_ngram",
        "min_gram": "3",
        "max_gram": "15",
        "token_chars": [
          "letter",
          "digit"
        ]
      }
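
For reference, this edge_ngram configuration only gives me prefix-anchored grams, roughly:

 [kin, king, mar, mart, marti, martin]

so all the grams that start in the middle of a word (art, artin, rtin, tin, ...) are missing.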

Thank you!


Solution

  • You can use the ngram tokenizer rather than edge_ngram. Note that index.max_ngram_diff has to be raised in the index settings, because the difference between max_gram and min_gram here is larger than the default limit of 1.

    PUT test_ngram_stack
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "my_tokenizer"
            }
          },
          "tokenizer": {
            "my_tokenizer": {
              "type": "ngram",
              "min_gram": 3,
              "max_gram": 10,
              "token_chars": [
                "letter",
                "digit"
              ]
            }
          }
        },
        "index.max_ngram_diff": 10
      }
    }
    
    POST test_ngram_stack/_analyze
    {
      "analyzer": "my_analyzer",
      "text": "king martin"
    }
    

    (Screenshot of the _analyze response showing the generated tokens.)
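
    As a sketch of what the _analyze call should return (assuming the standard ngram tokenizer behaviour described above, not output copied from a live cluster), the token values come out as:

        [kin, king, ing, mar, mart, marti, martin, art, arti, artin, rti, rtin, tin]

    Note that marti appears as well: the ngram tokenizer emits every in-word substring whose length lies between min_gram and max_gram, so this is a superset of the list you asked for.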