I would like to tokenize the following text:
"text": "king martin"
into
[k, ki, kin, king, i, in, ing, ng, g, m, ma, mar, mart, martin, ar, art, arti, artin, r, rt, rti, rtin, t, ti, tin, i, in, n]
But more specifically into:
[kin, king, ing, mar, mart, martin, art, arti, artin, rti, rtin, tin]
Is there a way to get these tokens? I have tried the following tokenizer, but how do I tell it to "start at any start_offset"?
"ngram_tokenizer": {
  "type": "edge_ngram",
  "min_gram": "3",
  "max_gram": "15",
  "token_chars": [
    "letter",
    "digit"
  ]
}
Thank you!
You can use the ngram tokenizer rather than edge_ngram: edge_ngram only emits grams anchored at the start of each token, while ngram emits grams starting at every position.
PUT test_ngram_stack
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    },
    "index.max_ngram_diff": 10
  }
}
POST test_ngram_stack/_analyze
{
  "analyzer": "my_analyzer",
  "text": "king martin"
}
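If you want to preview what this produces without a running cluster, here is a small plain-Python sketch of the ngram tokenizer's behavior under these settings (split on anything that is not a letter or digit, then emit every substring of length min_gram to max_gram). This is an approximation for illustration, not Elasticsearch itself:

```python
import re

def ngram_tokens(text, min_gram=3, max_gram=10):
    """Approximate Elasticsearch's ngram tokenizer: split the input on
    non-letter/digit characters, then emit every substring whose length
    is between min_gram and max_gram, starting at every offset."""
    tokens = []
    for word in re.findall(r"[A-Za-z0-9]+", text):
        for start in range(len(word)):
            for length in range(min_gram, max_gram + 1):
                if start + length <= len(word):
                    tokens.append(word[start:start + length])
    return tokens

print(ngram_tokens("king martin"))
# ['kin', 'king', 'ing', 'mar', 'mart', 'marti', 'martin',
#  'art', 'arti', 'artin', 'rti', 'rtin', 'tin']
```

Note that this also yields "marti", which the ngram tokenizer will emit as well, even though it was not in your target list.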