I use the Elasticsearch n-gram tokenizer together with a match_phrase query to do fuzzy matching.
My index and test data are below:
DELETE /m8
PUT m8
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 1,
"max_gram": 3,
"custom_token_chars":"_."
}
}
},
"max_ngram_diff": 10
},
"mappings": {
"table": {
"properties": {
"dataSourceId": {
"type": "long"
},
"dataSourceType": {
"type": "integer"
},
"dbName": {
"type": "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
PUT /m8/table/1
{
"dataSourceId":1,
"dataSourceType":2,
"dbName":"rm.rf"
}
PUT /m8/table/2
{
"dataSourceId":1,
"dataSourceType":2,
"dbName":"rm_rf"
}
PUT /m8/table/3
{
"dataSourceId":1,
"dataSourceType":2,
"dbName":"rmrf"
}
Checking the tokenizer with _analyze:
POST m8/_analyze
{
"tokenizer": "my_tokenizer",
"text": "rm.rf"
}
_analyze result:
{
"tokens" : [
{
"token" : "r",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "rm",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 1
},
{
"token" : "rm.",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 2
},
{
"token" : "m",
"start_offset" : 1,
"end_offset" : 2,
"type" : "word",
"position" : 3
},
{
"token" : "m.",
"start_offset" : 1,
"end_offset" : 3,
"type" : "word",
"position" : 4
},
{
"token" : "m.r",
"start_offset" : 1,
"end_offset" : 4,
"type" : "word",
"position" : 5
},
{
"token" : ".",
"start_offset" : 2,
"end_offset" : 3,
"type" : "word",
"position" : 6
},
{
"token" : ".r",
"start_offset" : 2,
"end_offset" : 4,
"type" : "word",
"position" : 7
},
{
"token" : ".rf",
"start_offset" : 2,
"end_offset" : 5,
"type" : "word",
"position" : 8
},
{
"token" : "r",
"start_offset" : 3,
"end_offset" : 4,
"type" : "word",
"position" : 9
},
{
"token" : "rf",
"start_offset" : 3,
"end_offset" : 5,
"type" : "word",
"position" : 10
},
{
"token" : "f",
"start_offset" : 4,
"end_offset" : 5,
"type" : "word",
"position" : 11
}
]
}
When I search for 'rm', nothing is found:
GET /m8/table/_search
{
"query": {
"bool": {
"must": [
{
"match_phrase": {
"dbName": "rm"
}
}
]
}
}
}
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}
But '.rf' can be found:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.7260926,
"hits" : [
{
"_index" : "m8",
"_type" : "table",
"_id" : "1",
"_score" : 1.7260926,
"_source" : {
"dataSourceId" : 1,
"dataSourceType" : 2,
"dbName" : "rm.rf"
}
}
]
}
}
My question: why can't 'rm' be found, even though _analyze shows that the text has been split into these tokens?
Because the mapping does not define a search_analyzer, my_analyzer is used at search time as well. Your mapping behaves as if it were written like this:
"dbName": {
"type": "text",
"analyzer": "my_analyzer",
"search_analyzer": "my_analyzer" // <==== if you don't provide a search analyzer, the one defined in "analyzer" is used at search time as well
}
A match_phrase query matches phrases by the positions of the analyzed tokens. For example, searching for "Kal ho" matches a document that has "Kal" at position X and "ho" at position X+1 in the analyzed text.
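When you search for 'rm', the query text is itself analyzed with my_analyzer, which turns it into the n-grams r (position 0), rm (position 1) and m (position 2). match_phrase then requires those three tokens at consecutive positions in the document. In the indexed "rm.rf", however, the sequence is r (0), rm (1), rm. (2), m (3), so m sits two positions after rm instead of one, and the phrase does not match. '.rf' is found because its n-grams (., .r, .rf, r, rf, f) occur in exactly that order at positions 6 to 11 of the indexed text.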
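To make the position argument concrete, here is a small Python sketch that imitates what happens. It is only a simplified model, not Elasticsearch code: `ngrams` and `phrase_matches` are hypothetical helper names, the tokenizer is assumed to emit grams in the same order as the _analyze output above (one token per position), and match_phrase is assumed to use the default slop of 0.

```python
def ngrams(text, min_gram=1, max_gram=3):
    """Return (position, token) pairs, mimicking the order of the
    _analyze output for an ngram tokenizer with min_gram=1, max_gram=3."""
    tokens = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                tokens.append(text[start:start + size])
    return list(enumerate(tokens))


def phrase_matches(doc_text, query_text):
    """Simplified model of match_phrase with slop 0: the query tokens
    must appear in the document at consecutive positions, in order."""
    doc = [token for _, token in ngrams(doc_text)]
    query = [token for _, token in ngrams(query_text)]
    return any(
        doc[i:i + len(query)] == query
        for i in range(len(doc) - len(query) + 1)
    )


print(ngrams("rm"))                    # [(0, 'r'), (1, 'rm'), (2, 'm')]
print(phrase_matches("rm.rf", "rm"))   # False: 'rm.' sits between 'rm' and 'm'
print(phrase_matches("rm.rf", ".rf"))  # True: '.', '.r', '.rf', 'r', 'rf', 'f' are consecutive
```

The same check returns False for "rm_rf" and "rmrf" with the query 'rm', which is consistent with the empty result in the question.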
Solution:
Use the standard analyzer at search time with a simple match query:
GET /m8/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"dbName": {
"query": "rm",
"analyzer": "standard" // <=========
}
}
}
]
}
}
}
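As a rough illustration of why this works, the sketch below models the index as the set of n-grams per document and the query as a single term. This is an assumption-laden simplification: `index_terms` and `match_single_term` are hypothetical helpers, and the standard analyzer is approximated by lowercasing only, which is reasonable for a plain word like "rm" but not for input containing whitespace or punctuation.

```python
def index_terms(text, min_gram=1, max_gram=3):
    """Set of n-grams that the ngram tokenizer (min 1, max 3) would index."""
    return {
        text[start:start + size]
        for start in range(len(text))
        for size in range(min_gram, max_gram + 1)
        if start + size <= len(text)
    }


# The three test documents from the question, keyed by _id.
docs = {1: "rm.rf", 2: "rm_rf", 3: "rmrf"}


def match_single_term(query):
    """Simplified model of a match query whose search analyzer leaves
    the query as one lowercase term (assumed here for plain words)."""
    term = query.lower()
    return [doc_id for doc_id, name in docs.items() if term in index_terms(name)]


print(match_single_term("rm"))    # [1, 2, 3]
print(match_single_term("rmrf"))  # []
```

The second call shows a limitation to keep in mind: with max_gram set to 3, a query term longer than three characters has no matching gram in the index, so it finds nothing in this model.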
Or define the search analyzer in the mapping and use a match query (not match_phrase):
"dbName": {
"type": "text",
"analyzer": "my_analyzer",
"search_analyzer": "standard" // <==========
}
Follow-up question: why do you want to use a match_phrase query with an n-gram tokenizer?