How can I configure Elasticsearch to prefer exact word matches over words that are merely similar by letters?
For example, given the documents "long phrase with correct part" and "Paris", the search query "part" should rank the first phrase first, because it contains an exact match of the word "part", while "Paris" is only similar to "part" by letters and should come second.
The current analyzers are ngramAnalyzer, fullWordAnalyzer, and rusSnowAnalyzer.
UPDATE: I have the following configuration:
filter:
  ngramFilter:
    type: 'edge_ngram'
    min_gram: '1'
    max_gram: '40'
  rusSnowFilter:
    type: 'snowball'
    language: 'russian'
analyzer:
  ngramAnalyzer:
    type: 'custom'
    tokenizer: 'standard'
    filter:
      - 'lowercase'
      - 'ngramFilter'
      - 'unique'
    char_filter:
      - 'eCharFilter'
  fullWordAnalyzer:
    type: 'custom'
    tokenizer: 'standard'
    filter:
      - 'lowercase'
      - 'unique'
    char_filter:
      - 'eCharFilter'
  rusSnowAnalyzer:
    type: 'custom'
    tokenizer: 'standard'
    filter:
      - 'lowercase'
      - 'rusSnowFilter'
    char_filter:
      - 'eCharFilter'
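For reference, you can inspect what any of these analyzers emits with the _analyze API once they are installed on an index (my_index is a placeholder name here). Since ngramFilter is an edge_ngram of length 1 to 40, "Paris" should produce the prefixes p, pa, par, pari, paris:

GET my_index/_analyze
{
  "analyzer": "ngramAnalyzer",
  "text": "Paris"
}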
I believe ngram is the way to go.
In your case, if you configure the ngram to be of length 3, your search query "part" yields the tokens par and art. Only the first sentence matches both: "paris" yields par, ari, ris, so it matches only one token (par).
PUT 79416374/
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "3gram": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      },
      "analyzer": {
        "3gram": {
          "tokenizer": "3gram",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "data": {
        "type": "text",
        "fields": {
          "3gram": {
            "type": "text",
            "analyzer": "3gram"
          }
        }
      }
    }
  }
}
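You can verify the tokenization against this index with the _analyze API; "Paris" should come back as the three trigrams par, ari, ris:

GET 79416374/_analyze
{
  "analyzer": "3gram",
  "text": "Paris"
}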
POST _bulk
{"index":{"_index":"79416374"}}
{"data": "long phrase with correct part"}
{"index":{"_index":"79416374"}}
{"data": "Paris"}
GET 79416374/_search
{
  "query": {
    "match": {
      "data.3gram": "part"
    }
  }
}
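If you additionally want exact whole-word matches to outweigh the ngram overlap, one possible variant (a sketch, not tested against your data) is a multi_match over both the full-word field and the ngram subfield, boosting the former so that documents containing the literal word "part" always score higher:

GET 79416374/_search
{
  "query": {
    "multi_match": {
      "query": "part",
      "fields": ["data^2", "data.3gram"]
    }
  }
}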