I'm trying to query with a multi-word synonym that includes a stop word. Let's start with an example to explain.
I've got the following documents in an index.
The expected result of the query {"query":{"match":{"test":{"query":"foo of bar"}}}}
is to return documents:
In this example, I have 2 filters:
{
"properties": {
"test": {
"type": "text",
"analyzer": "test_index_analyzer",
"search_analyzer": "test_search_analyzer"
}
}
}
{
"settings" : {
"index": {
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"analyzer": {
"test_index_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"english_stop"
]
},
"test_search_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"english_stop",
"english_syn"
]
}
},
"filter": {
"english_stop": {
"type": "stop",
"stopwords": "_english_",
"ignore_case": true,
"remove_trailing": false
},
"english_syn": {
"type": "synonym_graph",
"synonyms": [
"fb,foo of bar",
"fb,foo bar"
]
}
}
}
}
}
}
token format: "token,start_offset-end_offset,type / position / positionLength"
Query | Search result | Index analysis | Search analysis |
---|---|---|---|
fb | fb | fb,0-2,word,0,1 | foo,0-2,SYNONYM / 0 / 1; foo,0-2,SYNONYM / 0 / 3; fb,0-2,word / 0 / 4; bar,0-2,SYNONYM / 2 / 2; bar,0-2,SYNONYM / 3 / 1 |
foo of bar | fb | foo,0-3,word,0,1; bar,7-10,word,2,1 | fb,0-10,SYNONYM / 0 / 3; foo,0-3,word / 0 / 1; bar,7-10,word / 2 / 1 |
foo bar | fb,foo bar | foo,0-3,word,0,1; bar,4-7,word,1,1 | fb,0-7,SYNONYM / 0 / 2; foo,0-3,word / 0 / 1; bar,4-7,word / 1 / 1 |
All searches are expected to return the 3 lines:
Note: foo of bar is never returned.
My guess is that foo of bar gets indexed with positions [foo, , bar] by the stop filter, while the synonym filter is looking for [foo, bar].
Do you have any advice to reach my goal?
When you use a stop filter, the positions of the remaining words are preserved, so if you check the analyzer output for foo of bar you will get the result below:
{
"tokens" : [
{
"token" : "foo",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 0
},
{
"token" : "bar",
"start_offset" : 7,
"end_offset" : 10,
"type" : "word",
"position" : 2
}
]
}
As you can see, the 'foo' token is at position 0 and the 'bar' token is at position 2, so your synonym filter can't find this document.
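If you want to reproduce this output yourself, the `_analyze` API accepts a request body like the sketch below. It assumes the body is sent to `GET /<your-index>/_analyze` on the index that holds these settings (the index name is not given in the question, so that part is a placeholder). Setting `explain` to true asks for the detailed token attributes, which is where `positionLength` shows up.

```json
{
  "analyzer": "test_index_analyzer",
  "text": "foo of bar",
  "explain": true
}
```

Swapping the analyzer name for `test_search_analyzer` lets you compare the index-time and search-time token streams side by side.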
To solve your problem, apply the synonym filter first and then remove the stop words, as shown below.
"test_search_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"english_syn",
"english_stop"
]
}
You should also add 'foo bar, foo of bar' to your synonym list.
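For illustration, the synonym filter with that extra rule could look like the sketch below. This is only one reading of the advice: it keeps the two rules from the question unchanged and appends the new one, and the outer `filter` wrapper is added here just to make the fragment valid JSON on its own.

```json
{
  "filter": {
    "english_syn": {
      "type": "synonym_graph",
      "synonyms": [
        "fb,foo of bar",
        "fb,foo bar",
        "foo bar,foo of bar"
      ]
    }
  }
}
```

Since synonym changes in a search analyzer only affect query-time analysis, it is worth re-running the three example queries after the change to confirm each returns the expected documents.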
In my opinion, keeping stop words is necessary because it can help you get more precise search results (especially with the BM25 similarity that ES uses). You can check the official Elasticsearch article about it here.