I'm using a custom index analyzer to remove a certain set of stop words. I'm then making phrase match queries with text that includes some of those stop words. I would expect the stop words to be filtered out of the query; however, they are not, and any documents that do not include them are excluded from the results.
Here's a simplified example of what I'm trying to do:
#!/bin/bash
export ELASTICSEARCH_ENDPOINT="http://localhost:9200"
# Create index, with a custom analyzer to filter out the word 'foo'
curl -XPUT "$ELASTICSEARCH_ENDPOINT/play" -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "fooAnalyzer": {
          "type": "custom",
          "tokenizer": "letter",
          "filter": [
            "fooFilter"
          ]
        }
      },
      "filter": {
        "fooFilter": {
          "type": "stop",
          "stopwords": [
            "foo"
          ]
        }
      }
    }
  },
  "mappings": {
    "myDocument": {
      "properties": {
        "myMessage": {
          "analyzer": "fooAnalyzer",
          "type": "string"
        }
      }
    }
  }
}'
# Add sample document
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_bulk?refresh=true" -d '
{"index":{"_index":"play","_type":"myDocument"}}
{"myMessage":"bar baz"}
'
If I perform a phrase match search against this index with a filtered stop word in the middle of the query, I would expect it to match (since 'foo' should be filtered away by our analyzer).
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -d '
{
  "query": {
    "match": {
      "myMessage": {
        "type": "phrase",
        "query": "bar foo baz"
      }
    }
  }
}
'
However, I get no results.
Is there a way to instruct Elasticsearch to tokenize and filter the query string before performing the search?
Edit 1: Now I'm even more confused. Before, I was seeing that phrase matching didn't work if my query contained stop words in the middle of the query text. Now, in addition, I'm seeing that the phrase query does not work if the document contains stop words in the middle of the text being matched. Here's a minimal example, still using the mapping from above.
POST play/myDocument
{
  "myMessage": "fib foo bar" <---- remember that 'foo' is a stopword and is filtered out of analysis
}
GET play/_search
{
  "query": {
    "match": {
      "myMessage": {
        "type": "phrase",
        "query": "fib bar"
      }
    }
  }
}
This query does not match. I'm very surprised by this! I would expect the foo stop word to be filtered out and ignored.
For an example of why I'd expect this, see this query:
POST play/myDocument
{
  "myMessage": "fib 123 bar"
}
GET play/_search
{
  "query": {
    "match": {
      "myMessage": {
        "type": "phrase",
        "query": "fib bar"
      }
    }
  }
}
This matches, because the '123' is filtered out by my 'letter' tokenizer. It seems like phrase matching is ignoring the stop word filtering completely, and acting as if those tokens were in the analyzed field all along (even though they don't show up in the list of tokens from _analyze).
My current best idea for a workaround:
1. Call the _analyze endpoint against my document's text string using my custom analyzer. This will return the tokens from the original text string but remove the pesky stop words for me.
2. Store those tokens in a "filtered" field in the document.
Later, at query time:
1. Call the _analyze endpoint against my query string using my custom analyzer to get just the tokens.
2. Run the phrase match query with those tokens against the "filtered" field.

It turns out that if you want to use phrase matching, the token filter is too late to remove unwanted words. By that point every token has already been assigned a position, and removing a stop word leaves a gap in the positions of the tokens around it. The position field of your significant tokens is polluted by the existence of the filtered tokens, so the phrase no longer lines up and the phrase matching refuses to work.
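To make the position problem concrete, here is a small Python sketch. It is a toy model, not Elasticsearch itself: the function names analyze and phrase_matches are made up for illustration, the tokenizer only handles ASCII letters, and the phrase check assumes a slop of 0. The one behavior it borrows from the real stop filter is that a removed stop word still uses up a position.

```python
import re

STOPWORDS = {"foo"}


def analyze(text, stopwords=STOPWORDS):
    """Toy model of a 'letter' tokenizer followed by a 'stop' token filter.

    Tokens are runs of ASCII letters (a simplification). A removed stop
    word still consumes a position, so the surviving tokens keep a gap.
    """
    tokens = []
    position = 0
    for word in re.findall(r"[A-Za-z]+", text):
        if word not in stopwords:
            tokens.append((word, position))
        position += 1  # incremented even when the word was dropped
    return tokens


def phrase_matches(doc_text, query_text):
    """Exact phrase match (slop 0): relative positions must line up."""
    doc = analyze(doc_text)
    query = analyze(query_text)
    if not query:
        return False
    # Map each document term to the set of positions where it occurs.
    doc_positions = {}
    for term, pos in doc:
        doc_positions.setdefault(term, set()).add(pos)
    first_term, first_pos = query[0]
    for start in doc_positions.get(first_term, ()):
        offset = start - first_pos
        if all(pos + offset in doc_positions.get(term, ())
               for term, pos in query):
            return True
    return False


print(analyze("fib foo bar"))                     # [('fib', 0), ('bar', 2)]
print(phrase_matches("bar baz", "bar foo baz"))   # False
print(phrase_matches("fib foo bar", "fib bar"))   # False
print(phrase_matches("fib 123 bar", "fib bar"))   # True
```

In this model 'bar' sits at position 2 in "fib foo bar" but at position 1 in the query "fib bar", which is why the phrase does not match. The '123' case works because the tokenizer never emits a token for it, so no position is used up.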
The answer: filter before we get to the token filter level. I created a char_filter that removes our unwanted term and phrase matching started working correctly!
PUT play
{
  "settings": {
    "analysis": {
      "analyzer": {
        "fooAnalyzer": {
          "type": "custom",
          "tokenizer": "letter",
          "char_filter": [
            "fooFilter"
          ]
        }
      },
      "char_filter": {
        "fooFilter": {
          "type": "pattern_replace",
          "pattern": "(foo)",
          "replacement": ""
        }
      }
    }
  },
  "mappings": {
    "myDocument": {
      "properties": {
        "myMessage": {
          "analyzer": "fooAnalyzer",
          "type": "string"
        }
      }
    }
  }
}
Queries:
POST play/myDocument
{
  "myMessage": "fib bar"
}
GET play/_search
{
  "query": {
    "match": {
      "myMessage": {
        "type": "phrase",
        "query": "fib foo bar"
      }
    }
  }
}
and
POST play/myDocument
{
  "myMessage": "fib foo bar"
}
GET play/_search
{
  "query": {
    "match": {
      "myMessage": {
        "type": "phrase",
        "query": "fib bar"
      }
    }
  }
}
both now work!
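For anyone wondering why the char_filter version behaves differently, here is a rough Python model of it. Again this is only a sketch: analyze_with_char_filter is a made-up name, and Python's re module stands in for the Java regex engine that pattern_replace uses. Because the text is rewritten before tokenization, the unwanted word never becomes a token and never takes up a position.

```python
import re


def analyze_with_char_filter(text, pattern=r"(foo)"):
    """Toy model: pattern_replace char filter, then a 'letter' tokenizer.

    The pattern is stripped from the raw text first, so the tokenizer
    numbers the remaining tokens consecutively with no gaps.
    """
    filtered = re.sub(pattern, "", text)
    words = re.findall(r"[A-Za-z]+", filtered)  # ASCII-only simplification
    return [(word, position) for position, word in enumerate(words)]


print(analyze_with_char_filter("fib foo bar"))  # [('fib', 0), ('bar', 1)]
print(analyze_with_char_filter("fib bar"))      # [('fib', 0), ('bar', 1)]

# Caveat: the pattern matches substrings, not whole words.
print(analyze_with_char_filter("food"))         # [('d', 0)]

# A word-boundary pattern avoids that in this model.
print(analyze_with_char_filter("food foo", r"\bfoo\b"))  # [('food', 0)]
```

One thing to watch out for: "(foo)" also strips 'foo' out of longer words such as 'food'. A word-boundary pattern like \bfoo\b (written as "\\bfoo\\b" inside the JSON settings) should avoid that, though I have only checked it in this toy model and not against a real index.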