elasticsearch elasticsearch-2.0

Elasticsearch shingles and stopwords


The example at https://www.elastic.co/guide/en/elasticsearch/guide/current/shingles.html mentions that the standard stopword filter has a negative effect when searching with shingles: the positions of removed stopwords are filled with an underscore, producing shingle tokens that contain underscores (which won't match "regular" text queries).

However, it suggests using an enable_position_increments parameter that is no longer supported by Lucene (and produces an error on at least ES 2.4).

Is there any way to solve this problem, or achieve the same results, without using the unsupported enable_position_increments? Or are the underscores a minor problem that can be worked around?

I was also wondering whether this is a non-issue if you use the same analyzer for search and indexing: if a query includes stopwords, will they be replaced by _ and thus generate tokens that match the indexed shingles (even if the stopwords were different)?
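To make the question above concrete, here is a minimal Python sketch of the behaviour being described. It is a toy model and not Lucene's actual ShingleFilter: the `STOPWORDS` set and the `shingles` helper are hypothetical, and the real output should be checked with _analyze. Under these assumptions it shows that two texts with different stopwords in the same position yield identical shingles, while a text without the stopword does not.

```python
# Assumed stopword list, for illustration only.
STOPWORDS = {"the", "of", "in", "a"}


def shingles(text, min_size=2, max_size=5, filler="_"):
    """Toy model: removed stopwords leave a gap that is rendered as `filler`."""
    # None marks a position where a stopword was removed.
    slots = [None if w in STOPWORDS else w for w in text.lower().split()]
    result = set()
    for start in range(len(slots)):
        for size in range(min_size, max_size + 1):
            gram = slots[start:start + size]
            if len(gram) < size:
                break  # not enough tokens left for this shingle size
            if all(s is None for s in gram):
                continue  # skip shingles made only of removed stopwords
            result.add(" ".join(filler if s is None else s for s in gram))
    return result


indexed = shingles("fox in forest")
query_same_gap = shingles("fox of forest")
query_no_gap = shingles("fox forest")

print(sorted(indexed))
print(indexed == query_same_gap)      # same gap, different stopword
print(bool(indexed & query_no_gap))   # no stopword in the query
```

In this model the stopword's identity is lost but its position is kept, so "fox in forest" and "fox of forest" share every shingle, whereas "fox forest" shares none with either.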


Solution

  • I've found that a possible solution is to set the filler_token parameter to an empty string on the shingle filter, so the underscore will simply be omitted from the tokens:

    "filter_shingle": {
        "type": "shingle",
        "max_shingle_size": 5,
        "min_shingle_size": 2,
        "output_unigrams": false,
        "filler_token": ""
    }
    

    Can someone comment on whether this achieves the same results, or whether it creates any unforeseen problems with scoring or matching? The results from _analyze seem correct: the _ is omitted.
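For context, the filter above only takes effect once it is referenced from an analyzer. The following Python sketch builds one possible index settings body; the analyzer name `shingle_analyzer` and the choice of the `standard` tokenizer with the built-in `lowercase` and `stop` filters are assumptions, not something stated in the question.

```python
import json

# Shingle filter as given in the answer above.
filter_shingle = {
    "type": "shingle",
    "min_shingle_size": 2,
    "max_shingle_size": 5,
    "output_unigrams": False,
    "filler_token": "",
}

index_body = {
    "settings": {
        "analysis": {
            "filter": {"filter_shingle": filter_shingle},
            "analyzer": {
                # "shingle_analyzer" is a hypothetical name.
                "shingle_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Stopwords are removed before shingles are built.
                    "filter": ["lowercase", "stop", "filter_shingle"],
                }
            },
        }
    }
}

# Print the JSON body that would be sent when creating the index.
print(json.dumps(index_body, indent=2))
```

Running sample text with and without stopwords through _analyze using such an analyzer is probably the quickest way to see exactly which tokens the empty filler produces, including any leftover separator whitespace.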