elasticsearch, elasticsearch-2.0, elasticsearch-5

How to match a phrase in Elasticsearch with an expandable prefix and suffix?


We have a use case in which we want to match phrases in Elasticsearch, but in addition to a plain phrase query we also want to match partial phrases.

Example:

Search phrase: "welcome you", "lcome you", "welcome yo", or "lcome yo". Each of these should match documents containing the phrases:

"welcome you"

"we welcome you"

"welcome you to"

"we welcome you to"

That is, we want to preserve word order as a phrase query does, with the added ability to return results that contain the phrase as a partial substring, with the prefix and suffix expandable up to a configurable length. In Elasticsearch I found something similar, 'match_phrase_prefix', but it only matches phrases that start with a particular prefix.

For example, this returns results starting with the prefix d:

$ curl -XGET 'localhost:9200/startswith/test/_search?pretty' -d '{
    "query": {
        "match_phrase_prefix": {
            "title": {
                "query": "d",
                "max_expansions": 5
            }
        }
    }
}'

Is there any way to achieve this for the suffix as well?


Solution

  • I would strongly encourage you to look into the shingle token filter.

    You can define an index with a custom analyzer that uses shingles to index each pair of adjacent tokens together, in addition to the individual tokens themselves.

    curl -XPUT localhost:9200/startswith -d '{
      "settings": {
          "analysis": {
            "analyzer": {
              "my_shingles": {
                "tokenizer": "standard",
                "filter": [
                  "lowercase",
                  "shingles"
                ]
              }
            },
            "filter": {
              "shingles": {
                "type": "shingle",
                "min_shingle_size": 2,
                "max_shingle_size": 2,
                "output_unigrams": true
              }
            }
          }
      },
      "mappings": {
        "test": {
          "properties": {
            "title": {
              "type": "text",
              "analyzer": "my_shingles"
            }
          }
        }
      }
    }'
    

    For instance, "we welcome you to" would be indexed as the following tokens: we, we welcome, welcome, welcome you, you, you to, to.
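
The token stream for that example can be reproduced with a small Python sketch. It is only an approximation of the analyzer defined above: splitting on whitespace stands in for the standard tokenizer, and the function name shingle_tokens and its parameters are made up for illustration rather than being part of any Elasticsearch API.

```python
def shingle_tokens(text, min_size=2, max_size=2, output_unigrams=True):
    """Approximate the output of a lowercase filter followed by a shingle filter."""
    # Whitespace splitting is a stand-in for the standard tokenizer (assumption).
    words = text.lower().split()
    tokens = []
    for i, word in enumerate(words):
        if output_unigrams:
            tokens.append(word)  # the single word itself
        for size in range(min_size, max_size + 1):
            if i + size <= len(words):
                # a shingle: `size` adjacent words joined by a space
                tokens.append(" ".join(words[i:i + size]))
    return tokens


print(shingle_tokens("we welcome you to"))
# ['we', 'we welcome', 'welcome', 'welcome you', 'you', 'you to', 'to']
```

Against a running cluster, the _analyze API of the index would show the real token stream for the my_shingles analyzer.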

    Then you can index a few sample documents:

    curl -XPUT localhost:9200/startswith/test/_bulk -d '
    {"index": {}}
    {"title": "welcome you"}
    {"index": {}}
    {"title": "we welcome you"}
    {"index": {}}
    {"title": "welcome you to"}
    {"index": {}}
    {"title": "we welcome you to"}
    '
    

    Finally, you can run the following query, which matches all four documents above:

    curl -XPOST localhost:9200/startswith/test/_search -d '{
       "query": {
           "match": {"title": "welcome you"}
       }
    }'
    

    Note that this approach is more powerful than the match_phrase_prefix query, because it lets you match consecutive tokens anywhere in the body of text, whether at the beginning or the end.
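
To illustrate why the match query finds all four sample documents, here is a rough Python sketch that compares the shingled tokens of the query with those of each title. It assumes whitespace tokenization in place of the standard tokenizer; the helper shingles and the extra title "you are welcome" are hypothetical and not part of the sample data above.

```python
def shingles(text):
    """Lowercased unigrams plus 2-word shingles (rough stand-in for my_shingles)."""
    words = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return set(words) | set(bigrams)


docs = ["welcome you", "we welcome you", "welcome you to", "we welcome you to"]
query_tokens = shingles("welcome you")  # {'welcome', 'you', 'welcome you'}

for title in docs:
    shared = sorted(query_tokens & shingles(title))
    print(f"{title!r} shares {shared}")

# Hypothetical extra title: it shares the unigrams but not the ordered pair.
other = "you are welcome"
print(f"{other!r} shares {sorted(query_tokens & shingles(other))}")
```

Every sample title contains the two-word token "welcome you", so word order is captured at index time. Because output_unigrams is true and a match query combines terms with OR by default, a title that shares only the single words would still be returned, though it has no shingle token in common with the query.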