elasticsearch, query-analyzer

How to make Elasticsearch disregard spaces in certain queries?


My Elasticsearch documents have a field Name with entries like:

Samsung Galaxy S3
Samsung Galaxy Ace Duos 3
Samsung Galaxy Duos 3
Samsung Galaxy S2
Samsung Galaxy S (I9000)

On querying this field with the following query (notice the space between "s" and "3"):

{
  "query": {
    "match": {
      "Name": {
        "query": "galaxy s 3",
        "fuzziness": 2,
        "prefix_length": 1
      }
    }
  }
}

It returns "Samsung Galaxy Duos 3" as a relevant result, but not "Samsung Galaxy S3".

The pattern I have in mind for such a task is to disregard the space between a number and a single alphabetical character before making the query. For example, "I-phone 5s" should then also be returned for the query "I-phone 5 s".

Is there a nice way to accomplish this?


Solution

  • You need to change your analyser so that it breaks the string up wherever the text switches between letters and digits. A pattern analyser with a regular expression does this: the pattern below splits on any run of non-alphanumeric characters and on every boundary between a digit and a non-digit (it is based on the CamelCase example for the pattern analyser):

    curl -XPUT 'localhost:9200/myindex/' -d '
         {
             "settings":{
                 "analysis": {
                     "analyzer": {
                         "mynewanalyser":{
                             "type": "pattern",
                             "pattern":"([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)"
                         }
                     }
                 }
             }
         }'
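
    For the Name field to actually be indexed with this analyser, it also needs to be referenced in the field mapping. A minimal sketch, assuming a document type called product (the type name is not given in the question) and matching the pre-5.x request style used above:

    curl -XPUT 'localhost:9200/myindex/_mapping/product' -d '
         {
             "properties": {
                 "Name": {
                     "type": "string",
                     "analyzer": "mynewanalyser"
                 }
             }
         }'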
    

    Testing the new analyser with your string:

    curl -XGET 'localhost:9200/myindex/_analyze?analyzer=mynewanalyser&pretty' -d 'Samsung Galaxy S3'
    {
      "tokens" : [ {
        "token" : "samsung",
        "start_offset" : 0,
        "end_offset" : 7,
        "type" : "word",
        "position" : 1
      }, {
        "token" : "galaxy",
        "start_offset" : 8,
        "end_offset" : 14,
        "type" : "word",
        "position" : 2
      }, {
        "token" : "s",
        "start_offset" : 15,
        "end_offset" : 16,
        "type" : "word",
        "position" : 3
      }, {
        "token" : "3",
        "start_offset" : 16,
        "end_offset" : 17,
        "type" : "word",
        "position" : 4
      } ]
    }
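
    The query string "galaxy s 3" is analysed into the same tokens (galaxy, s, 3) as the indexed name "Samsung Galaxy S3", so the match query should now find it. A sketch of re-running the original search, assuming the documents have been re-indexed into myindex with the mapping applied:

    curl -XGET 'localhost:9200/myindex/_search?pretty' -d '
         {
             "query": {
                 "match": {
                     "Name": {
                         "query": "galaxy s 3",
                         "fuzziness": 2,
                         "prefix_length": 1
                     }
                 }
             }
         }'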