regex, elasticsearch, elasticsearch-analyzers

Unable to understand elasticsearch analyser regex


Can someone help me understand why my understanding of an elasticsearch analyser is wrong?

I have an index containing various fields, one in particular is:

"categories": {
    "type": "text",
    "analyzer": "words_only_analyser",
    "copy_to": "all",
    "fields": {
         "tokens": {
             "type": "text",
             "analyzer": "words_only_analyser",
             "term_vector": "yes",
             "fielddata" : True
          }
      }
}

The words_only_analyser looks like:

"words_only_analyser":{
    "type":"custom",
    "tokenizer":"words_only_tokenizer",
    "char_filter" : ["html_strip"],
    "filter":[ "lowercase", "asciifolding", "stop_filter", "kstem" ]
},

and the words_only_tokenizer looks like:

"tokenizer":{
    "words_only_tokenizer":{
    "type":"pattern",
    "pattern":"[^\\w-]+"
    }
}
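
For context, both of these fragments sit under the index's analysis settings. A minimal sketch of how they wire together might look like the following (the stop_filter definition is omitted above, so the stop filter shown here is an assumption):

PUT /contextual
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "words_only_tokenizer": {
          "type": "pattern",
          "pattern": "[^\\w-]+"
        }
      },
      "filter": {
        "stop_filter": {
          "type": "stop",
          "stopwords": "_english_"
        }
      },
      "analyzer": {
        "words_only_analyser": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "words_only_tokenizer",
          "filter": ["lowercase", "asciifolding", "stop_filter", "kstem"]
        }
      }
    }
  }
}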

My understanding of the pattern [^\\w-]+ in the tokenizer is that it will tokenize a sentence by splitting it at any number of occurrences of \, w, or -. For example, given the pattern, a sentence of:

seasonal-christmas-halloween this is a description about halloween

I expect to see:

[seasonal, christmas, hallo, een this is a description about hallo, een]

I can confirm the above on https://regex101.com/.

However, when I run words_only_analyser on the sentence above:

curl -XGET localhost:9200/contextual/_analyze?pretty -H 'Content-Type: application/json' -d '{"analyzer":"words_only_analyser","text":"seasonal-christmas-halloween this is a description about halloween"}'

I get,

{
  "tokens" : [
    {
      "token" : "seasonal-christmas-halloween",
      "start_offset" : 0,
      "end_offset" : 28,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "description",
      "start_offset" : 39,
      "end_offset" : 50,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "halloween",
      "start_offset" : 57,
      "end_offset" : 66,
      "type" : "word",
      "position" : 6
    }
  ]
}

This tells me the sentence gets tokenized to:

[seasonal-christmas-halloween, description, halloween]

It appears to me that the tokenizer pattern is not being applied as I expect. Can someone explain where my understanding is incorrect?


Solution

  • There are a few things that change the final tokens produced by your analyzer: first the tokenizer, and after that the token filters (for example, your stop_filter removes stop words such as this, is, and a).

    You can use the _analyze API to test your tokenizer on its own as well. I created your configuration, and it produces the tokens below.

    POST /contextual/_analyze

    {
        "tokenizer": "words_only_tokenizer", // Note `tokenizer` here
        "text": "seasonal-christmas-halloween this is a description about halloween"
    }
    

    Result

    {
        "tokens": [
            {
                "token": "seasonal-christmas-halloween",
                "start_offset": 0,
                "end_offset": 28,
                "type": "word",
                "position": 0
            },
            {
                "token": "this",
                "start_offset": 29,
                "end_offset": 33,
                "type": "word",
                "position": 1
            },
            {
                "token": "is",
                "start_offset": 34,
                "end_offset": 36,
                "type": "word",
                "position": 2
            },
            {
                "token": "a",
                "start_offset": 37,
                "end_offset": 38,
                "type": "word",
                "position": 3
            },
            {
                "token": "description",
                "start_offset": 39,
                "end_offset": 50,
                "type": "word",
                "position": 4
            },
            {
                "token": "about",
                "start_offset": 51,
                "end_offset": 56,
                "type": "word",
                "position": 5
            },
            {
                "token": "halloween",
                "start_offset": 57,
                "end_offset": 66,
                "type": "word",
                "position": 6
            }
        ]
    }
    

    You can notice that the stop words are still present, since the tokenizer only breaks the text on whitespace and does not split on -. That is exactly what the pattern says: in a pattern tokenizer the regex matches the separators, not the tokens, and [^\\w-]+ means "one or more characters that are neither word characters (letters, digits, underscore) nor hyphens". Spaces match it; hyphens do not.
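
    You can also verify that the pattern describes the separators by defining the tokenizer inline in the _analyze call, which works without any index (a quick sketch; it should return the same tokens as above):

    POST _analyze

    {
        "tokenizer": {
            "type": "pattern",
            "pattern": "[^\\w-]+" // separators: runs of characters that are neither word characters nor hyphens
        },
        "text": "seasonal-christmas-halloween this is a description about halloween"
    }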

    Now, if you run the same text through the analyzer, which also applies the token filters, the stop words are removed and you get the tokens below.

    POST /contextual/_analyze

    {
        "analyzer": "words_only_analyser",
        "text": "seasonal-christmas-halloween this is a description about halloween"
    }
    

    Result

    {
        "tokens": [
            {
                "token": "seasonal-christmas-halloween",
                "start_offset": 0,
                "end_offset": 28,
                "type": "word",
                "position": 0
            },
            {
                "token": "description",
                "start_offset": 39,
                "end_offset": 50,
                "type": "word",
                "position": 4
            },
            {
                "token": "about",
                "start_offset": 51,
                "end_offset": 56,
                "type": "word",
                "position": 5
            },
            {
                "token": "halloween",
                "start_offset": 57,
                "end_offset": 66,
                "type": "word",
                "position": 6
            }
        ]
    }
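
    As an aside, if you wanted - to split tokens as well (closer to what you originally expected), removing the hyphen from the negated character class gives you that: [^\\w]+ is equivalent to \\W+, which is in fact the pattern tokenizer's default pattern. A quick inline check (a sketch):

    POST _analyze

    {
        "tokenizer": {
            "type": "pattern",
            "pattern": "\\W+" // hyphens now count as separators too
        },
        "text": "seasonal-christmas-halloween this is a description about halloween"
    }

    This would produce seasonal, christmas, halloween, this, is, a, description, about, halloween as individual tokens (before any token filters run).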