regex, elasticsearch, elasticsearch-analyzers

Unable to understand elasticsearch analyser regex


Can someone help me understand why my understanding of an elasticsearch analyser is wrong?

I have an index containing various fields, one in particular is:

"categories": {
    "type": "text",
    "analyzer": "words_only_analyser",
    "copy_to": "all",
    "fields": {
         "tokens": {
             "type": "text",
             "analyzer": "words_only_analyser",
             "term_vector": "yes",
             "fielddata" : True
          }
      }
}

The words_only_analyser looks like:

"words_only_analyser":{
    "type":"custom",
    "tokenizer":"words_only_tokenizer",
    "char_filter" : ["html_strip"],
    "filter":[ "lowercase", "asciifolding", "stop_filter", "kstem" ]
},

and the words_only_tokenizer looks like:

"tokenizer":{
    "words_only_tokenizer":{
    "type":"pattern",
    "pattern":"[^\\w-]+"
    }
}
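
For context, both of these fragments sit under the index's analysis settings. A minimal sketch of how they wire together might look like the following (the stop_filter definition is omitted above, so the stop filter shown here is an assumption):

PUT /contextual
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "words_only_tokenizer": {
          "type": "pattern",
          "pattern": "[^\\w-]+"
        }
      },
      "filter": {
        "stop_filter": {
          "type": "stop",
          "stopwords": "_english_"
        }
      },
      "analyzer": {
        "words_only_analyser": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "words_only_tokenizer",
          "filter": ["lowercase", "asciifolding", "stop_filter", "kstem"]
        }
      }
    }
  }
}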

My understanding of the pattern [^\\w-]+ in the tokenizer is that it will tokenize a sentence by splitting it at any number of occurrences of \, w, or -. For example, given the pattern, a sentence of:

seasonal-christmas-halloween this is a description about halloween

I expect to see:

[seasonal, christmas, hallo, een this is a description about hallo, een]

I can confirm the above on https://regex101.com/.

However, when I run words_only_analyser on the sentence above:

curl -XGET localhost:9200/contextual/_analyze?pretty -H 'Content-Type: application/json' -d '{"analyzer":"words_only_analyser","text":"seasonal-christmas-halloween this is a description about halloween"}'

I get,

{
  "tokens" : [
    {
      "token" : "seasonal-christmas-halloween",
      "start_offset" : 0,
      "end_offset" : 28,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "description",
      "start_offset" : 39,
      "end_offset" : 50,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "halloween",
      "start_offset" : 57,
      "end_offset" : 66,
      "type" : "word",
      "position" : 6
    }
  ]
}

This tells me the sentence gets tokenized to:

[seasonal-christmas-halloween, description, halloween]

It appears to me that the tokenizer pattern is not being applied as I expect. Can someone explain where my understanding is incorrect?


Solution

  • There are a few things that change the final tokens produced by your analyzer: first the tokenizer, and after that the token filters (for example, your stop_filter removes stop words such as this, is, and a).

    You can use the _analyze API to test your tokenizer on its own as well. I created your configuration, and it produces the tokens below.

    POST /contextual/_analyze

    {
        "tokenizer": "words_only_tokenizer", // Note `tokenizer` here
        "text": "seasonal-christmas-halloween this is a description about halloween"
    }
    

    Result

    {
        "tokens": [
            {
                "token": "seasonal-christmas-halloween",
                "start_offset": 0,
                "end_offset": 28,
                "type": "word",
                "position": 0
            },
            {
                "token": "this",
                "start_offset": 29,
                "end_offset": 33,
                "type": "word",
                "position": 1
            },
            {
                "token": "is",
                "start_offset": 34,
                "end_offset": 36,
                "type": "word",
                "position": 2
            },
            {
                "token": "a",
                "start_offset": 37,
                "end_offset": 38,
                "type": "word",
                "position": 3
            },
            {
                "token": "description",
                "start_offset": 39,
                "end_offset": 50,
                "type": "word",
                "position": 4
            },
            {
                "token": "about",
                "start_offset": 51,
                "end_offset": 56,
                "type": "word",
                "position": 5
            },
            {
                "token": "halloween",
                "start_offset": 57,
                "end_offset": 66,
                "type": "word",
                "position": 6
            }
        ]
    }
    

    You can notice that the stop words are still present, since the tokenizer only breaks the text on whitespace and does not split on -. That is exactly what the pattern says: in a pattern tokenizer the regex matches the separators, not the tokens, and [^\\w-]+ means "one or more characters that are neither word characters (letters, digits, underscore) nor hyphens". Spaces match it; hyphens do not.
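
    You can also verify that the pattern describes the separators by defining the tokenizer inline in the _analyze call, which works without any index (a quick sketch; it should return the same tokens as above):

    POST _analyze

    {
        "tokenizer": {
            "type": "pattern",
            "pattern": "[^\\w-]+" // separators: runs of characters that are neither word characters nor hyphens
        },
        "text": "seasonal-christmas-halloween this is a description about halloween"
    }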

    Now, if you run the same text through the analyzer, which also applies the token filters, the stop words are removed and you get the tokens below.

    POST /contextual/_analyze

    {
        "analyzer": "words_only_analyser",
        "text": "seasonal-christmas-halloween this is a description about halloween"
    }
    

    Result

    {
        "tokens": [
            {
                "token": "seasonal-christmas-halloween",
                "start_offset": 0,
                "end_offset": 28,
                "type": "word",
                "position": 0
            },
            {
                "token": "description",
                "start_offset": 39,
                "end_offset": 50,
                "type": "word",
                "position": 4
            },
            {
                "token": "about",
                "start_offset": 51,
                "end_offset": 56,
                "type": "word",
                "position": 5
            },
            {
                "token": "halloween",
                "start_offset": 57,
                "end_offset": 66,
                "type": "word",
                "position": 6
            }
        ]
    }
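
    As an aside, if you wanted - to split tokens as well (closer to what you originally expected), removing the hyphen from the negated character class gives you that: [^\\w]+ is equivalent to \\W+, which is in fact the pattern tokenizer's default pattern. A quick inline check (a sketch):

    POST _analyze

    {
        "tokenizer": {
            "type": "pattern",
            "pattern": "\\W+" // hyphens now count as separators too
        },
        "text": "seasonal-christmas-halloween this is a description about halloween"
    }

    This would produce seasonal, christmas, halloween, this, is, a, description, about, halloween as individual tokens (before any token filters run).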