elasticsearch, elasticsearch-analyzers

Anchor token replace patterns to the end of tokens


According to the docs, this should not be possible:

"Regular expressions cannot be anchored to the beginning or end of a token."

Nevertheless, it seems to work for me:

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "pattern_replace",
      "pattern": "(dog)$",
      "replacement": "hot$1"
    }
  ],
  "text": "dog dogs"
}

returns

{
  "tokens" : [
    {
      "token" : "hotdog",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "dogs",
      "start_offset" : 4,
      "end_offset" : 8,
      "type" : "word",
      "position" : 1
    }
  ]
}

Note that the pattern is anchored to the end of the token, and "dogs" is not replaced because it doesn't end with "dog".
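The behavior above can be reproduced outside Elasticsearch: the filter applies the regex to each token in isolation, so `$` matches the end of each token rather than the end of the whole input. A minimal sketch in Python (assuming Java and Python regex anchors behave the same way for this pattern, which they do for `^` and `$` without multiline flags):

```python
import re

def pattern_replace(tokens, pattern, replacement):
    # Apply the regex to each token independently, the way a token
    # filter runs after tokenization has already split the input.
    return [re.sub(pattern, replacement, tok) for tok in tokens]

tokens = "dog dogs".split()  # stands in for the whitespace tokenizer
print(pattern_replace(tokens, r"(dog)$", r"hot\1"))
# ['hotdog', 'dogs'] -- "$" anchors to the end of each token
```

Because each token is matched on its own, "dogs" is untouched: it ends in "s", so `(dog)$` cannot match.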

So my question is: Am I missing something or am I safe to use it (and the docs are just wrong)?


Solution

  • It looks like the documentation is wrong here, and this is an Elasticsearch docs bug. I have looked at the Elastic code, and there is no special handling of the beginning or end of a token.

    Please refer to this ES code, which is used for this token filter: it calls the Lucene token filter, and neither the Elastic nor the Lucene code applies any special handling to anchors.
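    In other words, the filter just hands each term to a plain Java regex matcher and substitutes all matches (or only the first, when the filter is configured that way). A hedged Python sketch of that per-token logic, using a hypothetical `replace_all` flag to mirror the all-vs-first distinction:

    ```python
    import re

    def pattern_replace_filter(token, pattern, replacement, replace_all=True):
        # Sketch of the per-token substitution: no extra anchoring logic,
        # just the regex engine's own semantics applied to the term text.
        count = 0 if replace_all else 1  # 0 means "replace every match"
        return re.sub(pattern, replacement, token, count=count)

    print(pattern_replace_filter("dogdog", r"dog", "cat"))         # catcat
    print(pattern_replace_filter("dogdog", r"dog", "cat", False))  # catdog
    ```

    Since anchoring is handled entirely by the regex engine, `^` and `$` naturally bind to the start and end of the term being filtered.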