elasticsearch, elasticsearch-6, elasticsearch-analyzers

preserve_original not preserving the original token in Elasticsearch


I have a token filter and analyzer as follows, but I can't get the original token to be preserved. For example, if I _analyze the word saint-louis, I get back only saintlouis, whereas I expected to get both saintlouis and saint-louis, since I have preserve_original set to true. The ES version I am using is 6.3.2 and the Lucene version is 7.3.1.

"analysis": {
  "filter": {
    "hyphenFilter": {
      "pattern": "-",
      "type": "pattern_replace",
      "preserve_original": "true",
      "replacement": ""
    }
  },
  "analyzer": {
    "whitespace_lowercase": {
      "filter": [
        "lowercase",
        "asciifolding",
        "hyphenFilter"
      ],
      "type": "custom",
      "tokenizer": "whitespace"
    }
  }
}
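
For reference, this is roughly the _analyze request I mean. The custom analyzer is defined in the index settings, so the call targets the index; my_index is just a placeholder name:

POST /my_index/_analyze
{
  "text": "saint-louis",
  "analyzer": "whitespace_lowercase"
}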

Solution

  • It looks like preserve_original is not supported on the pattern_replace token filter, at least not in the version I am using.

    I made a workaround as follows:

    Index Def

    {
        "settings": {
            "analysis": {
                "analyzer": {
                    "my_analyzer": {
                        "tokenizer": "whitespace",
                        "type": "custom",
                        "filter": [
                            "lowercase",
                            "hyphen_filter"
                        ]
                    }
                },
                "filter": {
                    "hyphen_filter": {
                        "type": "word_delimiter",
                        "preserve_original": "true",
                        "catenate_words": "true"
                    }
                }
            }
        }
    }
    
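    Assuming the index is created with these settings under a placeholder name such as my_index, the analyzer still has to be attached to a field before it takes effect at index and search time. A minimal mapping update for a made-up title field could look like this (ES 6.x still uses a mapping type, here _doc):

    PUT /my_index/_mapping/_doc
    {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "my_analyzer"
            }
        }
    }
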

    This would, for example, tokenize a word like anti-spam into antispam (hyphen removed), anti-spam (original preserved), anti, and spam.

    Analyze API call to see the generated tokens. The custom analyzer lives in the index settings, so the request has to target that index (my_index is again a placeholder name):

    POST /my_index/_analyze

    { "text": "anti-spam", "analyzer": "my_analyzer" }

    Output of the analyze API, i.e. the generated tokens:

    {
        "tokens": [
            {
                "token": "anti-spam",
                "start_offset": 0,
                "end_offset": 9,
                "type": "word",
                "position": 0
            },
            {
                "token": "anti",
                "start_offset": 0,
                "end_offset": 4,
                "type": "word",
                "position": 0
            },
            {
                "token": "antispam",
                "start_offset": 0,
                "end_offset": 9,
                "type": "word",
                "position": 0
            },
            {
                "token": "spam",
                "start_offset": 5,
                "end_offset": 9,
                "type": "word",
                "position": 1
            }
        ]
    }
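
    With a field mapped to this analyzer, a plain match query should find a document containing anti-spam whether the query text uses the hyphenated or the concatenated form, because the query string is run through the same analyzer at search time. A sketch, with the index and field names again being placeholders:

    GET /my_index/_search
    {
        "query": {
            "match": {
                "title": "antispam"
            }
        }
    }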