elasticsearch elasticsearch-analyzers

Create all possible tokens in order in Elasticsearch


I am trying to create an analyzer that returns all possible tokens. For example, for the input AB-12-1993 xyz.pdf the generated tokens should include AB, AB-12, -12-1993, 12-1993, -1993, 1993, AB-12-1993 xyz, xyz, xyz.pdf, and AB-12-1993 xyz.pdf. Extra tokens are not a problem, but these must be generated.

I have tried the whitespace analyzer with ngram, but -12-1993, 12-1993, -1993, and 1993 are not generated.

I have also tried this with different analyzers, but it did not help.

I am using Elasticsearch 8.3.3. Can somebody please help me out?


Solution

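  • You can use the analyzer definition below. The keyword tokenizer keeps the whole input as a single token, and the ngram filter then emits every substring of 2 to 10 characters, which covers the shorter tokens you listed. Tokens longer than 10 characters, such as AB-12-1993 xyz (14 characters) and the full AB-12-1993 xyz.pdf (18 characters), are only emitted if you raise max_gram to at least that length, with index.max_ngram_diff set to at least max_gram minus min_gram.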

    PUT ngram_custom_example
    {
      "settings": {
        "index": {
          "max_ngram_diff": 10
        },
        "analysis": {
          "analyzer": {
            "default": {
              "tokenizer": "keyword",
              "filter": [ "2_10_grams" ]
            }
          },
          "filter": {
            "2_10_grams": {
              "type": "ngram",
              "min_gram": 2,
              "max_gram": 10
            }
          }
        }
      }
    }
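
    To check which tokens the analyzer actually emits, one option is the analyze API. The request below is a minimal sketch; it assumes the index ngram_custom_example was created with the settings above, and it omits the analyzer name so that the index's default analyzer is used. The sample text is the one from the question.

    ```console
    GET ngram_custom_example/_analyze
    {
      "text": "AB-12-1993 xyz.pdf"
    }
    ```

    The response lists each token with its offsets, so you can confirm that entries such as AB, 1993 and xyz.pdf are present and see which longer tokens are missing with the current max_gram.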