azuresearch · azure-cognitive-search

Break specific words into multiple tokens


I have this analyzer and it's working as expected, but I wanted to tune it up a little bit. I built this analyzer mainly to keep dashed values from being split into 2 tokens. Now it would also be cool if it could emit 3 forms: one with the dash (123-456, as it is doing right now), one all together without the dash (123456), and one split into 2 tokens (123)(456). I tried messing around with other analyzers but none seem to make it work. Does anyone have any ideas on how to approach this?

  {
    // ...rest of the index definition omitted
    scoringProfiles: [
      {
        name: "product_search",
        textWeights: {
          weights: {
            "title": 5
          }
        }
      }
    ],
    charFilters: [
      {
        odatatype: "#Microsoft.Azure.Search.MappingCharFilter",
        name: "dash",
        mappings: ["-=>"]
      }
    ],
    analyzers: [
      {
        odatatype: "#Microsoft.Azure.Search.CustomAnalyzer",
        name: "dash-removal",
        tokenizerName: "whitespace",
        tokenFilters: ["lowercase"]
      }
    ]
  }


Solution

  • To get the desired tokenization behavior in Azure Cognitive Search, you can create a custom analyzer that emits all three forms:

    Use the built-in keyword_v2 tokenizer, which treats the entire input as a single token, so the original 123-456 is preserved. A mapping character filter that strips dashes is not needed here; worse, because character filters run before the tokenizer, it would delete the dash, leaving the word delimiter filter with nothing to split on.

    Then apply a WordDelimiterTokenFilter. It splits tokens at delimiter characters (like dashes) and, with its catenate and preserveOriginal options enabled, also emits the concatenated and original forms. For more details, see the WordDelimiterTokenFilter Class documentation.

    Sample Analyzer Configuration:

    {
      "tokenFilters": [
        {
          "@odata.type": "#Microsoft.Azure.Search.WordDelimiterTokenFilter",
          "name": "word_delimiter",
          "generateWordParts": true,
          "generateNumberParts": true,
          "catenateWords": true,
          "catenateNumbers": true,
          "catenateAll": true,
          "splitOnCaseChange": true,
          "splitOnNumerics": true,
          "preserveOriginal": true
        }
      ],
      "analyzers": [
        {
          "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
          "name": "custom_dash_analyzer",
          "tokenizer": "keyword_v2",
          "tokenFilters": ["word_delimiter", "lowercase"]
        }
      ]
    }

    Note that keyword_v2 is referenced by its built-in name rather than redefined (custom component names must not collide with built-in ones), and word_delimiter runs before lowercase — lowercasing first would defeat splitOnCaseChange for mixed-case input.
    

    With this configuration, an input like 123-456 produces the tokens 123-456, 123456, 123, and 456, accommodating all three search scenarios.
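    You can verify the actual token output with the Analyze Text REST API (POST {endpoint}/indexes/{index}/analyze) on the index that defines the analyzer; the response's tokens array shows exactly what the analyzer emits. A minimal sketch in Python that builds the request (the endpoint, index name, and api-version below are placeholders — substitute your own, and send the request with your service's api-key header):

```python
import json

# A GA api-version; adjust to whatever your service supports.
API_VERSION = "2023-11-01"

def build_analyze_request(endpoint, index_name, analyzer_name, text):
    """Build the URL and JSON body for the Analyze Text API call.

    POST {endpoint}/indexes/{index}/analyze?api-version=...
    Body: {"analyzer": "...", "text": "..."}
    """
    url = f"{endpoint}/indexes/{index_name}/analyze?api-version={API_VERSION}"
    body = json.dumps({"analyzer": analyzer_name, "text": text})
    return url, body

# Hypothetical service endpoint and index name -- replace with your own.
url, body = build_analyze_request(
    "https://my-service.search.windows.net", "products",
    "custom_dash_analyzer", "123-456")
print(url)
print(body)
```

    Send this with an api-key header (or an Azure AD token) and inspect the returned tokens to confirm the analyzer behaves as expected before wiring it into a field.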

    For further information on token filters and their configurations, consult the TokenFilterName Struct documentation.