Tags: azure, azure-cognitive-search, n-gram, azure-search-.net-sdk

Azure Search N-gram Tokenizer Configuration for infix searching


I am currently working with Azure Search and want to achieve infix search: searching for 'win' in 'redwine', for example, should return redwine in the results. My n-gram tokenizer configuration in Azure is below:

    "analyzers": [
      {
        "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
        "name": "myCustomAnalyzer",
        "tokenizer": "nGram",
        "tokenFilters": [
          "my_NGram"
        ],
        "charFilters": []
      }
    ],
    "tokenFilters": [
      {
        "@odata.type": "#Microsoft.Azure.Search.NGramTokenFilterV2",
        "name": "my_NGram",
        "minGram": 2,
        "maxGram": 3
      }
    ]

Now, as per my understanding, the above configuration should return tokens for redwine like Re, Red, ed, Wi, Win, in, ine, ne. Instead, when I check the tokens generated using the Azure analyze endpoint, I only get the 2-character min-grams shown below. What is missing from this configuration?

{
    "@odata.context": "https://trialsearchresource.search.windows.net/$metadata#Microsoft.Azure.Search.V2021_04_30_Preview.AnalyzeResult",
    "tokens": [
        {
            "token": "re",
            "startOffset": 0,
            "endOffset": 2,
            "position": 1
        },
        {
            "token": "ed",
            "startOffset": 1,
            "endOffset": 3,
            "position": 3
        },
        {
            "token": "dw",
            "startOffset": 2,
            "endOffset": 4,
            "position": 5
        },
        {
            "token": "wi",
            "startOffset": 3,
            "endOffset": 5,
            "position": 7
        },
        {
            "token": "in",
            "startOffset": 4,
            "endOffset": 6,
            "position": 9
        },
        {
            "token": "ne",
            "startOffset": 5,
            "endOffset": 7,
            "position": 11
        }
    ]
}

P.S. I am using the Azure Search .NET Core SDK.


Solution

  • You are using a token filter, but what you actually want, based on the expected result above, is an n-gram tokenizer with a minimum gram size of 2 and a maximum of 3. The predefined nGram tokenizer you referenced keeps its defaults of minGram=1 and maxGram=2, so after your NGramTokenFilterV2 (minGram=2) runs, only the 2-character tokens survive. The following definition should achieve what you are looking for:

    "analyzers": [
      {
        "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
        "name": "myCustomAnalyzer",
        "tokenizer": "myTokenizer",
        "charFilters": ["myCharMapping"]
      }
    ],
    "tokenizers": [
      {
        "name": "myTokenizer",
        "@odata.type": "#Microsoft.Azure.Search.NGramTokenizer",
        "minGram": 2,
        "maxGram": 3
      }
    ],
    "charFilters": [
      {
        "@odata.type": "#Microsoft.Azure.Search.MappingCharFilter",
        "name": "myCharMapping",
        "mappings": [
          "\\u0020=>"
        ]
      }
    ]
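As a sanity check on the diagnosis, the output you observed can be reproduced with a short Python sketch. This is a simplification of the Lucene pipeline, not the actual implementation, and it assumes the predefined nGram tokenizer's defaults of minGram=1 and maxGram=2:

```python
def ngrams(text, min_gram, max_gram):
    """All character n-grams of text with lengths in [min_gram, max_gram]."""
    return [text[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(text) - n + 1)]

# The predefined "nGram" tokenizer keeps its defaults (minGram=1, maxGram=2),
# so "redwine" is first split into 1- and 2-character tokens.
tokenizer_tokens = ngrams("redwine", 1, 2)

# NGramTokenFilterV2 (minGram=2, maxGram=3) then re-grams each token.
# A 1-character token yields nothing, and a 2-character token can only
# yield itself -- so no 3-character gram can ever be produced.
filtered = sorted({g for tok in tokenizer_tokens for g in ngrams(tok, 2, 3)})

print(filtered)  # ['dw', 'ed', 'in', 'ne', 're', 'wi'] -- only 2-grams, as observed
```

This is exactly the token set your analyze call returned: the filter cannot make tokens longer than the tokenizer's output, which is why gramming must happen in the tokenizer itself.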
    

    Note that I have added a char filter to remove spaces. Without it, the tokenizer would include whitespace in the grams, so for "red wine" you would get grams such as "d ", " w", "ed ", and " wi".
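The combined effect of the char filter and the tokenizer can be sketched in a few lines of Python (again a simplification of the Lucene behavior, not the real implementation):

```python
def ngrams(text, min_gram, max_gram):
    """All character n-grams of text with lengths in [min_gram, max_gram]."""
    return [text[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(text) - n + 1)]

# Char filter: drop spaces before tokenizing (mirrors the "\u0020=>" mapping).
normalized = "red wine".replace(" ", "")   # -> "redwine"

# NGramTokenizer with minGram=2, maxGram=3 grams the whole filtered input,
# so the 3-gram "win" is now emitted and infix search can match it.
tokens = ngrams(normalized, 2, 3)
print(tokens)

# Without the char filter, whitespace leaks into the grams:
with_spaces = ngrams("red wine", 2, 3)     # contains "d ", " w", "ed ", " wi", ...
```

Since "win" is now indexed as a token, a query for 'win' matches the document containing redwine.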