I am currently working with Azure Search and want to achieve infix search, so that searching for 'win' in 'redwine' finds redwine in the search results. My configuration for the n-gram tokenizer in Azure is below:
"analyzers": [
{
"@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"name": "myCustomAnalyzer",
"tokenizer": "nGram",
"tokenFilters": [
"my_NGram"
],
"charFilters": []
}
]
"tokenFilters": [
{
"@odata.type": "#Microsoft.Azure.Search.NGramTokenFilterV2",
"name": "my_NGram",
"minGram": 2,
"maxGram": 3
}
]
As per my understanding, the above configuration should return tokens for 'redwine' such as re, red, ed, wi, win, in, ine, ne.
However, when I check the tokens generated using the Azure analyze endpoint, I only get the minGram-length (2-character) tokens shown below. What is missing from this configuration?
{
  "@odata.context": "https://trialsearchresource.search.windows.net/$metadata#Microsoft.Azure.Search.V2021_04_30_Preview.AnalyzeResult",
  "tokens": [
    { "token": "re", "startOffset": 0, "endOffset": 2, "position": 1 },
    { "token": "ed", "startOffset": 1, "endOffset": 3, "position": 3 },
    { "token": "dw", "startOffset": 2, "endOffset": 4, "position": 5 },
    { "token": "wi", "startOffset": 3, "endOffset": 5, "position": 7 },
    { "token": "in", "startOffset": 4, "endOffset": 6, "position": 9 },
    { "token": "ne", "startOffset": 5, "endOffset": 7, "position": 11 }
  ]
}
P.S. I am using the Azure Search .NET Core SDK.
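For reference, the token set I expect can be enumerated locally with a few lines of Python. This is only my illustration of plain n-gram generation (the function name is mine, not part of any SDK), and the emission order may differ from what Lucene produces:

```python
def ngrams(text, min_gram, max_gram):
    # Enumerate every substring of length min_gram..max_gram,
    # which is what I expect an n-gram tokenizer to emit.
    grams = []
    for size in range(min_gram, max_gram + 1):
        for start in range(len(text) - size + 1):
            grams.append(text[start:start + size])
    return grams

print(ngrams("redwine", 2, 3))
# → ['re', 'ed', 'dw', 'wi', 'in', 'ne', 'red', 'edw', 'dwi', 'win', 'ine']
```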
You are using a token filter, but what you are trying to define, based on the expected result above, is a tokenizer with a minimum gram length of 2 and a maximum of 3. The predefined nGram tokenizer you reference emits grams of at most 2 characters by default, so the NGramTokenFilterV2 never receives any token long enough to produce 3-grams. The following definition should achieve what you are looking for:
"analyzers": [
{
"@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"name": "myCustomAnalyzer",
"tokenizer": "myTokenizer",
"charFilters": ["myCharMapping"]
}
],
"tokenizers": [
{
"name":"myTokenizer",
"@odata.type":"#Microsoft.Azure.Search.NGramTokenizer",
"minGram": 2,
"maxGram": 3
}
],
"charFilters": [
{
"@odata.type": "#Microsoft.Azure.Search.MappingCharFilter",
"name": "myCharMapping",
"mappings": [
"\\u0020=>"
]
}
]
Note that I have added a charFilter to remove spaces. Without it, the tokenizer would also include the whitespace in the grams, so for "red wine" you would get grams such as "d ", " w", "ed ", and " wi".
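The effect of that char filter can be sketched in Python. This is a rough local model of the pipeline (char filter first, then an n-gram tokenizer with minGram 2 and maxGram 3); the function names are mine and are not part of the Azure SDK:

```python
def strip_spaces(text):
    # Rough equivalent of the MappingCharFilter "\u0020=>";
    # char filters run before the tokenizer sees the text.
    return text.replace(" ", "")

def ngram_tokenize(text, min_gram=2, max_gram=3):
    # Rough equivalent of NGramTokenizer with minGram=2, maxGram=3:
    # every substring of length 2 or 3.
    return [text[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(text) - n + 1)]

without_filter = ngram_tokenize("red wine")                 # includes "d ", " w", "ed ", " wi"
with_filter = ngram_tokenize(strip_spaces("red wine"))      # includes "dw", "dwi", "win"
print(with_filter)
```

With the char filter in place, "red wine" is tokenized as "redwine", so grams spanning the word boundary ("dw", "dwi") are produced instead of grams containing a space.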