I have an index in Azure AI Search that contains one English term per document (e.g. "white wine", "grapes", "chocolate cake", ...). The index has a vector field, and indexing ran without problems for 100k documents.
My use case is to find the term closest to one entered by the user and give the match a score (0-100%). When I run the following query in Search Explorer in the Azure portal against my index:
{
"search": "Winery products",
"count": true,
"vectorQueries": [
{
"kind": "text",
"text": "Winery products",
"fields": "vectorTextEnglish"
}
]
}
I get the right results. Note that the top score is 0.031:
{
"@odata.context": "https://me.search.windows.net/indexes('myindex')/$metadata#docs(*)",
"@odata.count": 75,
"@search.nextPageParameters": {
"select": "chunk_id,Term,MyReference,parent_id",
"count": true,
"skip": 50,
"vectorQueries": [
{
"kind": "text",
"k": null,
"oversampling": null,
"fields": "vectorTextEnglish",
"vector": [],
"text": "Winery products",
"url": null,
"base64Image": null,
"exhaustive": null,
"weight": null,
"filterOverride": null,
"threshold": null
}
]
},
"value": [
{
"@search.score": 0.0317540317773819,
"chunk_id": "xxxx",
"Term": "Alcoholic wines",
"MyReference": "00123",
"parent_id": "yyyyy"
},
{
"@search.score": 0.03159204125404358,
...
},
However, if I search for a random string like asdfjiwefowfwe, I get a very similar top score of 0.030:
{
"@odata.context": "https://me.search.windows.net/indexes('myindex')/$metadata#docs(*)",
"@odata.count": 93,
"@search.nextPageParameters": {
"select": "chunk_id,Term,MyReference,parent_id",
"count": true,
"skip": 50,
"vectorQueries": [
{
"kind": "text",
"k": null,
"oversampling": null,
"fields": "vectorTextEnglish",
"vector": [],
"text": "asdfjiwefowfwe",
"url": null,
"base64Image": null,
"exhaustive": null,
"weight": null,
"filterOverride": null,
"threshold": null
}
]
},
"value": [
{
"@search.score": 0.03083491325378418,
"chunk_id": "xxxxxx",
"Term": "Ash",
"MyReference": "00422",
"parent_id": "yyyyy"
},
{
"@search.score": 0.029877368360757828,
...
},
I would like to normalize the match score to a 0-100 range, but I don't understand how a random string gets practically the same score as a good match. Can anyone help me understand this and guide me on how to give a higher score to a good match and 0 to random strings?
I tried setting thresholds, but since the scores are so close to each other, it is impossible. I also tried semantic ranking, but it is even more confusing: these random strings get a reranking score of 1.8 while a perfect match gets perhaps 2.4.
Because your request combines a "search" text with a vectorQueries entry, it runs as a hybrid query, and the @search.score you see is a Reciprocal Rank Fusion (RRF) score. RRF depends only on each document's rank in the keyword and vector result lists, not on how similar the document actually is, which is why a random string ends up in the same 0.03 range as a good match.
To see the real similarity, use the parameter "debug": "all"
in your request. The response will then include a new property like "vectorSimilarity": "0.998"
that goes from 0 to 1. In most cases you can then ignore the keyword score, since vector search is very accurate. Semantic ranking is overkill for most use cases.
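For example, reusing the first query from the question (a minimal sketch; whether the debug parameter is accepted and the exact shape of the debug info in the response depend on the API version your service uses):
{
"search": "Winery products",
"count": true,
"debug": "all",
"vectorQueries": [
{
"kind": "text",
"text": "Winery products",
"fields": "vectorTextEnglish"
}
]
}
Each returned document should then carry the vector similarity alongside @search.score. Multiply it by 100 to get the 0-100 scale you want, and pick the cutoff for rejecting random strings by running a few known-bad queries against your own data.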