elasticsearch opensearch cosine-similarity amazon-opensearch

How to query OpenSearch with the built-in knn_score script?


Based on the documentation at https://opensearch.org/docs/latest/search-plugins/knn/knn-score-script, I created a k-NN index in OpenSearch (sample code below):

PUT /test-index
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "my_vector1": {
        "type": "knn_vector",
        "dimension": 1024
      }
    }
  }
}

Then I indexed some data. The documentation also says that a script like the one below can be used to return the distance between the query vector and the indexed vectors:

GET my-knn-index-1/_search
{
 "size": 4,
 "query": {
   "script_score": {
     "query": {
       "match_all": {}
     },
     "script": {
       "source": "knn_score",
       "lang": "knn",
       "params": {
         "field": "my_vector2",
         "query_value": [2.0, 3.0, 5.0, 6.0],
         "space_type": "cosinesimil"
       }
     }
   }
 }
}

The response to the above query looks like this. Is the distance between the vectors returned as _score, or am I not doing this right?

{ ... 
  hits: {
   ...
   'max_score': 1.12,
    'hits': [{
       '_index': 'my-knn-index-1',
       ...
       '_score': 1.12, 
       '_source': {
         ....
       }
     }
   ]
  }
}

Solution

  • Great start!!

    Actually, the raw distance/similarity value cannot be used as the score directly, because a score cannot be negative and cosine similarity ranges from -1 to +1. To deal with this, the value is transformed into something that is valid as a score.

    In your case, the score is computed from the cosinesimil value using a simple formula: score = 2 - d, where d is the cosine distance between your query vector and each indexed vector, computed as d = 1 - cos(θ), with θ the angle between the two vectors. Why 2, you might ask? Because cosine ranges from -1 to +1, the distance d ranges from 0 to 2. So that the highest score corresponds to the closest vector, the score inverts the distance by subtracting it from 2, which makes scores range from 0 (worst match) to 2 (best match). Equivalently, score = 1 + cos(θ).
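As a rough illustration of that formula, here is a minimal Python sketch that recomputes the score on the client side. The function name cosinesimil_script_score is hypothetical and not part of OpenSearch; it only mirrors score = 2 - d with d = 1 - cos, and rejecting zero vectors is a choice made for this sketch.

```python
import math


def cosinesimil_script_score(query, doc):
    """Hypothetical helper: score = 2 - d, where d = 1 - cosine similarity."""
    if len(query) != len(doc):
        raise ValueError("vectors must have the same dimension")
    dot = sum(q * x for q, x in zip(query, doc))
    norm_q = math.sqrt(sum(q * q for q in query))
    norm_d = math.sqrt(sum(x * x for x in doc))
    if norm_q == 0 or norm_d == 0:
        # Cosine is undefined for a zero vector; this sketch simply rejects it.
        raise ValueError("cosine similarity is undefined for a zero vector")
    cosine = dot / (norm_q * norm_d)
    distance = 1.0 - cosine  # ranges from 0 (same direction) to 2 (opposite)
    return 2.0 - distance    # ranges from 2 (best match) to 0 (worst match)


print(cosinesimil_script_score([1.0, 0.0], [1.0, 0.0]))   # 2.0, same direction
print(cosinesimil_script_score([1.0, 0.0], [0.0, 1.0]))   # 1.0, orthogonal
print(cosinesimil_script_score([1.0, 0.0], [-1.0, 0.0]))  # 0.0, opposite
```

A score of 1.0 is therefore the midpoint: anything above it means the vectors point in broadly the same direction, anything below it means they point away from each other.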

    Since the score is 1.12, the distance is d = 2 - 1.12 = 0.88, so the cosine similarity is 1 - 0.88 = 0.12. That corresponds to an angle of ~83° between the query vector and the first hit, so the two are nearly orthogonal rather than close.
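To read a cosine similarity and an angle back out of a _score, the formula can be inverted. The following is a minimal Python sketch, assuming the score was produced by the knn_score script with space_type cosinesimil (score = 2 - d, d = 1 - cos); both function names are hypothetical.

```python
import math


def cosine_from_script_score(score):
    """Hypothetical helper: invert score = 2 - d, d = 1 - cos."""
    if not 0.0 <= score <= 2.0:
        raise ValueError("a cosinesimil script score must lie in [0, 2]")
    distance = 2.0 - score
    return 1.0 - distance


def angle_degrees(cosine):
    """Angle between the two vectors, in degrees."""
    # Clamp to guard against tiny floating-point overshoot before acos.
    cosine = max(-1.0, min(1.0, cosine))
    return math.degrees(math.acos(cosine))


cos = cosine_from_script_score(1.12)
print(round(cos, 2), round(angle_degrees(cos), 1))  # 0.12 83.1
```

For the score of 1.12 from the question, this gives a distance of 0.88, a cosine similarity of about 0.12, and an angle of roughly 83°.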