I am experimenting with Dense Vector Search in Solr 9.4, but the knn search returns unexpected dot product values.
Here is a basic example: the collection contains the vector [0.57735027, 0.57735027, 0.57735027] and I query it with [0.26726124, 0.53452248, 0.80178373].
The dot product should be 0.92582, but the returned score is 0.96291006.
The weirdest part is that when I use a streaming expression with dotProduct(array(0.57735027,0.57735027,0.57735027),array(0.26726124,0.53452248,0.80178373)), Solr returns the right value: 0.92582.
Any idea why there is such a difference, and how I can obtain the right dot product from the knn search?
Here is my docker-compose.yaml file:
version: '3'
services:
  solr:
    image: solr:9.4
    ports:
      - "8983:8983"
    volumes:
      - 'solr_data:/var/solr'
    command:
      - solr-precreate
      - documents
volumes:
  solr_data:
    driver: local
I add a single vector [0.57735026, 0.57735026, 0.57735026] (a unit vector):
# Create a 3D vector type
curl -X POST \
  'http://localhost:8983/api/cores/documents/schema' \
  --header 'Content-Type: application/json' \
  --data-raw '{
    "add-field-type": {
      "name": "3D-vector",
      "class": "solr.DenseVectorField",
      "vectorDimension": "3",
      "vectorEncoding": "FLOAT32",
      "similarityFunction": "dot_product"
    }
  }'
# Add a field "vector" in the collection
curl -X POST \
  'http://localhost:8983/api/cores/documents/schema' \
  --header 'Content-Type: application/json' \
  --data-raw '{
    "add-field": [
      {
        "name": "vector",
        "type": "3D-vector"
      }
    ]
  }'
# Add a single vector (normalized) into the collection "documents"
curl -X POST \
  'http://localhost:8983/api/cores/documents/update?commit=true' \
  --header 'Content-Type: application/json' \
  --data-raw '[
    {
      "vector": [
        0.57735027,
        0.57735027,
        0.57735027
      ]
    }
  ]'
Now I perform a knn search with the query vector [0.26726124, 0.53452248, 0.80178373].
The corresponding dot product should be 0.92582 (the same as the cosine similarity, since I use normalized vectors).
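As a quick sanity check on that expected value, here is a minimal Python sketch (plain Python, no Solr involved; the helper names `dot` and `norm` are mine) that confirms both vectors have unit length and computes their dot product:

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


doc_vector = [0.57735027, 0.57735027, 0.57735027]
query_vector = [0.26726124, 0.53452248, 0.80178373]

# Both norms are 1 up to rounding, so the dot product and the
# cosine similarity coincide for this pair of vectors.
print(norm(doc_vector), norm(query_vector))
print(dot(doc_vector, query_vector))  # approximately 0.92582
```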
I also request a computed field that uses the vectorSimilarity function query, in order to double-check the returned dot product value:
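The request body can be read back from the "params" that Solr echoes in the response. The following is only a sketch that rebuilds that body in Python: the function names are hypothetical, and the `/solr/documents/select` URL is my assumption (it is not shown in the post), chosen to match the core name used in the other commands.

```python
import json
import urllib.request


def build_knn_request(query_vector, field="vector", top_k=10):
    """Build a JSON request body matching the params echoed in the response."""
    vec = "[" + ", ".join(repr(x) for x in query_vector) + "]"
    return {
        "fields": [
            field,
            "score",
            "vectorSimilarity(FLOAT32, DOT_PRODUCT, " + field + ", " + vec + ")",
        ],
        "query": "{!knn f=" + field + " topK=" + str(top_k) + "}" + vec,
    }


def run_query(body, url="http://localhost:8983/solr/documents/select"):
    """POST the body to Solr. The URL is an assumption, not taken from the post."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


body = build_knn_request([0.26726124, 0.53452248, 0.80178373])
print(json.dumps(body, indent=2))
# With the Docker container running, one could then call: run_query(body)
```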
Response :
{
  "responseHeader": {
    "status": 0,
    "QTime": 1,
    "params": {
      "json": "{\n \"fields\": [\n \"vector\",\n \"score\",\n \"vectorSimilarity(FLOAT32, DOT_PRODUCT, vector, [0.26726124, 0.53452248, 0.80178373])\"\n ],\n \"query\": \"{!knn f=vector topK=10}[0.26726124, 0.53452248, 0.80178373]\"\n}"
    }
  },
  "response": {
    "numFound": 1,
    "start": 0,
    "maxScore": 0.96291006,
    "numFoundExact": true,
    "docs": [
      {
        "vector": [
          0.57735026,
          0.57735026,
          0.57735026
        ],
        "score": 0.96291006,
        "vectorSimilarity(FLOAT32, DOT_PRODUCT, vector, [0.26726124, 0.53452248, 0.80178373])": 0.96291006
      }
    ]
  }
}
As we can see, the returned value for the dot product is 0.96291006, which is significantly different from 0.92582.
The weirdest thing is that if I use the streaming expression endpoint with the expression dotProduct(array(0.57735027,0.57735027,0.57735027),array(0.26726124,0.53452248,0.80178373)), Solr computes the right dot product:
curl -X GET \
  'http://localhost:8983/solr/documents/stream?expr=dotProduct(array(0.57735027%2C0.57735027%2C0.57735027)%2Carray(0.26726124%2C0.53452248%2C0.80178373))' \
  --header 'Content-Type: application/json'
Response :
{
  "result-set": {
    "docs": [
      {
        "return-value": 0.9258201002207116
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 15
      }
    ]
  }
}
I finally understood why the scores seem incorrect, thanks to this issue.
It appears that Solr computes a normalized cosine similarity, (1 + cosine_sim) / 2, which explains the gap between the value I computed and the one returned by the knn search.
To get back the cosine similarity, apply the formula 2 * normalized_cosine_sim - 1.
For the example in my question, 2 * 0.96291006 - 1 gives 0.92582.
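The two formulas can be checked against the numbers from the question with a small Python sketch. The function names are hypothetical helpers of mine, not part of any Solr API:

```python
def solr_score_from_similarity(similarity):
    """Map a raw similarity in [-1, 1] to a [0, 1] score: (1 + s) / 2."""
    return (1.0 + similarity) / 2.0


def similarity_from_solr_score(score):
    """Invert the mapping above: 2 * score - 1."""
    return 2.0 * score - 1.0


raw = 0.9258201002207116  # value returned by the dotProduct streaming expression
score = 0.96291006        # value returned by the knn search

# The two printed values agree with `score` and `raw` up to float32 rounding.
print(solr_score_from_similarity(raw))
print(similarity_from_solr_score(score))
```

Applying `similarity_from_solr_score` to the `score` field of each returned document would recover the raw value on the client side.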