I have two questions:
1. How can I change the distance metric used by similarity_search? By default, the function similarity_search uses Euclidean distance and I want e.g. cosine. How could I do that?

from langchain_openai import OpenAIEmbeddings
from eurelis_langchain_solr_vectorstore import Solr
embeddings_model = OpenAIEmbeddings(model="bge-small-en")
vector_store = Solr(embeddings_model, core_kwargs={
'page_content_field': 'content', # field containing the text content
'vector_field': 'content_vec', # field containing the embeddings of the text content
'core_name': 'default', # core name
'url_base': 'http://localhost:8983/solr' # base url to access solr
})
# here I want to use cosine distance metric
vector_store.similarity_search("relevant question", k=5)

2. How can I change the distance metric when using as_retriever?

# here I want to use cosine distance metric
retriever = vector_store.as_retriever(search_kwargs={'k': 5})
1-2. You can't do it that way. The distance function is a parameter you define in the vector database, that is, in Solr (in the field type definition of content_vec, see the example below). As with any other field, it is not meant to change once the vector field is in use (i.e. indexed).
Also, OpenAI embeddings are normalized to unit length, which means that (cf. the OpenAI embeddings FAQ):
- Cosine similarity and Euclidean distance will result in identical rankings
- Cosine similarity can be computed slightly faster using just a dot product
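The first point follows from expanding the squared Euclidean distance. A short derivation, using a and b as illustrative names for two unit-length embeddings:

```latex
\[
\|a - b\|^2 = \|a\|^2 + \|b\|^2 - 2\, a \cdot b = 2 - 2\, a \cdot b = 2 - 2\cos(a, b)
\qquad \text{when } \|a\| = \|b\| = 1 .
\]
```

A smaller distance therefore always corresponds to a larger cosine similarity, so sorting by either one yields the same order.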
The Solr documentation also states that the preferred way to perform cosine similarity is to normalize all vectors to unit length and use dot_product as the similarity function rather than cosine (see DenseVectorField).
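If your embedding model does not already return unit-length vectors, you would need to normalize them yourself before indexing and before querying. The following is only a minimal sketch in plain Python: the helper names (normalize, dot, cosine) and the toy vectors are made up for illustration and are not part of Solr or LangChain.

```python
import math


def normalize(vec):
    """Scale vec to unit (L2) length; a zero vector cannot be normalized."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed the long way, for comparison."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy 3-dimensional "embeddings" (made-up values, not real model output).
query = [1.0, 2.0, 2.0]
doc = [3.0, 0.0, 4.0]

# After normalization, a plain dot product reproduces the cosine similarity.
score = dot(normalize(query), normalize(doc))
assert math.isclose(score, cosine(query, doc))
```

With vectors prepared this way, a field configured with dot_product ranks documents exactly as cosine would.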
So, for example, in the Solr schema.xml you would have the following:
<fieldType name="knn_vector" class="solr.DenseVectorField" vectorDimension="1536" similarityFunction="dot_product"/>
<field name="content_vec" type="knn_vector" indexed="true" stored="true"/>
Note that the vectorDimension parameter has to match the number of dimensions of your embedding model (e.g. 1536 is the default for text-embedding-3-small, 3072 for text-embedding-3-large, etc.).
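One way to catch a mismatch early is to embed a probe string and compare its length with the value in the schema. This is a rough sketch: check_vector_dimension and fake_embed_query are hypothetical names introduced here, and the fake embedder only stands in for a real model so the snippet runs offline.

```python
from typing import Callable, List


def check_vector_dimension(
    embed_query: Callable[[str], List[float]], expected: int
) -> int:
    """Embed a probe string and verify its length matches the schema value."""
    actual = len(embed_query("dimension probe"))
    if actual != expected:
        raise ValueError(
            f"embedding has {actual} dimensions, "
            f"but the Solr field expects {expected}"
        )
    return actual


# Stand-in embedder (hypothetical) so the sketch runs without network access.
def fake_embed_query(text: str) -> List[float]:
    return [0.0] * 1536


check_vector_dimension(fake_embed_query, expected=1536)
```

With the setup from the question, you would pass embeddings_model.embed_query instead of the stand-in, and the value of vectorDimension from your schema as expected.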