solr lucene search-engine information-retrieval

How to get the lexical search score and the vector search score in a hybrid search on Apache Solr?


I was able to implement a hybrid search engine on Apache Solr 9.6.1 that combines lexical search (edismax) and vector search (KNN-based embeddings) within a single request. The idea is simple:

  1. Lexical Search retrieves results based on text relevance.
  2. Vector Search retrieves results based on semantic similarity.
  3. Hybrid Scoring sums both scores, where a missing score (if a document appears in only one search) should be treated as zero.

This approach is working, but I am unable to properly return individual score components of lexical search (score1) vs. vector search (score2 from cosine similarity). Right now, Solr only returns the final combined score, but there is no clear way to see how much of that score comes from lexical search vs. vector search. The following is a code snippet:

import numpy as np
import requests

# embed, SOLR_URL and HEADERS are defined elsewhere.
def hybrid_search(query, top_k=10):
    # Embed the query and convert it to a plain list of Python floats for the knn parser.
    embedding = np.array(embed([query]), dtype=np.float32)
    embedding = embedding[0].tolist()
    # Lexical (edismax) query over the "text" field.
    lxq = rf"""{{!type=edismax
                qf='text'
                q.op=OR
                tie=0.1
                bq=''
                bf=''
                boost=''
            }}({query})"""
    solr_query = {"params": {
        "q": "{!bool filter=$retrievalStage must=$rankingStage}",
        # Final score: scaled lexical score plus vector score.
        "rankingStage": "{!func}sum(query($normalisedLexicalQuery),query($vectorQuery))",
        "retrievalStage": "{!bool should=$lexicalQuery should=$vectorQuery}",  # Union
        "normalisedLexicalQuery": "{!func}scale(query($lexicalQuery),0,1)",
        "lexicalQuery": lxq,
        "vectorQuery": f"{{!knn f=all_v512 topK={top_k}}}{embedding}",
        "fl": "text, score",
        "rows": top_k,
        "fq": [""],
        "rq": "{!rerank reRankQuery=$rqq reRankDocs=100 reRankWeight=3}",
        # $cutoff is not set in this snippet; it has to be passed as a request parameter.
        "rqq": "{!frange l=$cutoff}query($rankingStage)",
        "sort": "score desc",
    }}
    response = requests.post(SOLR_URL, headers=HEADERS, json=solr_query)
    return response.json()

Before the retrieval stage, I scale the keyword-search scores to lie between 0 and 1 so that they are comparable to the vector-search scores, since keyword-search scores are unbounded. I then take the union of both retrievals and add the scores. For example, if post x is retrieved by both the keyword search and the vector search with scores 0.6 and 0.5 respectively, its final score is 0.6 + 0.5 = 1.1. If a post is retrieved by only one of the two searches, the missing score counts as 0. My question is: how can I retrieve both scores separately (not just the final score) in a hybrid search without sending two different requests to Solr? The hybrid approach sends only one request for both retrievals.
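
To make the arithmetic concrete, here is a small pure-Python sketch of the scoring I have in mind. It is only an illustration of the idea, not how Solr implements scale() or sum() internally. The helper names min_max_scale and combine are made up for this example, and mapping all-equal scores to 0 is an assumption of the sketch.

```python
def min_max_scale(scores):
    """Rescale a {doc_id: score} mapping linearly into the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Assumption of this sketch: identical scores all map to 0.
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def combine(lexical, vector):
    """Sum both scores over the union of documents; a missing score counts as 0."""
    docs = set(lexical) | set(vector)
    return {doc: lexical.get(doc, 0.0) + vector.get(doc, 0.0) for doc in docs}


# Post "x" is found by both searches, "y" only lexically, "z" only by the vector search.
lexical_scores = {"x": 0.6, "y": 0.3}
vector_scores = {"x": 0.5, "z": 0.8}
final_scores = combine(lexical_scores, vector_scores)
```

Here final_scores holds the single combined value per post, which is all I currently get back from Solr; lexical_scores and vector_scores are the two components I would like to see in the response.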

I have tried to include "score" in the "fl" field, but it only shows the final score (I assume).


Solution

  • The function "query" returns the score that the given query assigns to each document.

    The scaled lexical score is already defined in the request parameter named "normalisedLexicalQuery". To return it, reference the parameter name with a dollar sign "$" in the fl field.

    So you can use "fl":"lexical_score:$normalisedLexicalQuery,vector_score:query($vectorQuery)"
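
    A minimal sketch of how this could be wired into the Python code from the question, assuming the request body has the same {"params": {...}} shape and that Solr answers with its default JSON layout (documents under response.docs). The helper names add_score_fields and score_breakdown are made up for this example, and falling back to 0.0 when a field is absent is an assumption.

```python
import copy

LEXICAL_ALIAS = "lexical_score"
VECTOR_ALIAS = "vector_score"


def add_score_fields(solr_query):
    """Return a copy of the request body whose fl also asks for both score components."""
    body = copy.deepcopy(solr_query)
    params = body["params"]
    extra = (
        f"{LEXICAL_ALIAS}:$normalisedLexicalQuery,"
        f"{VECTOR_ALIAS}:query($vectorQuery)"
    )
    existing = params.get("fl", "")
    # Keep whatever fields were already requested and append the two aliases.
    params["fl"] = f"{existing},{extra}" if existing else extra
    return body


def score_breakdown(response_json):
    """Collect the score components of every returned document."""
    rows = []
    for doc in response_json["response"]["docs"]:
        rows.append({
            "text": doc.get("text"),
            # Assumption: treat an absent component as 0.0.
            "lexical": doc.get(LEXICAL_ALIAS, 0.0),
            "vector": doc.get(VECTOR_ALIAS, 0.0),
            "combined": doc.get("score"),
        })
    return rows
```

    In hybrid_search you would post add_score_fields(solr_query) instead of solr_query and pass the parsed JSON response to score_breakdown.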