solr similarity minhash

How to use Solr MinHashQParser


I'm currently trying to integrate Jaccard similarity search using MinHash, and I stumbled upon Solr 8.11's MinHash Query Parser. The docs say:

The queries measure Jaccard similarity between the query string and MinHash fields

How to correctly implement it?

As the docs say, I added a <fieldType> and a <field> like so:

<field name="min_hash_analysed" type="text_min_hash" multiValued="false" indexed="true" stored="false" />

<fieldType name="text_min_hash" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
        <filter class="solr.ShingleFilterFactory" minShingleSize="5" outputUnigrams="false" outputUnigramsIfNoShingles="false" maxShingleSize="5" tokenSeparator=" "/>
        <filter class="org.apache.lucene.analysis.minhash.MinHashFilterFactory" bucketCount="512" hashSetSize="1" hashCount="1"/>
    </analyzer>
</fieldType>

I tried saving some text to the new min_hash_analysed field and then querying with very similar text, using the query given in the docs:

{!min_hash field="min_hash_analysed" sim="0.5" tp="0.5"}Very similar text to already saved document text

I was hoping to get back all documents with a similarity score higher than sim="0.5", but no matter what I try, I get "numFound":0.

Solr query result

Surely I'm doing something wrong. How should I correctly integrate Solr's MinHash Query Parser?


Solution

  • According to the response, it seems you're sending {!min_hash field..} directly as its own query parameter, not as a Solr query given by the q= parameter.

    q={!min_hash ..}query text here 
    

    That would be the correct syntax in the URL (apply URL escaping as required).
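To make the URL-escaping step concrete, here is a minimal Python sketch that puts the whole {!min_hash ...} expression into the q= parameter and lets the standard library escape it. The host, port, and the core name "mycore" are assumptions for illustration and do not come from the question; the field name and the sim/tp values are copied from it.

```python
from urllib.parse import urlencode

# Assumed values, not from the original post: adjust to your own setup.
SOLR_BASE = "http://localhost:8983/solr"
CORE = "mycore"  # hypothetical core name


def build_min_hash_url(text, field="min_hash_analysed", sim="0.5", tp="0.5"):
    """Return a /select URL whose q= parameter carries the min_hash local params."""
    query = '{!min_hash field="%s" sim="%s" tp="%s"}%s' % (field, sim, tp, text)
    # urlencode percent-escapes the braces, '!', '=' and quotes; spaces become '+'.
    return "%s/%s/select?%s" % (SOLR_BASE, CORE, urlencode({"q": query}))


url = build_min_hash_url("Very similar text to already saved document text")
print(url)
```

The resulting URL can be pasted into a browser or passed to any HTTP client. The important part is that the local-params block and the query text travel together as the value of q.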