I'm trying to integrate Jaccard similarity search using MinHash, and I came across the MinHash Query Parser in Solr 8.11. The docs say:
The queries measure Jaccard similarity between the query string and MinHash fields
How do I implement it correctly?
As the docs say, I added a <fieldType> and a <field> like so:
<field name="min_hash_analysed" type="text_min_hash" multiValued="false" indexed="true" stored="false" />
<fieldType name="text_min_hash" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
<filter class="solr.ShingleFilterFactory" minShingleSize="5" outputUnigrams="false" outputUnigramsIfNoShingles="false" maxShingleSize="5" tokenSeparator=" "/>
<filter class="org.apache.lucene.analysis.minhash.MinHashFilterFactory" bucketCount="512" hashSetSize="1" hashCount="1"/>
</analyzer>
</fieldType>
I tried saving some text to the new min_hash_analysed field and then querying with very similar text, using the query given in the docs:
{!min_hash field="min_hash_analysed" sim="0.5" tp="0.5"}Very similar text to already saved document text
I was hoping to get back all documents with a similarity score higher than sim="0.5", but no matter what I try I get "numFound":0.
Surely I'm doing something wrong. How should I correctly integrate Solr's MinHash Query Parser?
According to the response, it seems you're sending {!min_hash field..} directly as a query parameter, not as a Solr query given by the q= parameter.
q={!min_hash ..}query text here
would be the correct syntax in the URL (with URL escaping applied as required).
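As a rough illustration of the escaping step, here is a minimal Python sketch that puts the local-params query into the q= parameter and URL-encodes it. The base URL, the core name mycore, and the helper name build_min_hash_url are assumptions made up for this example; only the field name and the sim/tp values come from the question.

```python
from urllib.parse import urlencode


def build_min_hash_url(base_url, text, field="min_hash_analysed", sim=0.5, tp=0.5):
    """Build a /select URL whose q= parameter carries the min_hash query.

    base_url is a hypothetical core URL, e.g. http://localhost:8983/solr/mycore
    """
    # Local params go in front of the query text, all inside the q parameter.
    local_params = f'{{!min_hash field="{field}" sim="{sim}" tp="{tp}"}}'
    # urlencode percent-escapes braces, quotes and '=' and turns spaces into '+'.
    query_string = urlencode({"q": local_params + text})
    return f"{base_url}/select?{query_string}"


# Hypothetical host and core name, used only for illustration.
url = build_min_hash_url(
    "http://localhost:8983/solr/mycore",
    "Very similar text to already saved document text",
)
print(url)
```

Most HTTP client libraries do this encoding for you when the query is passed as a parameter map rather than concatenated into the URL by hand.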