solr lucene solr4

How to get an "ends with" search in Solr 4.8.1?


I have a document, indexed on Solr, which contains this field:

{
    "manufacturerSkuEndsWith": [
        "DU351118DR0"
    ]
}

My goal is to get an "ends with" search on the manufacturerSkuEndsWith field. For example, the following queries should match the value above: DR0, 8DR0, 18DR0, 118DR0... but these queries should NOT match: DU35, 118DR, 118...

My problem is that the query 118 matches that document, even though DU351118DR0 does not end with 118.

I am using Solr and Lucene 4.8.1. I found out that in this version side="back" is no longer supported for the EdgeNGramTokenizer: LUCENE-3907. That thread suggests using a ReverseStringFilter to get behaviour similar to an EdgeNGramTokenizer with side="back", so this is how I configured the manufacturerSkuEndsWith field in my schema.xml:

<field indexed="true" multiValued="true" name="manufacturerSkuEndsWith" stored="true" type="smccTextReversedNGram"/>

<copyField dest="manufacturerSkuEndsWith" source="ManufacturerSku"/>

<fieldType class="solr.TextField" name="smccTextReversedNGram" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.NGramTokenizerFactory" maxGramSize="10" minGramSize="3"/>
        <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ReverseStringFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ReverseStringFilterFactory"/>
    </analyzer>
</fieldType>

However, this configuration does not perform an "ends with" search:

screenshot from the Solr analysis tool

How can I get that type of search, instead?


Solution

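  • You're using the NGramTokenizer and not the EdgeNGramFilter as shown in the examples. The NGramTokenizer generates tokens from inside the string as well, not just from the edge. That is why 118 matches: it is one of the interior grams of DU351118DR0, so it is indexed (reversed, as 811) alongside the grams that actually end the value.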

    To get the behavior you're looking for, use a KeywordTokenizer (which keeps the input as a single token), then a ReverseStringFilter to reverse it, and only then an EdgeNGramFilter to generate grams from the start of the now reversed string:

    foo -> oof -> o, oo, oof
    

    You can then either run these tokens through the ReverseStringFilter again at index time to get the "correct" (unreversed) suffixes indexed:

    -> o, oo, foo
    

    ... or you can do as you've done in your field type, and reverse the query string at query time instead, so that it is compared against the reversed tokens in the index:

    foo -> oof -> matches the oof token
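
As a rough sketch of the chain described above, the field type could look something like the following. The name smccTextEndsWith and the gram sizes are assumptions chosen for illustration, not values from the question, and the synonym and stop word filters are left out for simplicity; adjust all of these to your own data.

```xml
<!-- Hypothetical field type name; minGramSize/maxGramSize are assumed values. -->
<fieldType class="solr.TextField" name="smccTextEndsWith" positionIncrementGap="100">
    <analyzer type="index">
        <!-- Keep the whole value as a single token -->
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- Reverse the token so the end of the value becomes its leading edge -->
        <filter class="solr.ReverseStringFilterFactory"/>
        <!-- Emit prefixes of the reversed token, i.e. suffixes of the original value -->
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="25"/>
    </analyzer>
    <analyzer type="query">
        <!-- Reverse the whole query so it lines up with the reversed grams in the index -->
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ReverseStringFilterFactory"/>
    </analyzer>
</fieldType>
```

With this chain, DU351118DR0 should be lowercased to du351118dr0, reversed to 0rd811153ud, and indexed as 0rd, 0rd8, 0rd81, 0rd811 and so on. A query for 8DR0 becomes 0rd8 and matches, while a query for 118 becomes 811, which is not among the indexed tokens. Note that queries longer than maxGramSize would not match, so that value should be at least the length of the longest SKU you expect to search for in full.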