solr stemming fuzzy porter-stemmer

Do stemming and fuzzy search work together in Apache Solr?


I am using the Porter stem filter factory on a field that holds 3 to 4 words.

E.g.: "ABC BLOSSOM COMPANY"

I expect to fetch the above document when I search for ABC BLOSSOMING COMPANY as well.

When I query this:

name:ABC AND name:BLOSSOMING AND name:COMPANY

I get my result.

This is what the parsed query looks like:

+name:abc +name:blossom +name:compani (the stemmer works fine)

But when I add the fuzzy syntax and query like this:

name:ABC~1 AND name:BLOSSOMING~1 AND name:COMPANY~1

the search returns no documents, and the parsed query looks like this:

+name:abc~1 +name:blossoming~1 +name:company~2

This clearly shows that stemming is not happening. Kindly review and give feedback.


Solution

  • TL;DR
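    Stemming is not happening because you are using the Porter stem filter (PorterStemFilterFactory), which is not a MultiTermAwareComponent. A fuzzy term is a multiterm, and Solr only applies multiterm-aware components to such terms.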

    What To Do?
    Use one of the Filters/Normalizers that implements the MultiTermAwareComponent interface.

    Explanation
    You, like many others, have been caught by Solr's and Lucene's multiterm behaviour. There is a good article about this topic on the Solr wiki. Although the article is dated, it still holds true:

    One of the surprises for most Solr users is that wildcards queries haven't gone through any analysis. Practically, this means that wildcard (and prefix and range) queries are case sensitive, which is at odds with expectations. As of this SOLR-2438, SOLR-2918, and perhaps SOLR-2921, this behavior is changed.

    What's a multiterm you ask? Essentially it's any term that may "point to" more than one real term. For instance, run* could expand to runs, runner, running, runt, etc. Likewise, a range query is really a "multiterm" query as well. Before Solr 3.6, these were completely unprocessed, the application layer usually had to apply any transformations required, for instance lower-casing the input. Running these types of terms through a "normal" query analysis chain leads to all sorts of interesting behavior so was avoided.
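
To see why the fuzzy query from the question finds nothing, here is a minimal Python sketch (not Solr code) that compares the unstemmed fuzzy terms with the stemmed terms in the index. The names `edit_distance`, `fuzzy_matches`, and `INDEXED_TERMS` are made up for illustration, and the indexed terms are assumed from the parsed non-fuzzy query. It uses plain Levenshtein distance as a simplification; as far as I know Lucene's fuzzy matching also counts a transposition as one edit, which does not change the outcome for these terms.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance: inserts, deletes, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


# Assumption: the index holds the lowercased, Porter-stemmed terms
# for "ABC BLOSSOM COMPANY", as seen in the parsed non-fuzzy query.
INDEXED_TERMS = ["abc", "blossom", "compani"]


def fuzzy_matches(query_term, max_edits):
    """Return the indexed terms within max_edits of query_term."""
    return [t for t in INDEXED_TERMS
            if edit_distance(query_term, t) <= max_edits]


# Clauses from the parsed fuzzy query: lowercased but NOT stemmed.
clauses = [("abc", 1), ("blossoming", 1), ("company", 2)]

for term, max_edits in clauses:
    print(term, "->", fuzzy_matches(term, max_edits))

# abc -> ['abc']
# blossoming -> []
# company -> ['compani']
```

"blossoming" is three edits away from the indexed term "blossom", so the `blossoming~1` clause matches nothing, and because the clauses are joined with AND the whole query returns no documents.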