solr, solrcloud, stemming

Solr does not provide existing result


I hope you can help me, because this problem is driving me crazy.

To keep it simple: I have documents with a field named name_text_de_de, which contains values such as the following:

name_text_de_de
Industrie-Reiniger
Katalysator-Reiniger
Flächenreiniger
UNIVERSALREINIGER
FELGENREINIGER-GEL

This is not the full list, only a sample. If I use the query q=name_text_de_de:*reinig I get the results above, but I get NO results with the query q=name_text_de_de:*reiniger, which does not make sense to me at all.

What could be the problem here?

Thanks in advance,

Fide

        <fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <tokenizer class="solr.StandardTokenizerFactory" />
                <!-- <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="lang/dictionary_de_de.txt" /> -->
                <filter class="solr.ManagedStopFilterFactory" managed="de" />
                <!-- <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de_de.txt" /> -->
                <filter class="solr.LowerCaseFilterFactory" />
                <!-- <filter class="solr.KeywordRepeatFilterFactory" /> -->
                <filter class="solr.KeywordMarkerFilterFactory" protected="lang/protwords_de_de.txt" />
                <!-- <filter class="solr.SnowballPorterFilterFactory" language="German" /> -->
                <!-- <filter class="solr.SnowballPorterFilterFactory" language="German2" /> -->
                <!-- <filter class="solr.GermanStemFilterFactory" /> -->
                <!-- <filter class="solr.GermanLightStemFilterFactory" /> -->
                <filter class="solr.GermanMinimalStemFilterFactory" />
                <!-- <filter class="solr.GermanNormalizationFilterFactory" /> -->
                <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
            </analyzer>
            <analyzer type="query">
                <tokenizer class="solr.StandardTokenizerFactory" />
                <!-- <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="lang/dictionary_de_de.txt" /> -->
                <filter class="solr.ManagedSynonymGraphFilterFactory" managed="de_de" />
                <filter class="solr.ManagedStopFilterFactory" managed="de_de" />
                <!-- <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de_de.txt" /> -->
                <filter class="solr.LowerCaseFilterFactory" />
                <!-- <filter class="solr.KeywordRepeatFilterFactory" /> -->
                <filter class="solr.KeywordMarkerFilterFactory" protected="lang/protwords_de_de.txt" />
                <!-- <filter class="solr.SnowballPorterFilterFactory" language="German" /> -->
                <!-- <filter class="solr.SnowballPorterFilterFactory" language="German2" /> -->
                <!-- <filter class="solr.GermanStemFilterFactory" /> -->
                <!-- <filter class="solr.GermanLightStemFilterFactory" /> -->
                <filter class="solr.GermanMinimalStemFilterFactory" />
                <!-- <filter class="solr.GermanNormalizationFilterFactory" /> -->
                <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
            </analyzer>
        </fieldType>

        <fieldType name="text_de_de" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <tokenizer class="solr.StandardTokenizerFactory" />
                <!-- <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="lang/dictionary_de_de.txt" /> -->
                <filter class="solr.ManagedStopFilterFactory" managed="de_de" />
                <!-- <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de_de.txt" /> -->
                <filter class="solr.LowerCaseFilterFactory" />
                <!-- <filter class="solr.KeywordRepeatFilterFactory" /> -->
                <filter class="solr.KeywordMarkerFilterFactory" protected="lang/protwords_de_de.txt" />
                <!-- <filter class="solr.SnowballPorterFilterFactory" language="German" /> -->
                <!-- <filter class="solr.SnowballPorterFilterFactory" language="German2" /> -->
                <!-- <filter class="solr.GermanStemFilterFactory" /> -->
                <!-- <filter class="solr.GermanLightStemFilterFactory" /> -->
                <filter class="solr.GermanMinimalStemFilterFactory" />
                <!-- <filter class="solr.GermanNormalizationFilterFactory" /> -->
                <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
            </analyzer>
            <analyzer type="query">
                <tokenizer class="solr.StandardTokenizerFactory" />
                <!-- <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="lang/dictionary_de_de.txt" /> -->
                <filter class="solr.ManagedSynonymGraphFilterFactory" managed="de_de" />
                <filter class="solr.ManagedStopFilterFactory" managed="de_de" />
                <!-- <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de_de.txt" /> -->
                <filter class="solr.LowerCaseFilterFactory" />
                <!-- <filter class="solr.KeywordRepeatFilterFactory" /> -->
                <filter class="solr.KeywordMarkerFilterFactory" protected="lang/protwords_de_de.txt" />
                <!-- <filter class="solr.SnowballPorterFilterFactory" language="German" /> -->
                <!-- <filter class="solr.SnowballPorterFilterFactory" language="German2" /> -->
                <!-- <filter class="solr.GermanStemFilterFactory" /> -->
                <!-- <filter class="solr.GermanLightStemFilterFactory" /> -->
                <filter class="solr.GermanMinimalStemFilterFactory" />
                <!-- <filter class="solr.GermanNormalizationFilterFactory" /> -->
                <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
            </analyzer>
        </fieldType>

        <fieldType name="text_spell_de" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <tokenizer class="solr.StandardTokenizerFactory" />
                <filter class="solr.ManagedStopFilterFactory" managed="de_de" />
                <!-- <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de_de.txt" /> -->
                <filter class="solr.LowerCaseFilterFactory" />
                <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
            </analyzer>
            <analyzer type="query">
                <tokenizer class="solr.StandardTokenizerFactory" />
                <filter class="solr.ManagedSynonymGraphFilterFactory" managed="de_de" />
                <filter class="solr.ManagedStopFilterFactory" managed="de_de" />
                <!-- <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de_de.txt" /> -->
                <filter class="solr.LowerCaseFilterFactory" />
                <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
            </analyzer>
        </fieldType>

        <fieldType name="text_spell_de_de" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <tokenizer class="solr.StandardTokenizerFactory" />
                <filter class="solr.ManagedStopFilterFactory" managed="de_de" />
                <!-- <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de_de.txt" /> -->
                <filter class="solr.LowerCaseFilterFactory" />
                <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
            </analyzer>
            <analyzer type="query">
                <tokenizer class="solr.StandardTokenizerFactory" />
                <filter class="solr.ManagedSynonymGraphFilterFactory" managed="de_de" />
                <filter class="solr.ManagedStopFilterFactory" managed="de_de" />
                <!-- <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de_de.txt" /> -->
                <filter class="solr.LowerCaseFilterFactory" />
                <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
            </analyzer>
        </fieldType>

Solution

  • The problem is that wildcard queries are not processed through the analysis chain, so your query term is not stemmed the way the indexed text was.

    For example, the token reiniger is truncated to reinig by the stem filter at index time, so the unstemmed query *reiniger cannot match: there is no token ending in "reiniger" in the index. The query *reinig, on the other hand, matches because the stemmed tokens do end in "reinig".

     Input stream            |  Indexed tokens
    -------------------------|--------------------------
     "Industrie-Reiniger"    |  "industri", "reinig"
     "Katalysator-Reiniger"  |  "katalysato", "reinig"
     "Flächenreiniger"       |  "flachenreinig"
     "UNIVERSALREINIGER"     |  "universalreinig"
     "FELGENREINIGER-GEL"    |  "felgenreinig", "gel"
    

    To make wildcard queries and fuzzy search work properly with stemmers (and other filters that may truncate tokens), you need to add the KeywordRepeatFilterFactory before the stemmer in the analysis chain:

    Emits each token twice, one with the KEYWORD attribute and once without.

    If placed before a stemmer, the result will be that you will get the unstemmed token preserved on the same position as the stemmed one. Queries matching the original exact term will get a better score while still maintaining the recall benefit of stemming. Another advantage of keeping the original token is that wildcard truncation will work as expected.
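
    As a minimal sketch of that change, assuming you apply it to the index analyzer of the text_de_de field type from the question, the chain could look like the fragment below. It only enables the KeywordRepeatFilterFactory line that is already present but commented out in your schema, and leaves the other active filters in their original order. The XML comments are explanatory additions, not part of the original configuration.

    ```xml
    <!-- Sketch: index analyzer of text_de_de with KeywordRepeatFilterFactory enabled -->
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.ManagedStopFilterFactory" managed="de_de" />
        <filter class="solr.LowerCaseFilterFactory" />
        <!-- Emits every token twice: one copy flagged as keyword, one not -->
        <filter class="solr.KeywordRepeatFilterFactory" />
        <filter class="solr.KeywordMarkerFilterFactory" protected="lang/protwords_de_de.txt" />
        <!-- Stems only the copy that is not flagged as keyword -->
        <filter class="solr.GermanMinimalStemFilterFactory" />
        <!-- Drops the second copy when stemming left the token unchanged -->
        <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
    </analyzer>
    ```

    With this chain, "Industrie-Reiniger" should be indexed as "industrie", "industri", "reiniger" and "reinig", so both *reinig and *reiniger can find it. Existing documents have to be reindexed before the change takes effect, because the tokens already stored in the index were produced by the old chain.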