I hope you can help me, because this problem is driving me crazy.
To keep it simple: I have documents with a field named name_text_de_de, which has the following content:
name_text_de_de
Industrie-Reiniger
Katalysator-Reiniger
Flächenreiniger
UNIVERSALREINIGER
FELGENREINIGER-GEL
This is not the full list, just a sample.
If I use this query, I get the results above: q=name_text_de_de:*reinig
But I get NO results if I use the following query: q=name_text_de_de:*reiniger
That does not make sense to me.
What could be the problem here?
Thanks in advance,
Fide
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory" />
<!-- <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="lang/dictionary_de_de.txt" /> -->
<filter class="solr.ManagedStopFilterFactory" managed="de" />
<!-- <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de_de.txt" /> -->
<filter class="solr.LowerCaseFilterFactory" />
<!-- <filter class="solr.KeywordRepeatFilterFactory" /> -->
<filter class="solr.KeywordMarkerFilterFactory" protected="lang/protwords_de_de.txt" />
<!-- <filter class="solr.SnowballPorterFilterFactory" language="German" /> -->
<!-- <filter class="solr.SnowballPorterFilterFactory" language="German2" /> -->
<!-- <filter class="solr.GermanStemFilterFactory" /> -->
<!-- <filter class="solr.GermanLightStemFilterFactory" /> -->
<filter class="solr.GermanMinimalStemFilterFactory" />
<!-- <filter class="solr.GermanNormalizationFilterFactory" /> -->
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory" />
<!-- <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="lang/dictionary_de_de.txt" /> -->
<filter class="solr.ManagedSynonymGraphFilterFactory" managed="de_de" />
<filter class="solr.ManagedStopFilterFactory" managed="de_de" />
<!-- <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de_de.txt" /> -->
<filter class="solr.LowerCaseFilterFactory" />
<!-- <filter class="solr.KeywordRepeatFilterFactory" /> -->
<filter class="solr.KeywordMarkerFilterFactory" protected="lang/protwords_de_de.txt" />
<!-- <filter class="solr.SnowballPorterFilterFactory" language="German" /> -->
<!-- <filter class="solr.SnowballPorterFilterFactory" language="German2" /> -->
<!-- <filter class="solr.GermanStemFilterFactory" /> -->
<!-- <filter class="solr.GermanLightStemFilterFactory" /> -->
<filter class="solr.GermanMinimalStemFilterFactory" />
<!-- <filter class="solr.GermanNormalizationFilterFactory" /> -->
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
</fieldType>
<fieldType name="text_de_de" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory" />
<!-- <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="lang/dictionary_de_de.txt" /> -->
<filter class="solr.ManagedStopFilterFactory" managed="de_de" />
<!-- <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de_de.txt" /> -->
<filter class="solr.LowerCaseFilterFactory" />
<!-- <filter class="solr.KeywordRepeatFilterFactory" /> -->
<filter class="solr.KeywordMarkerFilterFactory" protected="lang/protwords_de_de.txt" />
<!-- <filter class="solr.SnowballPorterFilterFactory" language="German" /> -->
<!-- <filter class="solr.SnowballPorterFilterFactory" language="German2" /> -->
<!-- <filter class="solr.GermanStemFilterFactory" /> -->
<!-- <filter class="solr.GermanLightStemFilterFactory" /> -->
<filter class="solr.GermanMinimalStemFilterFactory" />
<!-- <filter class="solr.GermanNormalizationFilterFactory" /> -->
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory" />
<!-- <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="lang/dictionary_de_de.txt" /> -->
<filter class="solr.ManagedSynonymGraphFilterFactory" managed="de_de" />
<filter class="solr.ManagedStopFilterFactory" managed="de_de" />
<!-- <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de_de.txt" /> -->
<filter class="solr.LowerCaseFilterFactory" />
<!-- <filter class="solr.KeywordRepeatFilterFactory" /> -->
<filter class="solr.KeywordMarkerFilterFactory" protected="lang/protwords_de_de.txt" />
<!-- <filter class="solr.SnowballPorterFilterFactory" language="German" /> -->
<!-- <filter class="solr.SnowballPorterFilterFactory" language="German2" /> -->
<!-- <filter class="solr.GermanStemFilterFactory" /> -->
<!-- <filter class="solr.GermanLightStemFilterFactory" /> -->
<filter class="solr.GermanMinimalStemFilterFactory" />
<!-- <filter class="solr.GermanNormalizationFilterFactory" /> -->
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
</fieldType>
<fieldType name="text_spell_de" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.ManagedStopFilterFactory" managed="de_de" />
<!-- <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de_de.txt" /> -->
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.ManagedSynonymGraphFilterFactory" managed="de_de" />
<filter class="solr.ManagedStopFilterFactory" managed="de_de" />
<!-- <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de_de.txt" /> -->
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
</fieldType>
<fieldType name="text_spell_de_de" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.ManagedStopFilterFactory" managed="de_de" />
<!-- <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de_de.txt" /> -->
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.ManagedSynonymGraphFilterFactory" managed="de_de" />
<filter class="solr.ManagedStopFilterFactory" managed="de_de" />
<!-- <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de_de.txt" /> -->
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
</fieldType>
The problem is that wildcard terms skip most of the analysis chain, including the stemmer, so your query term is not stemmed the way the indexed text was.
For example, the token reiniger is truncated to reinig by the stem filter at index time, so it can't match *reiniger (which is left unstemmed), because no token in the index ends with "reiniger".
Input stream | Indexed tokens
-------------------------|--------------------------
"Industrie-Reiniger" | "industri", "reinig"
"Katalysator-Reiniger" | "katalysato", "reinig"
"Flächenreiniger" | "flachenreinig"
"UNIVERSALREINIGER" | "universalreinig"
"FELGENREINIGER-GEL" | "felgenreinig", "gel"
To make wildcard queries and fuzzy search work properly with stemmers (and other filters that may truncate tokens), you need to add the KeywordRepeatFilterFactory before the stemmer in the analysis chain:
Emits each token twice, one with the KEYWORD attribute and once without.
If placed before a stemmer, the result will be that you will get the unstemmed token preserved on the same position as the stemmed one. Queries matching the original exact term will get a better score while still maintaining the recall benefit of stemming. Another advantage of keeping the original token is that wildcard truncation will work as expected.
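Applied to the schema in the question, the change could look roughly like the fragment below. It is a minimal sketch of the index analyzer for text_de_de, assuming the rest of the field type stays as posted: the only difference is that the KeywordRepeatFilterFactory line, which is currently commented out, is enabled ahead of the stemmer. The comments are mine, and the commented-out alternatives from the original are left out for brevity.

```xml
<!-- Sketch only: index analyzer of text_de_de with KeywordRepeatFilterFactory enabled. -->
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory" />
  <filter class="solr.ManagedStopFilterFactory" managed="de_de" />
  <filter class="solr.LowerCaseFilterFactory" />
  <!-- Emits every token twice: one copy flagged as a keyword, one unflagged. -->
  <filter class="solr.KeywordRepeatFilterFactory" />
  <filter class="solr.KeywordMarkerFilterFactory" protected="lang/protwords_de_de.txt" />
  <!-- Stems only the unflagged copy; the keyword copy is left as it is. -->
  <filter class="solr.GermanMinimalStemFilterFactory" />
  <!-- Drops the second copy when stemming did not change the token. -->
  <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
```

Because this changes index-time analysis, the documents have to be reindexed before it has any effect. After that, the index should contain reiniger next to reinig at the same position, so *reiniger has an unstemmed token to match.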