I'm using Lucene 3.5 with SpanishAnalyzer (which itself uses SpanishStemmer and StandardTokenizer).
When SpanishAnalyzer indexes a document containing, for example, the words "claramente" and "claro", both are indexed as "clar".
This behavior is understood and suits my needs. Today, before querying, I use the Analyzer's tokenStream + incrementToken() to get the token for my search term, and I search for that token in the indexed documents. I'm not using QueryParser; I build Lucene query objects in code.
However, I want the ability to search for the exact word (in this example, claro) without losing the morphological abilities of the SpanishAnalyzer.
I can skip the step above (tokenStream) and search for "claro" directly, but it will not be found because it is indexed as "clar".
Also, I do not want to index the field twice with two different analyzers, because I need to be able to use a PhraseQuery or SpanNearQuery containing one exact word and one regular (morphological) term.
So, getting to the point: I thought of modifying the Tokenizer, Stemmer, or Filter (?) so that at indexing time it indexes two tokens for each word, the stemmed one and the original one (in this case "clar" and "claro"). Later, when querying, I can choose whether to use the exact word or the stemmed token.
I need help understanding how (and where) I can do that. I guess the change should be made somewhere in the Stemmer.
By the way, I do exactly the same with a Hebrew analyzer that returns several tokens for each word in the text when using incrementToken() (but I don't have its source code).
You need an index with multiple tokens per position, because you want to search phrases that mix stemmed tokens and non-stemmed (i.e. original) tokens. I will answer for version 5.3, but 3.5 is not very different.
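Take a look at the source code of the ReversedWildcardFilter in Solr. You will see the two steps it performs on each token: it emits one form of the token, and then, on the next call, it emits a second form at the same position (position increment 0).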
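The general idea of emitting two tokens for one input token can be modeled in plain Java without any Lucene classes. This is only a rough sketch of the pattern, not Solr's code: `Token`, `RepeatFilter`, `variant` and `describe` are hypothetical names, and the "variant" is just the reversed string. In a real TokenFilter the saved token would typically be kept with captureState()/restoreState(), and the position would come from PositionIncrementAttribute.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class TwoTokensPerPosition {
    /** Simplified stand-in for a Lucene token: term text plus position increment. */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Emits every input term twice: unchanged, then as a variant at the same position. */
    static final class RepeatFilter implements Iterator<Token> {
        private final Iterator<String> input;
        private Token saved; // second token, waiting to be emitted on the next call

        RepeatFilter(Iterator<String> input) {
            this.input = input;
        }

        @Override
        public boolean hasNext() {
            return saved != null || input.hasNext();
        }

        @Override
        public Token next() {
            if (saved != null) {
                // Step 2: replay the saved token; increment 0 keeps it at the same position.
                Token token = saved;
                saved = null;
                return token;
            }
            // Step 1: emit the original term and remember its variant for the next call.
            String term = input.next();
            saved = new Token(variant(term), 0);
            return new Token(term, 1);
        }

        /** Hypothetical variant: the reversed term. */
        static String variant(String term) {
            return new StringBuilder(term).reverse().toString();
        }
    }

    /** Runs the filter and renders each token as "position:term". */
    public static List<String> describe(List<String> terms) {
        List<String> out = new ArrayList<>();
        int position = -1;
        RepeatFilter filter = new RepeatFilter(terms.iterator());
        while (filter.hasNext()) {
            Token token = filter.next();
            position += token.positionIncrement;
            out.add(position + ":" + token.term);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [0:claro, 0:oralc, 1:que, 1:euq]
        System.out.println(describe(Arrays.asList("claro", "que")));
    }
}
```

Both forms of each word end up at the same position, which is what lets a phrase query match on either one.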
For your SpanishAnalyzer, this would mean something like the following:
The core of SpanishAnalyzer is the SpanishLightStemFilter. The SpanishLightStemFilter only stems tokens for which KeywordAttribute.isKeyword() is false. So at index time, insert a KeywordRepeatFilter into the SpanishAnalyzer chain and mark the stemmed tokens with a prefix.
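To show how the resulting index could be queried, here is a small plain-Java model of it. It does not use Lucene at all, and everything in it is an assumption made for illustration: `toyStem` is a crude suffix stripper (not SpanishLightStemFilter), `STEM_PREFIX` is an arbitrary marker, and `phraseMatches` stands in for an exact (slop 0) PhraseQuery. Each word is stored twice at the same position, once as-is and once stemmed with the prefix.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class StemmedAndExactIndex {
    /** Hypothetical marker for stemmed tokens; pick something the tokenizer never emits. */
    static final String STEM_PREFIX = "#";

    /** Toy stemmer for this example only; NOT Lucene's Spanish stemmer. */
    public static String toyStem(String word) {
        for (String suffix : new String[] {"amente", "os", "as", "o", "a"}) {
            if (word.endsWith(suffix) && word.length() > suffix.length() + 2) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    /** Builds term -> positions, storing the original and the prefixed stem at the same position. */
    public static Map<String, Set<Integer>> index(String text) {
        Map<String, Set<Integer>> postings = new HashMap<>();
        int position = 0;
        for (String word : text.toLowerCase().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(word, k -> new TreeSet<>()).add(position);
            postings.computeIfAbsent(STEM_PREFIX + toyStem(word), k -> new TreeSet<>()).add(position);
            position++;
        }
        return postings;
    }

    /** Query term that matches only the exact word. */
    public static String exact(String word) {
        return word.toLowerCase();
    }

    /** Query term that matches any word with the same (toy) stem. */
    public static String stemmed(String word) {
        return STEM_PREFIX + toyStem(word.toLowerCase());
    }

    /** True if the terms occur at consecutive positions, like a phrase query with slop 0. */
    public static boolean phraseMatches(Map<String, Set<Integer>> postings, String... terms) {
        if (terms.length == 0) {
            return false;
        }
        for (int start : postings.getOrDefault(terms[0], Collections.emptySet())) {
            boolean all = true;
            for (int i = 1; i < terms.length && all; i++) {
                all = postings.getOrDefault(terms[i], Collections.emptySet()).contains(start + i);
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> doc = index("habla claramente");
        // Stemmed "hablo" followed by exact "claro": the document has no exact "claro".
        System.out.println(phraseMatches(doc, stemmed("hablo"), exact("claro")));
        // Stemmed "hablo" followed by stemmed "claro": both stems are present in order.
        System.out.println(phraseMatches(doc, stemmed("hablo"), stemmed("claro")));
    }
}
```

At query time you then pick, per term, either the plain word (exact match) or the prefixed stem (morphological match), and both kinds can be combined in the same PhraseQuery or SpanNearQuery because they share positions.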