I have a module based on Apache Lucene 5.5 / 6.0 which retrieves keywords. Everything works fine except for one thing: Lucene doesn't filter stop words.
I tried to enable stop word filtering with two different approaches.
Approach #1:
tokenStream = new StopFilter(new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))), EnglishAnalyzer.getDefaultStopSet());
tokenStream.reset();
Approach #2:
tokenStream = new StopFilter(new ClassicFilter(new LowerCaseFilter(stdToken)), StopAnalyzer.ENGLISH_STOP_WORDS_SET);
tokenStream.reset();
The full code is available here:
https://stackoverflow.com/a/36237769/462347
My questions:
Why doesn't Lucene filter stop words?
How can I enable stop word filtering in Lucene 5.5 / 6.0?
The pitfall was Lucene's default stop word list: I expected it to be much broader than it is. The filter was working, but common words that are not on the default list passed straight through.
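To make the pitfall concrete, here is a small plain-Java sketch with no Lucene dependency. The word list is my hand-copied recollection of the 33 entries behind `StopAnalyzer.ENGLISH_STOP_WORDS_SET` in Lucene 5.x, so treat it as an assumption and check it against your Lucene version. The class name `DefaultStopSetDemo` and the method `removeStopWords` are hypothetical, and the tokenization is a crude regex split, not Lucene's tokenizer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class DefaultStopSetDemo {

    // Assumption: hand-copied from Lucene 5.x's default English stop set.
    // Verify against the Lucene version you actually use.
    static final Set<String> DEFAULT_ENGLISH_STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
            "if", "in", "into", "is", "it", "no", "not", "of", "on", "or",
            "such", "that", "the", "their", "then", "there", "these", "they",
            "this", "to", "was", "will", "with"));

    // Lowercases the text, splits on non-word characters and drops every
    // token found in stopWords. This only imitates what a StopFilter does.
    static List<String> removeStopWords(String text, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(removeStopWords(
                "We have seen that you can do this", DEFAULT_ENGLISH_STOP_WORDS));
    }
}
```

With this list only "that" and "this" are removed from the sample sentence; "we", "have", "seen", "you", "can" and "do" all survive, which is exactly the behavior that looked like "stop words are not filtered".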
Here is the code that first tries to load a customized stop word list and falls back to the standard one if loading fails:
CharArraySet stopWordsSet;
try {
    // Use the customized stop word list (one word per line)
    String stopWordsDictionary = FileUtils.readFileToString(new File(%PATH_TO_FILE%));
    stopWordsSet = WordlistLoader.getWordSet(new StringReader(stopWordsDictionary));
} catch (IOException e) {
    // Fall back to the standard stop word list if the file is missing or unreadable.
    // IOException also covers FileNotFoundException, and both calls above can throw it.
    stopWordsSet = CharArraySet.copy(StandardAnalyzer.STOP_WORDS_SET);
}
tokenStream = new StopFilter(new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))), stopWordsSet);
tokenStream.reset();
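For anyone who wants to try the load-or-fall-back logic without Lucene or Commons IO on the classpath, here is a minimal stdlib-only sketch of the same idea. It assumes the custom list holds one word per line, which is the layout I believe `WordlistLoader.getWordSet` reads. The names `StopWordLoader`, `parseLines` and `loadOrDefault` are hypothetical and not part of any library.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopWordLoader {

    // Turns raw lines into a word set: trims each line and skips blank ones.
    static Set<String> parseLines(List<String> lines) {
        Set<String> words = new HashSet<>();
        for (String line : lines) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    // Reads the custom list (assumed UTF-8, one word per line). If the file
    // is missing or unreadable, returns a copy of the fallback set instead.
    static Set<String> loadOrDefault(Path customList, Set<String> fallback) {
        try {
            return parseLines(Files.readAllLines(customList, StandardCharsets.UTF_8));
        } catch (IOException e) {
            // java.nio reports a missing file as NoSuchFileException, which is
            // an IOException but not a FileNotFoundException.
            return new HashSet<>(fallback);
        }
    }
}
```

Catching `IOException` rather than only `FileNotFoundException` is a deliberate choice here: an unreadable or malformed file then falls back to the default list as well, instead of propagating an exception.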