java, apache, lucene, information-retrieval, stop-words

Apache Lucene doesn't filter stop words despite the usage of StopAnalyzer and StopFilter


I have a module based on Apache Lucene 5.5 / 6.0 which retrieves keywords. Everything is working fine except one thing — Lucene doesn't filter stop words.

I tried to enable stop word filtering with two different approaches.

Approach #1:

tokenStream = new StopFilter(new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))), EnglishAnalyzer.getDefaultStopSet());
tokenStream.reset();

Approach #2:

tokenStream = new StopFilter(new ClassicFilter(new LowerCaseFilter(stdToken)), StopAnalyzer.ENGLISH_STOP_WORDS_SET);
tokenStream.reset();

The full code is available here:
https://stackoverflow.com/a/36237769/462347

My questions:

  1. Why doesn't Lucene filter stop words?

  2. How can I enable stop word filtering in Lucene 5.5 / 6.0?


Solution

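  • The pitfall was Lucene's default stop word list: I expected it to be much broader than it is. The filter was working all along, but the default English set contains only a few dozen very common words, so most of the words I wanted removed passed straight through.

    Here is the code that first tries to load a customized stop word list and falls back to the standard one if the file is not found: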

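    CharArraySet stopWordsSet;
    
    try {
        // Prefer the customized stop word list: a plain-text file with one word per line.
        // The words should be lowercase, because the filter chain lowercases tokens first.
        String stopWordsDictionary = FileUtils.readFileToString(new File(%PATH_TO_FILE%));
        stopWordsSet = WordlistLoader.getWordSet(new StringReader(stopWordsDictionary));
    } catch (FileNotFoundException e) {
        // No custom file: fall back to Lucene's standard (short) stop word list.
        // Other I/O errors are not caught here and propagate to the caller.
        stopWordsSet = CharArraySet.copy(StandardAnalyzer.STOP_WORDS_SET);
    }
    
    // StopFilter is the outermost filter, so it sees lowercased, ASCII-folded tokens
    tokenStream = new StopFilter(new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))), stopWordsSet);
    tokenStream.reset();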
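
To illustrate why the original pipeline looked as if it ignored stop words, here is a small sketch in plain Java. It does not use the Lucene API at all: it hardcodes the 33 words that, as far as I know, make up the default English stop set in Lucene 5.x / 6.x, and applies them to a sample sentence. The class name, the method name, the sample sentence and the simplified tokenization are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class DefaultStopWordsDemo {

    // Assumption: these are the contents of Lucene's default English stop set (5.x / 6.x).
    static final Set<String> DEFAULT_STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
            "if", "in", "into", "is", "it", "no", "not", "of", "on", "or",
            "such", "that", "the", "their", "then", "there", "these", "they",
            "this", "to", "was", "will", "with"));

    // Lowercases the text, splits it on non-letters and drops stop words.
    // This is only a rough stand-in for LowerCaseFilter + StopFilter.
    static List<String> removeStopWords(String text, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String text = "I have a module which retrieves the keywords";
        // prints [i, have, module, which, retrieves, keywords]
        System.out.println(removeStopWords(text, DEFAULT_STOP_WORDS));
    }
}
```

With the default list only "a" and "the" are dropped, while words such as "i", "have" and "which" remain, which is consistent with the behaviour described in the question. A broader custom list is what removes them.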