java lucene

Cannot remove stop words using StandardAnalyzer from Apache Lucene


I use the code below to remove stop words from a string, but it is not working:

package com.example;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class Main {
    public static void main(String[] args) throws IOException {
        String text = "The quick brown fox jumps over the lazy dog";

        Analyzer analyzer = new StandardAnalyzer();
        TokenStream tokenStream = analyzer.tokenStream("field", text);
        CharTermAttribute charTermAttr = tokenStream.addAttribute(CharTermAttribute.class);

        tokenStream.reset();
        List<String> tokens = new ArrayList<>();
        while (tokenStream.incrementToken()) {
            tokens.add(charTermAttr.toString());
        }
        tokenStream.end();

        System.out.println("Tokens: " + tokens);
    }
}

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.example</groupId>
    <artifactId>demo</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>21</maven.compiler.source>
        <maven.compiler.target>21</maven.compiler.target>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>10.0.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-queryparser</artifactId>
            <version>10.0.0</version>
        </dependency>


        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analysis-common</artifactId>
            <version>10.0.0</version>
        </dependency>
    </dependencies>
</project>

Expected result: Tokens: [quick, brown, fox, jumps, lazy, dog]

Real result: Tokens: [the, quick, brown, fox, jumps, over, the, lazy, dog]

As you can see, the Lucene version is 10.0.0 (the latest version at the time of writing) and the Java version is 21 (the current LTS release).

The documentation at https://lucene.apache.org/core/10_0_0/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html says of the default StandardAnalyzer constructor: "Builds an analyzer with no stop words." However, the stop words are still present in my output. Does anyone know what is happening?

I have looked at examples on the Internet and read the Apache Lucene 10.0.0 documentation.


Solution

  • The StandardAnalyzer constructor you are using is this:

    Analyzer analyzer = new StandardAnalyzer();
    

    As you note in your question, this constructor "builds an analyzer with no stop words".

    That means the analyzer does not have a list of stopwords - and therefore does not have any information about what stopwords you want to remove/ignore when you build your index.

    (It doesn't mean "there will be no stopwords in your index" - it actually means the opposite: There will be no stopwords removed from your index.)


    You can use one of the other constructors, which allow you to provide that missing list.

    For example, StandardAnalyzer(CharArraySet stopWords)

    A simple example:

    import org.apache.lucene.analysis.CharArraySet;
    
    ...
    
    CharArraySet stopWords = new CharArraySet(2, true); 
    stopWords.add("foo");
    stopWords.add("bar");
    
    Analyzer analyzer = new StandardAnalyzer(stopWords);
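
    Putting that together with the sentence from the question, a minimal sketch might look like the following. The class name CustomStopWordsDemo, the analyze helper and the choice of "the" and "over" as stopwords are assumptions made for illustration; substitute whatever list suits your data.

    ```java
    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.CharArraySet;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    class CustomStopWordsDemo {
        // Runs the text through the analyzer and collects the emitted terms.
        static List<String> analyze(Analyzer analyzer, String text) {
            List<String> tokens = new ArrayList<>();
            try (TokenStream ts = analyzer.tokenStream("field", text)) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    tokens.add(term.toString());
                }
                ts.end();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return tokens;
        }

        static List<String> run() {
            // Illustrative stopword list; "true" makes lookups ignore case.
            CharArraySet stopWords = new CharArraySet(List.of("the", "over"), true);
            try (Analyzer analyzer = new StandardAnalyzer(stopWords)) {
                return analyze(analyzer, "The quick brown fox jumps over the lazy dog");
            }
        }

        public static void main(String[] args) {
            System.out.println("Tokens: " + run());
        }
    }
    ```

    With those two stopwords this should print Tokens: [quick, brown, fox, jumps, lazy, dog], which matches the expected result in the question. StandardAnalyzer lowercases tokens before filtering, so the leading "The" is removed as well.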
    

    Or you can use StandardAnalyzer(Reader stopwords). In this case you can supply the stopwords from a file, for example: a simple text file with one stopword on each line.
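
    A sketch of the Reader-based constructor, assuming the stopwords are written in lowercase, one per line. A StringReader stands in for the file here so the example is self-contained; the file name stopwords.txt mentioned in the comment is hypothetical, as are the class and method names.

    ```java
    import java.io.IOException;
    import java.io.Reader;
    import java.io.StringReader;
    import java.io.UncheckedIOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    class ReaderStopWordsDemo {
        // Builds a StandardAnalyzer from the stopword source, then tokenizes the text.
        static List<String> analyze(Reader stopWordSource, String text) {
            List<String> tokens = new ArrayList<>();
            try (Analyzer analyzer = new StandardAnalyzer(stopWordSource);
                 TokenStream ts = analyzer.tokenStream("field", text)) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    tokens.add(term.toString());
                }
                ts.end();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return tokens;
        }

        static List<String> run() {
            // For a real file (hypothetical name), something like
            // Files.newBufferedReader(Path.of("stopwords.txt")) could be passed instead.
            Reader stopWordSource = new StringReader("the\nover\n");
            return analyze(stopWordSource, "The quick brown fox jumps over the lazy dog");
        }

        public static void main(String[] args) {
            System.out.println("Tokens: " + run());
        }
    }
    ```

    Keeping the entries lowercase is the safe choice, because StandardAnalyzer lowercases each token before comparing it against the stopword list.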


    There is a list of stopwords built into Lucene, but it is used directly by the EnglishAnalyzer, not the StandardAnalyzer.

    So, you could use that analyzer if you wanted to.
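
    To compare the options on the sentence from the question, here is a sketch that passes the built-in set (EnglishAnalyzer.ENGLISH_STOP_WORDS_SET) to a StandardAnalyzer and also runs EnglishAnalyzer itself. The class and helper names are illustrative only. It relies on the lucene-analysis-common dependency that is already in the question's pom.xml.

    ```java
    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    class BuiltInStopWordsDemo {
        static final String TEXT = "The quick brown fox jumps over the lazy dog";

        // Runs the text through the analyzer and collects the emitted terms.
        static List<String> analyze(Analyzer analyzer, String text) {
            List<String> tokens = new ArrayList<>();
            try (TokenStream ts = analyzer.tokenStream("field", text)) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    tokens.add(term.toString());
                }
                ts.end();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return tokens;
        }

        // StandardAnalyzer with Lucene's built-in English stopword set: no stemming.
        static List<String> withStandardAnalyzer() {
            try (Analyzer analyzer = new StandardAnalyzer(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET)) {
                return analyze(analyzer, TEXT);
            }
        }

        // EnglishAnalyzer uses the same set by default, and also stems the tokens.
        static List<String> withEnglishAnalyzer() {
            try (Analyzer analyzer = new EnglishAnalyzer()) {
                return analyze(analyzer, TEXT);
            }
        }

        public static void main(String[] args) {
            System.out.println("StandardAnalyzer + English stopwords: " + withStandardAnalyzer());
            System.out.println("EnglishAnalyzer: " + withEnglishAnalyzer());
        }
    }
    ```

    The first line should show [quick, brown, fox, jumps, over, lazy, dog]: "the" is removed, but "over" stays because it is not in the built-in list. EnglishAnalyzer also applies stemming, so its tokens may not be the original word forms. If you need "over" removed too, a custom list is the way to go.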

    For reference, this was a change to Lucene that happened back in version 8: Move ENGLISH_STOP_WORD_SET from StandardAnalyzer to EnglishAnalyzer.

    That makes sense, since the stopwords list in Lucene is in English.

    For this reason, older Lucene code samples may call StandardAnalyzer() and have stopwords removed automatically. Maybe that is what you have seen somewhere.


    The list of English stopwords used by Lucene is defined in the EnglishAnalyzer source code.

    For reference:

    "a", "an", "and", "are", "as", "at", "be", 
    "but", "by", "for", "if", "in", "into", "is",
    "it", "no", "not", "of", "on", "or", "such", 
    "that", "the", "their", "then", "there",
    "these", "they", "this", "to", "was", "will", "with"