java thread-safety apache-tika language-detection

Are Apache Tika's LanguageDetectors thread-safe?


Consider the following code:

final String[] texts = {
    "Allons, enfants de la Patrie, Le jour de gloire est arrivé",
    "O Tannenbaum, o Tannenbaum, wie treu sind deine Blätter!",
    "..."
};

final LanguageDetector ld = new OptimaizeLangDetector();  // or e.g. OpenNLPDetector
ld.loadModels();
Arrays.stream(texts).parallel().forEach(text -> System.out.println(ld.detect(text)));

Can I assume that ld.detect() and ld.detectAll() are thread-safe and can be run in parallel on multiple texts using a single LanguageDetector instance?

The thing that makes me worry is that LanguageDetector has methods like addText(), hasEnoughText() and reset() which make it stateful, and therefore - by definition - non-thread-safe...

https://tika.apache.org/2.7.0/api/org/apache/tika/language/detect/LanguageDetector.html


Solution

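  • Immutability is the simplest way to make a class thread-safe: if no instance method changes any member after construction, an instance can be shared freely between threads. A class with mutable state is thread-safe only if every access to that state is properly synchronized.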

    Reading the source of org.apache.tika.langdetect.optimaize.OptimaizeLangDetector, we see this instance method:

    public void reset() {
        writer.reset();
    }
    

    which modifies the member

    private CharArrayWriter writer;
    

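    and with that the state of the OptimaizeLangDetector instance. The text to analyze is accumulated in this per-instance buffer, so threads sharing one detector can reset or append to each other's text. Hence OptimaizeLangDetector is not thread-safe: use a separate instance per thread, or synchronize all access to a shared one.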
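One way to keep the parallel stream from the question is to give each thread its own detector through a ThreadLocal. The sketch below is only an illustration under stated assumptions: BufferingDetector is a hypothetical stand-in, not a Tika class. It mimics the reset()/addText()/detect() pattern with a CharArrayWriter buffer and simply echoes the buffered text, so the example runs without Tika on the classpath. PerThreadDetection and its detectAll helper are likewise made-up names.

```java
import java.io.CharArrayWriter;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a stateful detector (NOT a Tika class).
// Like the field shown above, it accumulates text in a CharArrayWriter.
class BufferingDetector {
    private final CharArrayWriter writer = new CharArrayWriter();

    void reset() {
        writer.reset();
    }

    void addText(String text) {
        writer.append(text);
    }

    // Toy "detection": returns whatever is currently buffered.
    String detect() {
        return writer.toString();
    }

    // Convenience method built from the three stateful steps.
    String detect(String text) {
        reset();
        addText(text);
        return detect();
    }
}

class PerThreadDetection {
    // One detector per thread, so the mutable buffer is never shared.
    private static final ThreadLocal<BufferingDetector> DETECTOR =
            ThreadLocal.withInitial(BufferingDetector::new);

    // Maps each input text to the result produced for it.
    static Map<String, String> detectAll(List<String> texts) {
        Map<String, String> results = new ConcurrentHashMap<>();
        texts.parallelStream()
             .forEach(text -> results.put(text, DETECTOR.get().detect(text)));
        return results;
    }
}
```

With a real Tika detector, the supplier passed to ThreadLocal.withInitial would construct the detector and load its models instead. Model loading is likely to be the expensive part, so a small pool of detectors, or a synchronized block around a single shared instance, may be a better trade-off if memory matters more than throughput.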