opennlp lemmatization

OpenNLP: Unable to locate the model file for Lemmatizer


Summary: I cannot find the model file used by the lemmatizer (english-lemmatizer.bin).

Details: OpenNLP Tools Models appears to be a comprehensive repository of the models used by the various components of the Apache OpenNLP library. However, I cannot find the model file en-lemmatizer.bin, which is used with the lemmatizer. The Apache OpenNLP Developer Manual provides the following code snippet for the lemmatization step:

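import java.io.FileInputStream;
import java.io.InputStream;

try (InputStream dictLemmatizer = new FileInputStream("english-lemmatizer.bin")) {
    // the stream is closed automatically at the end of this block
}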

However, unlike the other model files, this one has no download location that I can find. Any pointers would be appreciated.


Solution

  • The book "Natural Language Processing with Java Cookbook" by Richard M. Reese provides a good answer. For some reason, en-lemmatizer.bin is not available for direct download from the web, but it can be created using the following steps:

    1. Download and untar apache-opennlp-1.9.0-bin.tar (https://opennlp.apache.org/download.html)

    2. Go to the URL for the Lemmatizer Training File and save the text content as en-lemmatizer.dict

    3. Go to the bin directory (from step 1, after untarring) and execute the following command:

    opennlp LemmatizerTrainerME -model en-lemmatizer.bin -lang en -data /path/to/en-lemmatizer.dict -encoding UTF-8


    Note: Be prepared to handle the following error:

    Computing event counts... Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
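
Before running the training command in step 3, it may be worth confirming that the file saved in step 2 is plain tab-separated text and not, for example, an HTML page saved by the browser. The sketch below assumes each non-blank line carries three tab-separated fields (token, POS tag, lemma); that layout is an assumption about en-lemmatizer.dict, not something stated above, and LemmaDictCheck and countMalformed are hypothetical names that are not part of OpenNLP.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

/** Hypothetical helper (not part of OpenNLP): sanity-checks a lemmatizer training file. */
class LemmaDictCheck {

    /** Counts non-blank lines that do not have exactly three tab-separated fields. */
    static long countMalformed(List<String> lines) {
        long bad = 0;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // blank lines are ignored (assumed to be separators)
            }
            // Assumed layout: token <TAB> POS tag <TAB> lemma
            if (line.split("\t").length != 3) {
                bad++;
            }
        }
        return bad;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: LemmaDictCheck /path/to/en-lemmatizer.dict");
            return;
        }
        // Reading as UTF-8 matches the -encoding UTF-8 flag of the training command;
        // a file in another encoding may fail here with a decoding error.
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        System.out.println(lines.size() + " lines read, "
                + countMalformed(lines) + " without three tab-separated fields");
    }
}
```

On Java 11 or later this can be run directly with `java LemmaDictCheck.java /path/to/en-lemmatizer.dict`. A large count of malformed lines suggests the download in step 2 did not capture the raw text.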