Tags: java, nlp, language-model, lingpipe

Incremental language model training with LingPipe


I'm trying to train a DynamicLMClassifier.createNGramProcess(categories, nGram) on a big dataset (> 20GB). I'm currently feeding the entire training file as a String to the training methods, and for obvious reasons I'm getting a java.lang.OutOfMemoryError: Java heap space.

Although it might be possible to increase the JVM heap size to support such training, I'm interested in finding an incremental method.

The training code looks like this:

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;

import com.aliasi.classify.Classification;
import com.aliasi.classify.Classified;
import com.aliasi.util.Strings;

// ...

char[] csBuf = new char[numChars];
for (int i = 0; i < categories.length; ++i) {
    String category = categories[i];
    File trainingFile = new File(new File(dataDir, category),
                                 category + ".txt");
    FileInputStream fileIn
        = new FileInputStream(trainingFile);
    InputStreamReader reader
        = new InputStreamReader(fileIn, Strings.UTF8);
    // read() may return fewer chars than requested; use the actual count
    int charsRead = reader.read(csBuf);
    String text = new String(csBuf, 0, charsRead);
    Classification c = new Classification(category);
    Classified<CharSequence> classified
        = new Classified<CharSequence>(text, c);
    classifier.handle(classified);
    reader.close();
}

The ideal solution would be to feed classifier.handle() in a loop over N subsets of the training set. In theory this should be possible, since the model only needs to remember n-gram tuples with their respective counts to compute the MLE.
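A minimal sketch of the chunked loop I have in mind, reusing the setup above (the 1M-char chunk size is an arbitrary choice, and each chunk boundary loses the few n-grams that span it):

char[] buf = new char[1 << 20];  // ~1M chars per chunk; tune as needed
for (String category : categories) {
    File trainingFile = new File(new File(dataDir, category),
                                 category + ".txt");
    InputStreamReader reader
        = new InputStreamReader(new FileInputStream(trainingFile),
                                Strings.UTF8);
    try {
        Classification c = new Classification(category);
        int charsRead;
        while ((charsRead = reader.read(buf)) != -1) {
            // hand the model one chunk at a time instead of the whole file
            String chunk = new String(buf, 0, charsRead);
            classifier.handle(new Classified<CharSequence>(chunk, c));
        }
    } finally {
        reader.close();
    }
}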


Solution

  • Yes, you can train these classifiers incrementally. You just need to write your own data handler that doesn't try to read all the data in at once. The code above doesn't buffer all the data; it reads each file in once per training item, so that should work. If you're still running out of memory, it's probably because building a language model over 20GB simply takes a lot of memory when the contexts are long or you don't explicitly prune as you go.
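    For example, you can prune low-count n-grams between batches of training data. A rough sketch, assuming lmForCategory() on DynamicLMClassifier and prune() on the underlying TrieCharSeqCounter; check the javadoc for your LingPipe version:

        import com.aliasi.lm.NGramProcessLM;

        // after every few batches, drop rare n-grams to cap memory use
        for (String category : categories) {
            NGramProcessLM lm = classifier.lmForCategory(category);
            lm.substringCounter().prune(3); // remove n-grams with count < 3
        }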

    I wrote a paper on how LingPipe's scaling works for language models, and the incremental classifiers just build a bunch of parallel language models.

    http://www.aclweb.org/anthology/W05-1107

    An even more extreme alternative that can save memory is to train each category separately and then combine the models into a classifier afterward; the LingPipe API supports this as well.
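    A sketch of that per-category approach. The LMClassifier (categories, models, distribution) constructor and MultivariateEstimator.train(String, long) are my reading of the javadoc, and trainOnFileInChunks() is a hypothetical helper that streams one file through lm.train() chunk by chunk, as in the loop in the question; verify the API against your LingPipe version:

        import com.aliasi.classify.LMClassifier;
        import com.aliasi.lm.NGramProcessLM;
        import com.aliasi.stats.MultivariateEstimator;

        NGramProcessLM[] lms = new NGramProcessLM[categories.length];
        MultivariateEstimator catDist = new MultivariateEstimator();
        for (int i = 0; i < categories.length; ++i) {
            lms[i] = new NGramProcessLM(nGram);
            // stream the category's file through lms[i].train(chunk),
            // chunk by chunk (hypothetical helper, not LingPipe API)
            trainOnFileInChunks(lms[i], categories[i]);
            catDist.train(categories[i], 1L); // or real per-category counts
        }
        LMClassifier<NGramProcessLM,MultivariateEstimator> joint
            = new LMClassifier<NGramProcessLM,MultivariateEstimator>(
                categories, lms, catDist);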