python · nltk · nltk-trainer

How to train NLTK PunktSentenceTokenizer batchwise?


I am trying to split financial documents into sentences. I have ~50,000 documents containing plain English text. The total file size is ~2.6 GB.

I am using NLTK's PunktSentenceTokenizer with the standard English pickle file. I additionally tweaked it by providing additional abbreviations, but the results are still not accurate enough.
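
A sketch of this kind of tweaking (assuming the pretrained english.pickle shipped with NLTK and Punkt's internal _params.abbrev_types set; the abbreviations are just examples) looks roughly like this:

import nltk

# load the pretrained English Punkt model shipped with NLTK
tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")

# Punkt stores abbreviations lowercased and without the trailing period
tokenizer._params.abbrev_types.update({"approx", "fig", "incl", "e.g"})

print(tokenizer.tokenize("Revenue grew approx. 5% in Q3. See fig. 2 for details."))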

Since NLTK's PunktSentenceTokenizer is based on the unsupervised algorithm by Kiss & Strunk (2006), I am trying to train the sentence tokenizer on my own documents, following the training data format for NLTK Punkt.

import nltk.tokenize.punkt
import pickle
import codecs

tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
# reads the whole (concatenated) corpus into memory at once
text = codecs.open("someplain.txt", "r", "utf8").read()
tokenizer.train(text)
out = open("someplain.pk", "wb")
pickle.dump(tokenizer, out)
out.close()

Unfortunately, when running the code, I got an error saying there was not sufficient memory (mainly because I had first concatenated all the files into one big file).

Now my questions are:

  1. How can I train the algorithm batchwise, and would that lead to lower memory consumption?
  2. Can I use the standard English pickle file and do further training on that already-trained object?

I am using Python 3.6 (Anaconda 5.2) on Windows 10, on a machine with a Core i7 2600K and 16 GB of RAM.


Solution

  • I found this question after running into this problem myself. I figured out how to train the tokenizer batchwise and am leaving this answer for anyone else looking to do this. I was able to train a PunktSentenceTokenizer on roughly 200 GB of biomedical text in around 12 hours, with a memory footprint no greater than 20 GB at a time. Nevertheless, I'd like to second @colidyre's recommendation to prefer other tools over the PunktSentenceTokenizer in most situations.

    There is a class PunktTrainer you can use to train the PunktSentenceTokenizer in a batchwise fashion.

    from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer
    

    Suppose we have a generator that yields a stream of training texts:

    texts = text_stream()
    

    In my case, each iteration of the generator queries a database for 100,000 texts at a time, then yields all of these texts concatenated together.
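
    For illustration, a file-based text_stream could look like the sketch below (the corpus path and batch size are just placeholders, not the database code described above):

    import glob

    def text_stream(pattern="corpus/*.txt", batch_size=1000):
        """Yield batches of training text, one concatenated chunk at a time."""
        paths = sorted(glob.glob(pattern))
        for start in range(0, len(paths), batch_size):
            batch = []
            for path in paths[start:start + batch_size]:
                with open(path, encoding="utf8") as f:
                    batch.append(f.read())
            yield "\n\n".join(batch)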

    We can instantiate a PunktTrainer and then begin training:

    trainer = PunktTrainer()
    for text in texts:
        trainer.train(text)
        trainer.freq_threshold()
    

    Notice the call to the freq_threshold method after processing each text. This reduces the memory footprint by cleaning up information about rare tokens that are unlikely to influence future training.

    Once this is complete, call the finalize_training method. Then you can instantiate a new tokenizer using the parameters found during training.

    trainer.finalize_training()
    tokenizer = PunktSentenceTokenizer(trainer.get_params())
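
    If you want to reuse the result the same way as the standard pickle file, you can serialize the trained tokenizer and call its tokenize method (the file name below is just an example):

    import pickle

    # save the trained tokenizer for later reuse
    with open("financial_punkt.pk", "wb") as out:
        pickle.dump(tokenizer, out)

    # split text into sentences with the trained model
    sentences = tokenizer.tokenize("Revenue rose 5% vs. the prior year. Margins improved.")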
    

    @colidyre recommended using spaCy with added abbreviations. However, it can be difficult to know in advance which abbreviations will appear in your text domain. To get the best of both worlds, you can add the abbreviations found by Punkt. You can get a set of these abbreviations in the following way:

    params = trainer.get_params()
    abbreviations = params.abbrev_types
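
    One way to feed these into spaCy is to register each abbreviation plus its trailing period as a tokenizer special case, so the period stays attached to the token. This is a sketch, assuming the en_core_web_sm model is installed; whether tokenizer special cases are sufficient for your domain is something to verify.

    import spacy
    from spacy.attrs import ORTH

    nlp = spacy.load("en_core_web_sm")

    for abbr in abbreviations:
        # Punkt stores abbreviations lowercased and without the trailing period;
        # special cases are case-sensitive, so add the casings you expect to see.
        for form in (abbr + ".", abbr.capitalize() + "."):
            nlp.tokenizer.add_special_case(form, [{ORTH: form}])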