I am trying to split financial documents into sentences. I have ~50,000 documents containing plain English text; the total file size is ~2.6 GB.
I am using NLTK's PunktSentenceTokenizer with the standard English pickle file. I also tweaked it by providing additional abbreviations, but the results are still not accurate enough.
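Concretely, the tweak looked roughly like this (the extra abbreviations below are only placeholders for the ones I actually added):

import nltk

# Load the standard English Punkt model and extend its abbreviation set.
# Punkt stores abbreviations lowercased and without the trailing period.
tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
tokenizer._params.abbrev_types.update({"approx", "e.g", "i.e", "no", "vs"})

print(tokenizer.tokenize("Revenue grew approx. 5% vs. the prior year. Costs fell."))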
Since NLTK's PunktSentenceTokenizer is based on the unsupervised algorithm by Kiss & Strunk (2006), I am trying to train the sentence tokenizer on my own documents, following the training data format for nltk punkt.
import nltk.tokenize.punkt
import pickle
import codecs

# Train a fresh Punkt tokenizer on the (concatenated) plain-text corpus.
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
text = codecs.open("someplain.txt", "r", "utf8").read()
tokenizer.train(text)

# Pickle the trained tokenizer for later reuse.
out = open("someplain.pk", "wb")
pickle.dump(tokenizer, out)
out.close()
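For completeness, the plan was then to reload the pickled tokenizer and use it along these lines:

import pickle

with open("someplain.pk", "rb") as f:
    trained_tokenizer = pickle.load(f)

sentences = trained_tokenizer.tokenize("Net income was USD 1.2 bn in Q3. Margins improved slightly.")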
Unfortunately, when running the code, I got an error saying there was not sufficient memory (mainly because I had first concatenated all the files into one big file).
Now my questions are:
I am using Python 3.6 (Anaconda 5.2) on Windows 10, on a Core i7-2600K machine with 16 GB RAM.
I found this question after running into this problem myself. I figured out how to train the tokenizer batchwise and am leaving this answer for anyone else looking to do this. I was able to train a PunktSentenceTokenizer on roughly 200 GB of biomedical text in around 12 hours with a memory footprint no greater than 20 GB at a time. Nevertheless, I'd like to second @colidyre's recommendation to prefer other tools over the PunktSentenceTokenizer in most situations.
There is a class, PunktTrainer, that you can use to train the PunktSentenceTokenizer in a batchwise fashion.
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer
Suppose we have a generator that yields a stream of training texts:
texts = text_stream()
In my case, each iteration of the generator queries a database for 100,000 texts at a time, then yields all of these texts concatenated together.
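If you are working from files on disk rather than a database, a simple generator along these lines does the same job (the directory name and batch size here are just assumptions):

from pathlib import Path

def text_stream(directory="documents", batch_size=1000):
    """Yield one concatenated batch of training text per iteration."""
    batch = []
    for path in sorted(Path(directory).glob("*.txt")):
        batch.append(path.read_text(encoding="utf8"))
        if len(batch) == batch_size:
            yield "\n\n".join(batch)
            batch = []
    if batch:
        yield "\n\n".join(batch)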
We can instantiate a PunktTrainer and then begin training:
trainer = PunktTrainer()
for text in texts:
    trainer.train(text)
    # Prune statistics about rare tokens after each batch to keep memory bounded.
    trainer.freq_threshold()
Notice the call to the freq_threshold method after processing each text. This reduces the memory footprint by cleaning up information about rare tokens that are unlikely to influence future training.
Once this is complete, call the finalize_training method. Then you can instantiate a new tokenizer using the parameters found during training:
trainer.finalize_training()
tokenizer = PunktSentenceTokenizer(trainer.get_params())
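At this point you can tokenize as usual, and you may want to pickle the trained parameters so the training run doesn't have to be repeated (the file name below is arbitrary):

import pickle

sentences = tokenizer.tokenize("The U.S. market closed higher on Friday. Volumes were thin.")

# Persist only the learned parameters; after unpickling they can be wrapped
# in a new PunktSentenceTokenizer.
with open("punkt_params.pk", "wb") as out:
    pickle.dump(trainer.get_params(), out)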
@colidyre recommended using spaCy with added abbreviations. However, it can be difficult to know in advance which abbreviations will appear in your text domain. To get the best of both worlds, you can add the abbreviations found by Punkt. You can get a set of these abbreviations in the following way:
params = trainer.get_params()
abbreviations = params.abbrev_types
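As a rough sketch of the "best of both worlds" idea, you could register each of these abbreviations as a tokenizer special case in spaCy, so the trailing period stays attached to the token and the rule-based sentencizer no longer breaks on it. The blank English pipeline, the spaCy v3 API, and the period handling below are my assumptions, not part of @colidyre's suggestion:

import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Punkt stores abbreviations lowercased and without the trailing period, so
# re-attach the period before registering them. Special cases are
# case-sensitive, so capitalised variants may need to be added as well.
for abbrev in abbreviations:
    nlp.tokenizer.add_special_case(abbrev + ".", [{ORTH: abbrev + "."}])

# Assuming "approx" was learned as an abbreviation during training.
doc = nlp("Profits rose approx. 20% in fiscal 2018. Margins were flat.")
print([sent.text for sent in doc.sents])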