
What to download in order to make nltk.tokenize.word_tokenize work?


I am going to use nltk.tokenize.word_tokenize on a cluster where my account has a very tight disk quota. At home, I downloaded all NLTK resources with nltk.download(), but, as I found out, they take ~2.5 GB.

This seems like overkill to me. Could you suggest the minimal (or nearly minimal) set of dependencies for nltk.tokenize.word_tokenize? So far I've seen nltk.download('punkt'), but I am not sure whether it is sufficient or how large it is. What exactly should I run to make it work?


Solution

  • You are right: you need the Punkt Tokenizer Models. They are about 13 MB, and nltk.download('punkt') should do the trick. See the sketch below for a minimal setup.
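
For reference, here is a minimal sketch of what that looks like. The download_dir argument is optional but handy on a cluster with a quota, since it lets you place the data wherever you have space (the path used here is just an example):

    import nltk

    # Fetch only the Punkt sentence tokenizer models (~13 MB)
    # instead of the full ~2.5 GB nltk.download() bundle.
    # download_dir is optional; by default data goes to ~/nltk_data.
    nltk.download('punkt', download_dir='/path/with/space/nltk_data')

    # If you used a custom directory, tell NLTK where to look.
    nltk.data.path.append('/path/with/space/nltk_data')

    from nltk.tokenize import word_tokenize

    print(word_tokenize("NLTK's word_tokenize needs only the punkt models."))
    # ['NLTK', "'s", 'word_tokenize', 'needs', 'only', 'the', 'punkt', 'models', '.']

Note that recent NLTK releases switched word_tokenize to a new 'punkt_tab' resource; if 'punkt' alone raises a LookupError, running nltk.download('punkt_tab') as well should resolve it.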