I am going to use `nltk.tokenize.word_tokenize` on a cluster where my account has a very tight disk quota. At home, I downloaded all NLTK resources with `nltk.download()`, but, as I found out, they take up ~2.5 GB. That seems like overkill to me. Could you suggest the minimal (or nearly minimal) set of dependencies for `nltk.tokenize.word_tokenize`? So far I've come across `nltk.download('punkt')`, but I'm not sure whether it is sufficient or how large it is. What exactly should I run to make it work?
You are right. You need the Punkt tokenizer models. They are about 13 MB, and `nltk.download('punkt')` should do the trick.
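A minimal sketch of what to run. The `download_dir` argument and the `nltk.data.path` line are only needed if you want to keep the data under a specific directory (the path shown is just an example); with the default location you can drop both:

```python
import nltk

# Fetch only the Punkt tokenizer models (~13 MB) instead of the full ~2.5 GB.
# download_dir lets you place the data wherever your quota allows.
nltk.download('punkt', download_dir='/path/to/nltk_data')

# If you used a non-default directory, tell NLTK where to look for it.
nltk.data.path.append('/path/to/nltk_data')

from nltk.tokenize import word_tokenize
print(word_tokenize("This now works with only the punkt models installed."))
```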