python nlp nltk

How do I test whether an nltk resource is already installed on the machine running my code?


I just started my first NLTK project and am confused about the proper setup. I need several resources, such as the Punkt tokenizer and the maxent POS tagger. I downloaded them myself using the GUI started by nltk.download(). For my collaborators, of course, I want these resources to be downloaded automatically. I haven't found any idiomatic code for that in the documentation.

Am I supposed to just put nltk.data.load('tokenizers/punkt/english.pickle') and the like into the code? Is this going to download the resources every time the script is run? Should I give the user (i.e. my co-developers) feedback about what is being downloaded and why it is taking so long? There MUST be something out there that does the job, right? :)

Edit: To clarify my question:
How do I test whether an nltk resource (like the Punkt Tokenizer) is already installed on the machine running my code, and install it if it is not?


Solution

  • You can use the nltk.data.find() function, see https://github.com/nltk/nltk/blob/develop/nltk/data.py:

    >>> import nltk
    >>> nltk.data.find('tokenizers/punkt.zip')
    ZipFilePathPointer(u'/home/alvas/nltk_data/tokenizers/punkt.zip', u'')
    

    When the resource is not available, nltk.data.find() raises a LookupError:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python2.7/dist-packages/nltk-3.0a3-py2.7.egg/nltk/data.py", line 615, in find
        raise LookupError(resource_not_found)
    LookupError: 
    **********************************************************************
      Resource u'punkt.zip' not found.  Please use the NLTK Downloader
      to obtain the resource:  >>> nltk.download()
      Searched in:
        - '/home/alvas/nltk_data'
        - '/usr/share/nltk_data'
        - '/usr/local/share/nltk_data'
        - '/usr/lib/nltk_data'
        - '/usr/local/lib/nltk_data'
    **********************************************************************
    

    To make sure your collaborators have the package, you would most probably look it up first and call nltk.download() only if the lookup fails:

    >>> try:
    ...     nltk.data.find('tokenizers/punkt')
    ... except LookupError:
    ...     nltk.download('punkt')
    ... 
    [nltk_data] Downloading package punkt to /home/alvas/nltk_data...
    [nltk_data]   Package punkt is already up-to-date!
    True
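
If several resources are needed, the try/except pattern above can be wrapped in a small helper. The following is only a sketch, not an official NLTK facility: the name `ensure_resources`, the `REQUIRED` mapping, and the `find`/`download` parameters are hypothetical and introduced here for illustration. It assumes that each resource path passed to `nltk.data.find()` corresponds to one package id accepted by `nltk.download()`, and that the maxent tagger lives under `taggers/maxent_treebank_pos_tagger`; check the ids against your NLTK version.

```python
def ensure_resources(resources, find=None, download=None):
    """Download every NLTK package whose resource path cannot be found.

    `resources` maps a resource path (as passed to nltk.data.find) to a
    package id (as passed to nltk.download). Returns the list of package
    ids that had to be downloaded.

    `find` and `download` are hypothetical hooks that default to the NLTK
    functions; they exist only so the helper can be exercised with stubs.
    """
    if find is None or download is None:
        import nltk  # imported lazily, only when the real functions are needed
        find = find or nltk.data.find
        download = download or nltk.download

    downloaded = []
    for path, package in resources.items():
        try:
            find(path)  # raises LookupError when the resource is missing
        except LookupError:
            # Tell co-developers what is being fetched and why.
            print(f"NLTK resource {path!r} is missing; downloading {package!r}...")
            if not download(package):  # nltk.download() returns False on failure
                raise RuntimeError(f"could not download NLTK package {package!r}")
            downloaded.append(package)
    return downloaded


# Hypothetical resource list for this project; the paths and ids are
# assumptions and may differ between NLTK versions.
REQUIRED = {
    "tokenizers/punkt": "punkt",
    "taggers/maxent_treebank_pos_tagger": "maxent_treebank_pos_tagger",
}

# Typical use at program start-up:
# ensure_resources(REQUIRED)
```

Because the lookup happens first, nothing is downloaded on later runs once the resources are in one of the searched nltk_data directories, and the printed message gives collaborators feedback on the first run.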