python docker nltk

Docker NLTK Download


I am building a docker container using the following Dockerfile:

FROM ubuntu:14.04

RUN apt-get update

RUN apt-get install -y python python-dev python-pip

ADD . /app

RUN apt-get install -y python-scipy

RUN pip install -r /app/requirements.txt

EXPOSE 5000

WORKDIR /app

CMD python app.py

Everything goes well until I run the image and get the following error:

**********************************************************************
  Resource u'tokenizers/punkt/english.pickle' not found.  Please
  use the NLTK Downloader to obtain the resource:  >>>
  nltk.download()
  Searched in:
    - '/root/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - u''
**********************************************************************

I have had this problem before, and it has been discussed elsewhere, but I am not sure how to approach it with Docker. I have tried:

CMD python
CMD import nltk
CMD nltk.download()

as well as:

CMD python -m nltk.downloader -d /usr/share/nltk_data popular

But I am still getting the error.


Solution

  • In your Dockerfile, try adding this instead:

    RUN python -m nltk.downloader punkt

    This runs the command while the image is being built and installs the requested files to //nltk_data/, so they are already in the image when the container starts.

    The problem is most likely the use of CMD instead of RUN in the Dockerfile. The documentation for CMD says:

    The main purpose of a CMD is to provide defaults for an executing container.

    CMD is therefore used when the container starts (docker run <image>), not during the build. Only the last CMD in a Dockerfile takes effect, so the earlier CMD lines were overridden by the final CMD python app.py line and the download never ran.
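
Putting this together, the Dockerfile from the question might look roughly like the sketch below. It rests on a few assumptions that are not confirmed by the question: that nltk is listed in requirements.txt (it has to be installed before the downloader can run), that requirements.txt ends up at /app/requirements.txt, and that /usr/share/nltk_data is an acceptable target. That directory is chosen only because it appears in the search list printed in the error message; the -d option is the same one used in the question's own attempt.

```dockerfile
FROM ubuntu:14.04

RUN apt-get update

RUN apt-get install -y python python-dev python-pip

RUN apt-get install -y python-scipy

ADD . /app

# Assumption: requirements.txt includes nltk and lives in /app.
RUN pip install -r /app/requirements.txt

# Build-time step: download the punkt tokenizer into a directory
# that NLTK searches (see the list in the error message).
RUN python -m nltk.downloader -d /usr/share/nltk_data punkt

EXPOSE 5000

WORKDIR /app

# Run-time default: the only CMD, executed on `docker run <image>`.
CMD python app.py
```

Because the download happens in a RUN step, the data is stored in an image layer and does not need to be fetched again each time a container is started.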