google-cloud-platform, google-cloud-dataflow, apache-beam, google-data-api

nltk dependencies in dataflow


I know that external Python dependencies can be fed into Dataflow via the requirements.txt file, and I can successfully load nltk in my Dataflow script. However, nltk often needs further data files to be downloaded (e.g. stopwords or punkt). On a local run of the script, I can just run

nltk.download('stopwords')
nltk.download('punkt')

and these files will be available to the script. How do I make these files available to the workers as well? It seems like it would be extremely inefficient to place those commands into a DoFn/CombineFn when they only need to happen once per worker. What part of the script is guaranteed to run once on every worker? That would probably be the place to put the download commands.

According to this, Java allows staging resources via the classpath. That's not quite what I'm looking for in Python. I'm also not looking for a way to load additional Python resources; I just need nltk to find its data files.


Solution

  • You can probably use `--setup_file setup.py` to run these custom commands: https://cloud.google.com/dataflow/pipelines/dependencies-python#pypi-dependencies-with-non-python-dependencies. Does this work in your case?
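
    A minimal sketch of such a setup.py, following the "PyPI dependencies with non-Python dependencies" pattern in the linked docs. The package name and the exact download command are assumptions, not from the question; the key idea is that the custom build step runs once per worker, when the package is staged, rather than once per element in a DoFn:

    ```python
    # Hypothetical setup.py sketch (names are illustrative).
    import subprocess

    from distutils.command.build import build as _build

    import setuptools

    # Shell commands to run on each worker while the package is staged.
    # 'python -m nltk.downloader' fetches the nltk data files the pipeline needs.
    CUSTOM_COMMANDS = [
        ["python", "-m", "nltk.downloader", "stopwords", "punkt"],
    ]


    class CustomCommands(setuptools.Command):
        """Runs each command in CUSTOM_COMMANDS as a build step."""

        user_options = []

        def initialize_options(self):
            pass

        def finalize_options(self):
            pass

        def run(self):
            for command in CUSTOM_COMMANDS:
                subprocess.check_call(command)


    class build(_build):
        """Standard build, extended with the custom download step above."""

        sub_commands = _build.sub_commands + [("CustomCommands", None)]


    setuptools.setup(
        name="my-dataflow-pipeline",  # hypothetical name
        version="0.0.1",
        install_requires=["nltk"],
        packages=setuptools.find_packages(),
        cmdclass={
            "build": build,
            "CustomCommands": CustomCommands,
        },
    )
    ```

    You would then launch the pipeline with `--setup_file ./setup.py`. One caveat: by default `nltk.downloader` saves data under the home directory of the user running the build, so if the worker process runs as a different user, you may need the downloader's `-d` option to target a directory on nltk's default search path (e.g. `/usr/local/share/nltk_data`).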