python hadoop anaconda mrjob

Is it possible to use a Conda environment as a "virtualenv" for a Hadoop Streaming job (in Python)?


We are currently using Luigi, MRJob and other frameworks to run Hadoop Streaming jobs in Python. We are already able to ship each job with its own virtualenv, so no job-specific Python dependencies need to be installed on the nodes (see the article). I was wondering whether anyone has done something similar with the Anaconda/Conda package manager.

P.S. I am also aware of Conda-Cluster; however, it looks like a more complex/sophisticated solution (and it is behind a paywall).


Solution

  • Update 2019:

    The answer is yes: package the Conda environment into an archive with conda-pack and ship that archive with the job.

    https://conda.github.io/conda-pack/
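
    As a rough sketch of how this could fit together (not a tested recipe): pack the environment with `conda pack`, upload the archive somewhere the cluster can read it, and pass it to Hadoop Streaming through `-archives` so that each task unpacks it and runs the Python interpreter inside it. The helper functions below only build the two command lines; they do not run anything. The environment name, the jar path, the HDFS paths and the script names are all placeholders chosen for illustration, and the scripts are assumed to sit in the current directory.

```python
import shlex


def conda_pack_command(env_name, archive):
    """Build the `conda pack` call that bundles a named environment into one archive."""
    return ["conda", "pack", "-n", env_name, "-o", archive]


def streaming_command(streaming_jar, archive_uri, mapper, reducer,
                      input_path, output_path, alias="env"):
    """Build a Hadoop Streaming call that runs the scripts with the packed interpreter.

    `-archives URI#alias` asks Hadoop to unpack the archive in each task's
    working directory under the name `alias`, so the interpreter ends up at
    `alias/bin/python`. The generic options (-archives, -files) are placed
    before the streaming options.
    """
    python = f"{alias}/bin/python"
    return [
        "hadoop", "jar", streaming_jar,
        "-archives", f"{archive_uri}#{alias}",
        "-files", f"{mapper},{reducer}",
        "-mapper", f"{python} {mapper}",
        "-reducer", f"{python} {reducer}",
        "-input", input_path,
        "-output", output_path,
    ]


if __name__ == "__main__":
    # Every value below is a hypothetical placeholder, not taken from the question.
    pack = conda_pack_command("my_env", "my_env.tar.gz")
    job = streaming_command(
        streaming_jar="/path/to/hadoop-streaming.jar",
        archive_uri="hdfs:///envs/my_env.tar.gz",
        mapper="mapper.py",
        reducer="reducer.py",
        input_path="/data/input",
        output_path="/data/output",
    )
    print(shlex.join(pack))
    print(shlex.join(job))
```

    The printed commands can be reviewed and then run by hand or from a Luigi task. The archive has to be uploaded to the given HDFS location between the two steps (for example with `hdfs dfs -put`), and the environment should be built for the same OS and architecture as the worker nodes.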