python, python-3.x, dependency-management

Python dependency hell: A compromise between virtualenv and global dependencies?


I've tested various ways to manage my project dependencies in Python so far:

  1. Installing everything globally with pip (saves space, but sooner or later gets you into trouble)
  2. pip & venv or virtualenv (a bit of a pain to manage, but OK for many cases; see the sketch after this list)
  3. pipenv & Pipfile (a little easier than venv/virtualenv, but slow and somewhat vendor-locked; the virtual envs are hidden somewhere other than the actual project folder)
  4. conda as package and environment manager (great as long as the packages are all available in conda, mixing pip & conda is a bit hacky)
  5. Poetry - I haven't tried this one
  6. ...
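
For reference, the plain venv workflow from option 2 looks like this on Windows (a minimal sketch):

    > python -m venv .venv
    > .venv\Scripts\activate
    (.venv)> pip install numpy pandas scipy matplotlib

Each project then carries its own .venv folder, which is exactly where the disk-space problem described below comes from.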

My problem with all of these (except 1.) is that my hard drive space fills up pretty fast: I am not a developer, I use Python for my daily work. Therefore, I have hundreds of small projects that all do their own thing. Unfortunately, for 80% of projects I need the "big" packages: numpy, pandas, scipy, matplotlib - you name it. A typical small project is about 1000 to 2000 lines of code, but has 800 MB of package dependencies in venv/virtualenv/pipenv. All in all, I have about 100+ GB of my HDD filled with Python virtual dependencies.

Moreover, installing all of these in each virtual environment takes time. I am working on Windows, and many packages cannot be easily installed with pip there: Shapely, Fiona, GDAL - I need the precompiled wheels from Christoph Gohlke. That is easy enough, but it breaks most workflows (e.g. pip install -r requirements.txt or pipenv install from a Pipfile). I feel like I spend 40% of my time installing/updating package dependencies and only 60% writing code. Further, none of these package managers really help with publishing & testing code, so I need other tools, e.g. setuptools, tox, semantic-release, twine...
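
For illustration, installing one of those downloaded wheels looks roughly like this (hypothetical file name and download path):

    > pip install C:\Downloads\GDAL-3.0.4-cp36-cp36m-win_amd64.whl

The wheel itself installs fine, but a local file path like this cannot simply be shared in a requirements.txt.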

I've talked to colleagues, but they all face the same problem and no one seems to have a real solution. I was wondering if there is an approach to have some packages, e.g. the ones you use in most projects, installed globally - for example, numpy, pandas, scipy, matplotlib would be installed with pip in C:\Python36\Lib\site-packages or with conda in C:\ProgramData\Miniconda3\Lib\site-packages - these are well-developed packages that rarely break things. And if they do, I would want to fix that in my projects soon anyway.

Other things would go in local virtualenv-folders - I am tempted to move my current workflow from pipenv to conda.

Does such an approach make sense at all? There has been a lot of development in Python packaging lately, so perhaps something has emerged that I haven't seen yet. Is there any best-practice guidance on how to set up files in such a mixed global-local environment, e.g. how to maintain setup.py, requirements.txt or pyproject.toml for sharing development projects through GitLab, GitHub, etc.? What are the pitfalls/caveats?

There's also this great blog post from Chris Warrick that explains it pretty much fully.

[Update 2020]

After half a year, I can say that working with Conda (Miniconda) has solved most of my problems:

[Update 2021]

Since this post still gets many views, here is a subjective 2021 update:

[Update 2023]

I am slowly moving away from conda. pip+venv seems to be the more viable option, and it frequently works better and faster (e.g. pytorch, transformers). Forget Windows: pip only works well in WSL/Linux. For package maintainers, setuptools>=64 now allows pyproject.toml-only packages, finally a unified packaging experience! Get rid of your setup.py's. Otherwise, I am mainly working in Jupyter in Docker these days, where the Python envs are versioned inside Docker containers stored in a registry.
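
As a rough sketch of that pyproject.toml-only workflow (assuming all project metadata already lives in pyproject.toml):

    > pip install build twine
    > python -m build
    > twine upload dist/*

python -m build produces the sdist and wheel in dist/ without any setup.py, and twine uploads them to PyPI.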


Solution

  • Problem

    You have listed a number of issues that no one approach may be able to completely resolve:

    'I need the "big" packages: numpy, pandas, scipy, matplotlib... Virtually I have about 100+ GB of my HDD filled with python virtual dependencies'

    ... installing all of these in each virtual environment takes time

    ... none of these package managers really help with publishing & testing code ...

    I am tempted to move my current workflow from pipenv to conda.

    Thankfully, what you have described is not quite the classic dependency problem that plagues package managers - circular dependencies, pinning dependencies, versioning, etc.


    Details

    I have used conda on Windows for many years now under similar restrictions, with reasonable success. Conda was originally designed to make installing scipy-related packages easier, and it still does.

    If you are using the "scipy stack" (scipy, numpy, pandas, ...), conda is your most reliable choice.

    Conda can:

    Conda can't:


    Reproducible Envs

    The following steps should help reproduce virtualenvs if needed:

    Avoid pip issues

    I was wondering if there is an approach to have some packages, e.g. the ones you use in most projects, installed globally ... Other things would go in local virtualenv-folders

    Non-conda tools
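
    One way to approximate the "global packages plus local virtualenv-folders" idea without conda is a venv created with access to the global site-packages (a minimal sketch using only pip and the standard library):

    > pip install numpy pandas scipy matplotlib
    > python -m venv --system-site-packages .venv
    > .venv\Scripts\activate
    (.venv)> pip install pint

    The globally installed packages are visible inside the env, while anything installed from within the activated env lands in the local .venv folder.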

    conda

    However, if you want to stay with conda, you can try the following:

    A. Make a working environment separate from your base environment, e.g. workenv. Consider this your go-to, "global" env for the bulk of your daily work.

    > conda create -n workenv python=3.7 numpy pandas matplotlib scipy
    > activate workenv
    (workenv)>
    

    B. Test installations of uncommon pip packages (or weighty conda packages) within a clone of the working env

    > conda create --name testenv --clone workenv
    > activate testenv
    (testenv)> pip install pint
    

    Alternatively, make new environments with minimal packages using a requirements.txt file
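
    A sketch of that alternative, assuming requirements.txt lists pip-installable packages and newenv is an arbitrary name:

    > conda create --name newenv python=3.7
    > activate newenv
    (newenv)> pip install -r requirements.txt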

    C. Make a backup of dependencies into a requirements.txt-like file called environment.yml per virtualenv with conda env export (a sketch follows below). Optionally make a script to run this export per environment. See docs on sharing/creating environment files. Create environments in the future from this file:

    > conda env create --name testenv --file environment.yml
    > activate testenv
    (testenv)> conda list
    
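
    The export step from C is run inside each environment to be backed up and could look like this (a minimal sketch with the workenv from step A):

    > activate workenv
    (workenv)> conda env export > environment.yml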

    Publishing

    The packaging problem is an ongoing, separate issue that has gained traction with the advent of the pyproject.toml file via PEP 518 (see the related blog post by author B. Cannon). Packaging tools such as flit or poetry have adopted this modern convention to make distributions and publish them to a server or packaging index (PyPI). The pyproject.toml concept tries to move away from traditional setup.py files and their specific dependence on setuptools.
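
    As a quick illustration of the publishing side (a sketch using poetry; flit offers analogous build and publish commands):

    > poetry build
    > poetry publish

    poetry build creates the sdist and wheel from the pyproject.toml metadata, and poetry publish uploads them to PyPI (credentials have to be configured beforehand).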

    Dependencies

    Tools like pipenv and poetry have a unique modern approach to addressing the dependency problem via a "lock" file. This file allows you to track and reproduce the state of your dependency graphs, something novel in the Python packaging world so far (see more on Pipfile vs. setup.py here). Moreover, there are claims that you can still use these tools in conjunction with conda, although I have not tested the extent of these claims. The lock file isn't standardized yet, but according to core developer B. Cannon in an interview on The future of Python packaging (~33m), "I'd like to get us there." (See Updates.)
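
    For example, a typical pipenv lock-file round trip looks roughly like this (a sketch; poetry's add/lock/install commands are analogous):

    > pipenv install numpy pandas
    > pipenv lock
    > pipenv sync

    pipenv install writes the Pipfile and Pipfile.lock, pipenv lock re-resolves and pins the full dependency graph, and pipenv sync reproduces exactly that pinned environment on another machine.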

    Summary

    If you are working with any package from the scipy stack, use conda (Recommended):

    See Also

    Updates: