python pypi maintainability

Python: Should I save PyPI packages offline as a backup?


My Python projects depend heavily on PyPI packages.
I want to make sure that, at any time in the future, the packages required by my apps will still be available on PyPI.
For example:
I found a project on GitHub that requires PyQt4.
When I tried to run it on my Linux machine,
it crashed on startup because the PyQt4 package could not be found on PyPI.

NB: I know that PyQt4 is deprecated

I searched extensively for an archive of PyPI that still holds the PyQt4 package, but I couldn't find one anywhere.

So I had to rewrite that app to make it work with PyQt5.
I only changed the code related to the UI (i.e. PyQt4).
The other functions still worked.

So the only problem with that app was that the PyQt4 package had been removed from PyPI.



So, my question is: should I save a backup of the PyPI packages I use?

Solution

  • Short version:

    YES, if you want availability. The next big question is how best to keep a backup copy of the dependencies. There are some suggestions at the end of this answer.

    Long version:

    Your question touches on the concept of "availability", which is one of the three pillars of information assurance (or information security). The other two pillars are confidentiality and integrity; together they form the CIA triad.

    PyPI packages are maintained by their owners. A project that lists a package as a dependency must take into account the possibility that the owner will pull the package, or a particular version of it, from PyPI at any moment.

    Important Python packages with many dependents are usually maintained by foundations or organizations that take more care with downstream packages and projects. However, supporting old packages is costly and requires extra effort, so maintainers usually set an end-of-support date, or publish a package lifecycle stating when a specific version will be removed from the public PyPI server.

    Once that happens, the dependents have to update their code (as you did), or provide the original dependency via alternative means.

    This topic is very important for procurement in libraries, universities, laboratories, companies, and government agencies where a software tool might have dependencies on other software packages (or ecosystem), and where "availability" should be addressed adequately.

    Addressing this risk might mean anything from ensuring high availability at all costs, to living with the risk of losing one or more dependencies... A risk management approach should be used to make informed choices affecting the "security" of your project.

    It should also be noted that some packages require binary executables, binary libraries, or access to an online API service, all of which must also be available for the package to work properly. That complicates both the risk analysis and the work needed to address availability.

    Now to make sure that dependencies are always available... I quickly compiled the following list. Note that each option has pros and cons. You should evaluate these and other options based on your needs:

    1. Store the virtual environment along with the code. Once you create a virtual environment and install the packages the project requires in it, you can keep that virtual environment as part of your repository, for example, for posterity.
    2. Host your own PyPI instance (or mirror) and keep a copy of packages you depend upon hosted on it: https://packaging.python.org/en/latest/guides/hosting-your-own-index/
    3. Use an "artifact management tool" such as Artifactory from https://jfrog.com/artifact-management/, where you can host not only Python packages but also Docker images, npm packages, and other kinds of artifacts.
    4. Get the source code of all dependencies, and always build from source.
    5. Create a Docker image where the project works properly and keep backups of the image.
    6. If the package requires an online API service, think about replacing that service or mocking it by one you can control.
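
As a lightweight variant of keeping your own copy of the packages, here is a minimal sketch that saves a project's dependencies into a local folder and later installs from that folder without contacting PyPI. It relies on pip's `download` command and on the `--no-index` and `--find-links` options of `pip install`. The function names, the `requirements.txt` file, and the `wheelhouse` directory name are assumptions made for illustration, not something from the question.

```python
import subprocess
import sys
from pathlib import Path


def download_command(requirements, wheelhouse):
    """Build the `pip download` command that saves every requirement,
    plus its dependencies, as files in the `wheelhouse` directory."""
    return [sys.executable, "-m", "pip", "download",
            "-r", requirements, "-d", wheelhouse]


def offline_install_command(requirements, wheelhouse):
    """Build the `pip install` command that ignores the online index
    and resolves packages only from the local `wheelhouse` directory."""
    return [sys.executable, "-m", "pip", "install",
            "--no-index", "--find-links", wheelhouse,
            "-r", requirements]


def backup(requirements="requirements.txt", wheelhouse="wheelhouse"):
    """Download the dependencies now, while they are still on PyPI.
    Hypothetical helper: both default names are assumptions."""
    Path(wheelhouse).mkdir(parents=True, exist_ok=True)
    subprocess.run(download_command(requirements, wheelhouse), check=True)


def restore(requirements="requirements.txt", wheelhouse="wheelhouse"):
    """Install later from the saved copy, with no access to PyPI."""
    subprocess.run(offline_install_command(requirements, wheelhouse),
                   check=True)
```

Calling `backup()` while the packages are still published, and archiving the resulting folder with the project, lets `restore()` rebuild the environment later. Downloaded wheels are usually specific to a platform and Python version, so a copy made this way is most likely to work on a similar system and interpreter.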