Tags: python, python-import, packaging

What is the best practice for imports when developing a Python package?


I am trying to build a Python package that contains sub-modules and sub-packages ("libraries"). I have looked everywhere for the right way to do it, but surprisingly I find it very complicated. I have also gone through multiple Stack Overflow threads, of course.

The problem is as follows:

  1. In order to import a module or a package from another directory, it seems to me that there are two options: a. Adding the absolute path to sys.path. b. Installing the package with the setuptools.setup function in a setup.py file in the main directory of the package, which installs the package into the site-packages directory of the specific Python version in use.

  2. Option a seems too clumsy to me. Option b is great; however, I find it impractical because I am currently working on and editing the package's source code, and of course the changes are not reflected in the installed copy of the package. In addition, the installed directory of the package is not tracked by Git, and needless to say I use Git in the original directory.

To conclude the question: What is the best practice for importing modules and sub-packages freely and cleanly from within sub-directories of a Python package that is still under construction?

I feel I am missing something but couldn't find a decent solution so far.

Thanks!


Solution

  • This is a great question, and I wish more people would think along these lines. Making a module importable and ultimately installable is absolutely necessary before it can be easily used by others.

    On sys.path munging

    Before I answer I will say I do use sys.path munging when I do initial development on a file outside of an existing package structure. I have an editor snippet that constructs code like this:

    import sys, os
    sys.path.append(os.path.expanduser('~/path/to/parent'))
    from module_of_interest import *  # NOQA
    

    Given the path to the current file, I use:

    import ubelt as ub
    fpath = ub.Path('/home/username/path/to/parent/module_of_interest.py')
    modpath, modname = ub.split_modpath(fpath, check=False)
    modpath = ub.Path(modpath).shrinkuser()  # abstract home directory
    

    This constructs the pieces the snippet inserts into the file so I can interact with it from within IPython. I find that taking a little extra time to remove the reference to my explicit home folder, so the code still works as long as users have the same relative path structure with respect to the home directory, makes this slightly more portable.

    Proper Python Package Management

    That being said, sys.path munging is not a sustainable solution. Ultimately you want your package to be managed by a Python package manager. I know a lot of people use poetry, but I like plain old pip, so that is the process I will describe; just know it isn't the only way to do it.

    To do this we need to go over some basics.

    Basics

    1. You must know what Python environment you are working in. Ideally this is a virtual environment managed with pyenv (or conda, mamba, poetry, ...). It's also possible to do this in your global system Python environment, although that is not recommended. I like working in a single default Python virtual environment that is always activated in my .bashrc; it's always easy to switch to a new one or blow it away and start fresh. (A quick way to check which environment is active is sketched after this list.)

    2. You need to consider two root paths: the root of your repository, which I will call your repo path, and the root of your package, which I will call the package path or module path. The package path should be a folder with the name of the top-level Python package; you will use this name to import it. The package path must live inside the repo path. Some repos, like xdoctest, put the module path in a src directory. Others, like ubelt, keep the module path at the top level of the repository. I think the second case is conceptually easier for new package creators/maintainers, so let's go with that.
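    As a quick way to confirm which environment is actually active (a minimal sketch, independent of whichever environment manager you use):

    import sys
    print(sys.prefix)      # root of the active (virtual) environment
    print(sys.executable)  # the interpreter that will import your package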

    Setting up the repo path

    So now you are in an activated Python virtual environment, and we have designated a path where we will check out the repo. I like to clone repos in $HOME/code, so perhaps the repo path is $HOME/code/my_project.

    In this repo path you should have your root package path. Let's say your package is named mypymod. Any directory that contains an __init__.py file is conceptually a Python module, where the contents of __init__.py are what you get when you import that directory name. The only difference between a directory module (a package) and a normal file module is that a package can have submodules or subpackages.

    For example, if you are in the my_project repo (i.e. when you run ls you see mypymod) and you have a file structure that looks something like this...

    + my_project
        + mypymod
            + __init__.py
            + submod1.py
            + subpkg
                + __init__.py
                + submod2.py
    
    

    you can import the following modules:

    import mypymod
    import mypymod.submod1
    import mypymod.subpkg
    import mypymod.subpkg.submod2
    
    

    If you ensured that your current working directory was always the repo root, or you put the repo root onto sys.path, then this would be all you need. Being visible on sys.path or in the CWD is all that is needed for another module to see your module.
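    For example, this throwaway check (assuming the repo lives at the example path below) is enough for the import to resolve without installing anything:

    import sys
    sys.path.insert(0, '/home/username/code/my_project')  # the repo root, not the package dir
    import mypymod.subpkg.submod2  # resolvable because the repo root is on sys.path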

    The package manifest: setup.py / pyproject.toml

    Now the trick is: how do you ensure your other packages / scripts can always see this module? That is where the package manager comes in. For this we will need a setup.py or the newer pyproject.toml variant. I'll describe the older setup.py way of doing things.

    All you need to do is put the setup.py in your repo root. Note: it does not go in your package directory. There are plenty of resources on how to write a setup.py, so I won't describe it in much detail; basically you just need to populate it with enough information that it knows the name of the package, its location, and its version.

    from setuptools import setup, find_packages
    setup(
        name='mypymod',
        version='0.1.0',
        packages=find_packages(include=['mypymod', 'mypymod.*']),
        install_requires=[],
    )
    
    

    So your package structure will look like this:

    + my_project
        + setup.py
        + mypymod
            + __init__.py
            + submod1.py
            + subpkg
                + __init__.py
                + submod2.py
    
    

    There are plenty of other things you can specify; I recommend looking at ubelt and xdoctest as examples. I'll note they contain a non-standard way of parsing requirements out of requirements.txt or requirements/*.txt files, which I think is generally better than the standard way people handle requirements. But I digress.
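    For illustration, here is a minimal sketch of that idea (the helper name and the parsing rules are simplified, not the exact logic ubelt or xdoctest use):

    # in setup.py, next to a requirements.txt
    from pathlib import Path

    def parse_requirements(fname='requirements.txt'):
        """Return requirement strings, skipping blank lines and comments."""
        lines = Path(fname).read_text().splitlines()
        return [line.strip() for line in lines
                if line.strip() and not line.strip().startswith('#')]

    # then inside setup(): install_requires=parse_requirements()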

    Given something that pip or some other package manager (e.g. pipx, poetry) recognizes as a package manifest - a file that describes the contents of your package - you can now install it. If you are still developing it, you can install it in editable mode: instead of the package being copied into your site-packages, a link back to your source tree is recorded, so any changes in your code are reflected each time you invoke Python (or immediately, if you have autoreload on in IPython).

    With pip it is as simple as running pip install -e <path-to-repo-root>, which is typically done by navigating into the repo root and running pip install -e . (the trailing dot means the current directory).

    Congrats, you now have a package you can reference from anywhere.
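    A quick sanity check (assuming the editable install succeeded and the package is named mypymod as above) is that the import resolves from any directory and points back at your working copy:

    import mypymod
    print(mypymod.__file__)  # should point into ~/code/my_project/mypymod/__init__.py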

    Making the most of your package

    The python -m invocation

    Now you have a package you can reference as if it were installed via pip from PyPI. There are a few tricks for using it effectively. The first is running scripts.

    You don't need to specify a path to a file to run it as a script in Python. It is possible to run a script as __main__ using only its module name, via the -m argument to Python. For instance you can run python -m mypymod.submod1, which will invoke $HOME/code/my_project/mypymod/submod1.py as the main module (i.e. its __name__ attribute will be set to "__main__").
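    For this to work nicely, submod1.py typically guards its script logic behind a __name__ check, something like this sketch (the main function is just an illustrative name):

    # mypymod/submod1.py (sketch)
    def main():
        print('running submod1 as a script')

    if __name__ == '__main__':
        # runs via `python -m mypymod.submod1` or direct execution,
        # but not on a plain `import mypymod.submod1`
        main()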

    Furthermore if you want to do this with a directory module you can make a special file called __main__.py in that directory, and that is the script that will be executed. For instance if we modify our package structure

    + my_project
        + setup.py
        + mypymod
            + __init__.py
            + __main__.py
            + submod1.py
            + subpkg
                + __init__.py
                + __main__.py
                + submod2.py
    

    Now python -m mypymod will execute $HOME/code/my_project/mypymod/__main__.py and python -m mypymod.subpkg will execute $HOME/code/my_project/mypymod/subpkg/__main__.py. This is a very handy way to make a module double as both an importable package and a command line executable (e.g. xdoctest does this).
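    A minimal __main__.py can simply delegate to an existing entry point, for example (reusing the hypothetical main function from the earlier sketch):

    # mypymod/__main__.py (sketch)
    from mypymod.submod1 import main

    if __name__ == '__main__':
        main()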

    Easier imports

    One pain point you might notice is that in the above code if you run:

    import mypymod
    mypymod.submod1
    

    You will get an error because by default a package doesn't know about its submodules until they are imported. You need to populate the __init__.py to expose any attributes you desire to be accessible at the top-level. You could populate the mypymod/__init__.py with:

    from mypymod import submod1
    

    And now the above code would work.
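    The same pattern works for exposing individual functions or classes at the top level (the function name here is hypothetical):

    # mypymod/__init__.py (sketch)
    from mypymod import submod1
    from mypymod.subpkg import submod2
    from mypymod.submod1 import some_function  # hypothetical function in submod1

    __all__ = ['submod1', 'submod2', 'some_function']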

    This has a tradeoff though. The more things you make accessible immediately, the longer it takes to import the module, and with big packages this can get fairly costly. Also, you have to manually write the code that exposes what you want, which is a pain if you want everything.

    If you take a look at ubelt's __init__.py you will see it has a good deal of code to explicitly make every function in every submodule accessible at the top level. I've written yet another library called mkinit that automates this process, and it also has the option of using the lazy_loader library to mitigate the performance impact of exposing all attributes at the top level. I find the mkinit tool very helpful when writing large nested packages.
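    For reference, here is a sketch of the lazy_loader pattern in an __init__.py (the exact code mkinit generates differs, and some_function is again a hypothetical name; check the lazy_loader docs for details):

    import lazy_loader as lazy

    # attributes resolve on first access instead of at import time
    __getattr__, __dir__, __all__ = lazy.attach(
        __name__,
        submodules=['submod1', 'subpkg'],
        submod_attrs={'submod1': ['some_function']},
    )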

    Summary

    To summarize the above content:

    1. Make sure you are working in a Python virtualenv (I recommend pyenv)
    2. Identify your "package path" inside of your "repo path".
    3. Put an __init__.py in every directory you want to be a Python package or subpackage.
    4. Optionally, use mkinit to autogenerate the content of your __init__.py files.
    5. Put a setup.py / pyproject.toml in the root of your "repo path".
    6. Use pip install -e . to install the package in editable mode while you develop it.
    7. Use python -m to invoke module names as scripts.

    Hope this helps.