Tags: python, python-3.x, ipython, jupyter-notebook, ipython-parallel

How to correctly import modules on engines in Jupyter Notebook for parallel processing?


I wish to run an embarrassingly parallel function that creates plots (and eventually will save them to a file) using Jupyter Notebook with Python (edit - I found a much simpler way to do exactly this here). I'm trying the simplest version possible and I'm getting an import error.

Where and why should I import the relevant modules? I think I'm importing them everywhere just to be sure, but I still get an error!

The import positions in the files are numbered 1 to 4:

[1] Is this line really necessary? Why?

[2] Is this line really necessary? Why?

[3] Is this line really necessary? Why?

[4] Is this line really necessary? Why?

Below are my files. The Jupyter notebook file:

import ipyparallel
clients = ipyparallel.Client()
print(clients.ids)
dview = clients[:]
with dview.sync_imports():
    import module #[1]
    import matplotlib #[2]
import module #[3]
dview.map_sync(module.pll, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

and a python file name module.py

import matplotlib #[4]
def pll(x):
    matplotlib.pyplot.plot(x, '.')

When I run the notebook I get the following output

[0, 1, 2, 3, 4, 5]
importing module on engine(s)
importing matplotlib on engine(s)
[Engine Exception]
NameError                                 Traceback (most recent call last)
<string> in <module>()
(...)
NameError: name 'matplotlib' is not defined

Solution

  • Short answer

    sync_imports is unnecessary when you use module functions. This should be sufficient:

    # notebook:
    import ipyparallel as ipp
    client = ipp.Client()
    dview = client[:]
    
    import module
    dview.map_sync(module.pll, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
    

    and

    # module.py
    from matplotlib import pyplot
    def pll(x):
        pyplot.plot(x, '.')
    

    One caveat: you will almost certainly want to set up matplotlib to use a non-default backend on the engines, and you must do this before importing pyplot. The two logical choices with IPython parallel are agg if you are just saving to files, or %matplotlib inline if you want to see plots interactively in the notebook. To use agg:

    import matplotlib
    dview.apply_sync(matplotlib.use, 'agg')
    

    or set up inline plotting:

    %px %matplotlib inline
    

    Long answer

    To answer your numbered questions:

    There are two contexts you need to think about when dealing with what needs to be imported and where:

    interactively defined functions

    When a function is defined interactively (that is, the def foo() is in your notebook), name lookup is performed in the interactive namespace, and the interactive namespace on the engines may differ from the one in your notebook. For instance, you can see this with:

    import numpy
    %px numpy = 'whywouldyoudothis'
    
    def return_numpy():
        return numpy # resolved locally *on engines*
    dview.apply_sync(return_numpy)
    

    where the apply will return a list of ['why..'] strings, not your local numpy import. Python doesn't know that names refer to modules or anything else; it's all a matter of what namespace(s) are used for looking up the names. This is why you will often see interactively defined functions that look like one of these:

    import module
    %px import module
    def foo():
        return module.x
    

    or this:

    def foo():
        import module
        return module.x
    

    Both are ways to ensure that module inside foo resolves to the imported module on the engines: the first imports into the interactive namespace everywhere and relies on the global-namespace lookup; the second imports inside the function itself, so it cannot go wrong.

    sync_imports() is a pure-Python way to do the same thing as:

    import module
    %px import module
    

    It imports the module both here and there. If you use sync_imports, it is unnecessary to repeat the import locally as well, as the local import has already been performed.
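
    For instance, a minimal sketch of the question's notebook using only sync_imports (no separate import module afterwards), assuming module.py is the corrected version from the short answer:

    with dview.sync_imports():
        import module                       # imported in the notebook *and* on every engine
    dview.map_sync(module.pll, [0, 1, 2])   # the name 'module' already resolves locally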

    module functions

    If the function is defined in a module, as yours is, it will find globals in its module, not in the interactive namespace. So import matplotlib in your notebook has no effect on whether the matplotlib name is defined when module.pll is called. Similarly, importing matplotlib in the module does not make it available in the interactive namespace.
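
    You can check this from the notebook; a small illustrative sketch (not part of the fix) showing that a module function looks names up in its own module's namespace:

    import module
    module.pll.__globals__ is vars(module)   # True: pll resolves names in module.py's namespace
    'matplotlib' in vars(module)             # depends only on what module.py imports, not on the notebook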

    Something important to consider: when you send a module function to the engines, only a reference to the function is sent, not the contents of the function or module. So if from module import pll resolves to something different on the engines than in the client, you will get different behavior. This can trip people up when working with local modules in IPython parallel while actively changing that module. Reloading the module in the notebook does not reload it on the engines; the same module.pll reference will be sent. So if you are actively working on module.py, you need to call reload(module) (importlib.reload in Python 3) everywhere when that module changes.
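
    A minimal sketch of that reload, assuming Python 3 (where reload lives in importlib) and the dview from the snippets above:

    import importlib
    import module
    importlib.reload(module)   # reload in the notebook
    dview.execute('import importlib, module; importlib.reload(module)')   # reload on each engine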