python joblib

joblib persistence across sessions/machines


Is joblib (https://joblib.readthedocs.io/en/latest/index.html) expected to be reliable across different machines, across different ways of running a function, or even across different sessions on the same machine over time?

For concreteness, if you run the code below in a Jupyter notebook, as a Python script, or piped to the stdin of a Python interpreter, you get a different cache entry each way. The piped version seems to be a special case: it raises a JobLibCollisionWarning, runs every time, and never reads from the cache. The other two each end up with a different path saved in the joblib cache dir, and inside each one the same hash directory (fb65b1dace3932d1e66549411e3310b6) exists.

from joblib import Memory

memory = Memory('./cache', verbose=0)

@memory.cache
def job(x):
    print(f'Running with {x}')
    return x**2

print(job(2))

So the same call produces multiple cache entries. These entries also live in folders whose names contain path information (including what appears to be a tmp directory for the notebook entry, e.g. main--var-folders-3q-ht_2mtk52hl7ydxrcr87z2gr0000gn-T-ipykernel-3189892766), so it looks like all the jobs would be run again if I transferred the cache to another machine. I don't see how that path can be reliable in the long run: the tmpdir could change, or the ipykernel could have some other number associated with it.
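To compare what each way of running leaves behind, it can help to list the directories under the cache root. The following is a minimal, stdlib-only sketch; `list_cache_dirs` is a hypothetical helper name (not part of joblib), and it assumes nothing about joblib's layout beyond the cache being an ordinary directory tree.

```python
import os


def list_cache_dirs(root):
    """Return every directory under root, as sorted paths relative to root."""
    found = []
    for dirpath, dirnames, _filenames in os.walk(root):
        for name in dirnames:
            full = os.path.join(dirpath, name)
            found.append(os.path.relpath(full, root))
    return sorted(found)


if __name__ == "__main__":
    # "./cache" is the location passed to Memory in the snippet above.
    for entry in list_cache_dirs("./cache"):
        print(entry)
```

Running this after each variant (notebook, script, stdin) should make it easy to see which path components differ between them.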

Is this expected?


Solution

  • The canonical way of using joblib's disk caching seems to require always having your function in a .py file, and not in a notebook cell (see e.g. this issue).

    I've found a workaround for using joblib in a Jupyter notebook anyway, so that it hits the cache even if you re-run a cell or restart the notebook (which, as you've found, does not happen by default).

    The trick is to manually set the __module__ of your function to some unique identifier (e.g., the notebook name).

    The following is a wrapper for Memory.cache that does that:

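    def cache(mem, module, **mem_kwargs):
        """Like mem.cache, but files the function under a fixed module name."""
        def cache_(f):
            # joblib reads __module__ and the function name to build the cache
            # path, so pin both to values that do not depend on the kernel.
            f.__module__ = module
            f.__qualname__ = f.__name__
            return mem.cache(f, **mem_kwargs)
        return cache_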
    

    Usage, with your example function:

    from joblib import Memory
    
    mem = Memory('./cache')
    
    @cache(mem, "my_notebook_name")
    def job(x):
        print(f'Running with {x}')
        return x**2
    

    This will save the function's outputs under ./cache/joblib/my_notebook_name/job/ (joblib adds its own joblib/ subdirectory inside the location you pass to Memory).

    (I had this idea while reading joblib's source, specifically get_func_name in func_inspect.py, which reads the function's __module__ and __name__/__qualname__.)
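As a quick check that the wrapper behaves as described, the sketch below calls the cached function twice and records how often the body actually runs. The temporary cache location and the `calls` list are assumptions added for this demo only; the wrapper itself is the one shown above.

```python
import os
import tempfile

from joblib import Memory


def cache(mem, module, **mem_kwargs):
    def cache_(f):
        f.__module__ = module
        f.__qualname__ = f.__name__
        return mem.cache(f, **mem_kwargs)
    return cache_


cache_dir = tempfile.mkdtemp()  # throwaway location, an assumption for this demo
mem = Memory(cache_dir, verbose=0)

calls = []  # records every real execution of job


@cache(mem, "my_notebook_name")
def job(x):
    calls.append(x)
    return x**2


first = job(2)   # computed and written to the cache
second = job(2)  # expected to be read back from the cache

print(first, second, calls)

# List what joblib created under the cache location.
for dirpath, _dirnames, _filenames in os.walk(cache_dir):
    print(os.path.relpath(dirpath, cache_dir))
```

If `calls` holds a single entry after both calls, the second call was served from the cache. Pointing Memory at a persistent directory instead of a temporary one is what lets the entry survive a kernel restart.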