python multiprocessing pathos

How to set chunk size when using pathos ProcessingPool's map?


I'm running into inefficient parallelisation with Pathos' ProcessingPool.map() function: towards the end of processing, a single slow worker works through the last tasks in the list sequentially while the other workers sit idle. I think this is due to "chunking" of the task list.
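To illustrate the chunking suspected here: as far as I understand CPython's implementation, multiprocessing.Pool.map, when chunksize is not given, divides the task count by four times the number of workers and rounds up. The sketch below mirrors that rule in plain Python. The function names default_chunksize and split_into_chunks are hypothetical helpers for illustration and are not part of multiprocessing or Pathos.

```python
def default_chunksize(n_tasks, n_workers):
    """Approximate the chunk size Pool.map picks when chunksize is None.

    Assumption: mirrors CPython's rule of roughly
    ceil(n_tasks / (4 * n_workers)).
    """
    if n_tasks == 0:
        return 0
    chunksize, extra = divmod(n_tasks, n_workers * 4)
    if extra:
        chunksize += 1
    return chunksize


def split_into_chunks(tasks, chunksize):
    """Cut the task list into consecutive chunks of at most `chunksize` items."""
    return [tasks[i:i + chunksize] for i in range(0, len(tasks), chunksize)]


tasks = list(range(100))
size = default_chunksize(len(tasks), n_workers=4)
chunks = split_into_chunks(tasks, size)
print(size, len(chunks))  # prints: 7 15
```

Each chunk is handed to one worker as a unit, so a chunk that happens to contain several slow tasks is processed sequentially by that worker even if the others have finished. With chunksize=1 every task is dispatched individually, at the cost of more inter-process communication.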

When using Python's own multiprocessing.Pool I can resolve this by forcing chunksize=1 when calling map. However, this argument is not supported by Pathos, and the source code suggests this may be an oversight or a todo on the developers' side:

return _pool.map(star(f), zip(*args)) # chunksize

(from Pathos' multiprocessing.py, line 137)

I'd like to keep Pathos because of its ability to work with lambdas.

Is there any way to set the chunk size in Pathos? Is there a workaround using one of Pathos' other poorly documented pool implementations?


Solution

  • I'm the pathos developer. It's not an oversight... you can't use chunksize when using pathos.pools.ProcessingPool. The reason is that I wanted the map functions to have the same interface as python's map... and to do that, based on the multiprocessing implementation, I had to choose between making chunksize a keyword and allowing *args and **kwds. I chose the latter.

    If you want to use chunksize, there is _ProcessPool, which retains the original multiprocessing.Pool interface, but has augmented serialization.

    >>> import pathos
    >>> p = pathos.pools._ProcessPool() 
    >>> p.map(lambda x:x*x, range(4), chunksize=10)
    [0, 1, 4, 9]
    >>> 
    

    I'm sorry you feel the documentation is lacking. The code is primarily composed of a fork of multiprocessing from the python standard library... and I didn't change the documentation where the functionality has been reproduced. For example, here I am recycling the standard library docs, as the functionality is the same:

    >>> p = pathos.pools._ProcessPool()
    >>> print(p.map.__doc__)
    
            Equivalent of `map()` builtin
    
    >>> import multiprocessing
    >>> p = multiprocessing.Pool()
    >>> print(p.map.__doc__)
    
            Equivalent of `map()` builtin
    >>>    
    

    ... and in the cases where I have modified functionality, I did write new docs:

    >>> p = pathos.pools.ProcessPool()
    >>> print(p.map.__doc__)
    run a batch of jobs with a blocking and ordered map
    
    Returns a list of results of applying the function f to the items of
    the argument sequence(s). If more than one sequence is given, the
    function is called with an argument list consisting of the corresponding
    item of each sequence.
    
    >>> 
    

    Admittedly, the docs could be better. The docs inherited from the standard library in particular could be improved. Please feel free to open a ticket on GitHub, or even better, a PR to extend the docs.