I'm running into inefficient parallelisation with Pathos' ProcessingPool.map()
function: Towards the end of the processing, a single slow running worker processes the last tasks in the list sequentially while other workers are idle. I think this is due to "chunking" of the task list.
When using Python's own multiprocessing.Pool
I can resolve this by forcing chunksize=1
when calling map
. However, this argument is not supported by Pathos, and the source code suggests this may be an oversight or a todo on the developers' side:
return _pool.map(star(f), zip(*args)) # chunksize
(from Pathos' multiprocessing.py
, line 137)
I'd like to keep Pathos because of it's ability to work with lamdbas.
Is there any way to get chunk size running in Pathos? Is there a workaround using one of Patho's other poorly documented pool implementations?
I'm the pathos
developer. It's not an oversight... you can't use chunksize
when using pathos.pools.ProcessingPool
. The reason this was done, was that I wanted to have the map
functions have the same interface as python's map
... and to do that, based on the multiprocessing
implementation, I either had to choose to make chunksize
a keyword, or to allow *args
and **kwds
. So I choose the latter.
If you want to use chunksize
, there is _ProcessPool
, which retains the original multiprocessing.Pool
interface, but has augmented serialization.
>>> import pathos
>>> p = pathos.pools._ProcessPool()
>>> p.map(lambda x:x*x, range(4), chunksize=10)
[0, 1, 4, 9]
>>>
I'm sorry you feel the documentation is lacking. The code is primarily composed of a fork of multiprocessing
from the python standard library... and I didn't change the documentation where the functionality has been reproduced. For example, here I am recycling the STL docs, as the functionality is the same:
>>> p = pathos.pools._ProcessPool()
>>> print(p.map.__doc__)
Equivalent of `map()` builtin
>>> p = multiprocessing.Pool()
>>> print(p.map.__doc__)
Equivalent of `map()` builtin
>>>
... and in the cases where I have modified functionality, I did write new docs:
>>> p = pathos.pools.ProcessPool()
>>> print(p.map.__doc__)
run a batch of jobs with a blocking and ordered map
Returns a list of results of applying the function f to the items of
the argument sequence(s). If more than one sequence is given, the
function is called with an argument list consisting of the corresponding
item of each sequence.
>>>
Admittedly, the docs could be better. Especially the docs coming from the STL could be improved upon. Please feel free to add a ticket on GitHub, or even better, a PR to extend the docs.