python-2.7, random, parallel-processing, multiprocessing, embarrassingly-parallel

How to parallelize this piece of code?


I've been browsing for some time but couldn't find any constructive answer that I could comprehend.

How should I parallelize the following code:

import random
import math
import numpy as np
import sys
import multiprocessing

boot = 20  # number of iterations to be performed
def myscript(iteration_number):
    pass  # stuff that the code actually does


def main(unused_command_line_args):
    for i in xrange(boot):
        myscript(i)
    return 0

if __name__ == '__main__':
    sys.exit(main(sys.argv))

Or where can I read about this? I'm not even sure what to search for.


Solution

  • There's pretty much a natural progression from a for loop to parallel execution for a batch of embarrassingly parallel jobs.

    >>> import multiprocess as mp
    >>> # build a target function
    >>> def doit(x):
    ...   return x**2 - 1
    ... 
    >>> x = range(10)
    >>> # the for loop
    >>> y = []   
    >>> for i in x:
    ...   y.append(doit(i))
    ... 
    >>> y
    [-1, 0, 3, 8, 15, 24, 35, 48, 63, 80]
    

    So how to address this function in parallel?

    >>> # convert the for loop to a map (still serial)
    >>> y = map(doit, x)
    >>> y
    [-1, 0, 3, 8, 15, 24, 35, 48, 63, 80]
    >>> 
    >>> # build a worker pool for parallel tasks
    >>> p = mp.Pool()
    >>> # do blocking parallel
    >>> y = p.map(doit, x)
    >>> y
    [-1, 0, 3, 8, 15, 24, 35, 48, 63, 80]
    >>> 
    >>> # use an iterator (non-blocking)
    >>> y = p.imap(doit, x)
    >>> y            
    <multiprocess.pool.IMapIterator object at 0x10358d150>
    >>> print list(y)
    [-1, 0, 3, 8, 15, 24, 35, 48, 63, 80]
    >>> # do asynchronous parallel
    >>> y = p.map_async(doit, x)
    >>> y
    <multiprocess.pool.MapResult object at 0x10358d1d0>
    >>> print y.get()
    [-1, 0, 3, 8, 15, 24, 35, 48, 63, 80]
    >>>
    >>> # or if you like for loops, there's always this…
    >>> y = p.imap_unordered(doit, x)
    >>> z = []
    >>> for i in iter(y):
    ...   z.append(i)
    ... 
    >>> z
    [-1, 0, 3, 8, 15, 24, 35, 48, 63, 80]
    

    The last form is an unordered iterator, which tends to be the fastest… but you can't rely on the order the results come back in -- they are unordered, and not guaranteed to return in the same order they were submitted.

    Note also that I've used multiprocess (a fork of multiprocessing) instead of multiprocessing itself, purely because multiprocess handles interactively defined functions better. Otherwise, the code above works the same with multiprocessing.