python, multiprocessing

multiprocessing.Pool(processes=0) not allowed?


In a parallelized script I would like to keep the ability to do the processing on a single CPU. To do that, it seems I need to duplicate the code that iterates over the results:

#!/usr/bin/env python3
import multiprocessing

# argparsed parameters:
num_cpus = 1
input_filename = 'file.txt'

def worker(line):
    '''Do something with a line'''
    # dummy processing that requires some CPU time
    for x in range(1000000):
        line.split()
    return line.split()

def data_iter(filename):
    '''iterate over the data records'''
    with open(filename) as fh:
        for line in fh:
            yield line

if __name__ == '__main__':
    if (num_cpus >= 2):
        with multiprocessing.Pool( processes = num_cpus - 1 ) as pool:
            for result in pool.imap(worker, data_iter(input_filename)):
                # dummy post-processing (copy#1)
                print(' '.join(result))
    else:
        for line in data_iter(input_filename):
            result = worker(line)
            # dummy post-processing (copy#2)
            print(' '.join(result))

Note: you can use any text file with multiple lines to test the code.

Is there any valid reason for multiprocessing.Pool not allowing processes = 0? Shouldn't processes = 0 just mean not to spawn any additional processes, the main one being enough?
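
For reference, this is the behaviour I'm asking about; a minimal check (the exact error text in the comment is what I see in current CPython, but it may differ between versions):

import multiprocessing

if __name__ == '__main__':
    try:
        multiprocessing.Pool(processes=0)
    except ValueError as e:
        # e.g. "Number of processes must be at least 1"
        print(e)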


POSTSCRIPT

I thought of one plausible "technical" reason for not allowing it:

The worker processes in the Pool can't modify the environment of the main process, which wouldn't be true if the worker function is run without spawning a new process.
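
A small sketch of what I mean (the counter name is just for illustration; the pool results depend on how the tasks get distributed among the workers):

import multiprocessing

counter = 0  # module-level state in the main process

def bump(_):
    '''Increment the module-level counter and return its new value.'''
    global counter
    counter += 1
    return counter

if __name__ == '__main__':
    with multiprocessing.Pool(processes=2) as pool:
        # each worker process increments its own copy of counter
        print(list(pool.imap(bump, range(3))))
    print(counter)  # still 0: the pool workers never touched the main process

    print([bump(x) for x in range(3)])  # in-process: [1, 2, 3]
    print(counter)  # now 3: running the worker in-process modified the main process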


Solution

  • The reason for not allowing multiprocessing.Pool(processes=0) is that a process pool with no processes in it cannot do any work. Such an object is surprising and generally unwanted.

    While it is true that processes=1 will spawn another process, it barely uses more than one CPU, because the main process will just sit and wait for the worker process to complete. It's possible to imagine optimizing processes=1 to have no overhead vs. your custom code, but it would still be processes=1, not processes=0. There's got to be some process to run it on.

    If avoiding that small overhead of waiting for the pool is important, then the solution is to extract a small interface that can be implemented by both the in-process and pooled versions. In this case, that's the interface of builtins.map.

    Then at runtime, you decide which implementation to use.

    def mp_output(worker, my_iter):
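        '''map-like helper: run worker in a process pool and yield the results lazily'''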
        with multiprocessing.Pool( processes = num_cpus - 1 ) as pool:
            yield from pool.imap(worker, my_iter)
    
    
    if __name__ == '__main__':
        mapper = mp_output if num_cpus >= 2 else map
        for result in mapper(worker, data_iter(input_filename)):
            print(' '.join(result))
    

    If you can rely on the fact that you're doing pool.imap from the main function, you don't even need mp_output. You can just use

        mapper = multiprocessing.Pool(
            processes=num_cpus - 1).map if num_cpus >= 2 else map
    

    All the context manager exit does is invoke terminate, but being garbage-collected will also invoke terminate.

    P.S. This approach works with objects as well as functions, and is a version of the Strategy pattern (https://en.wikipedia.org/wiki/Strategy_pattern).
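
    For example, a minimal sketch of the object flavour (the class names SerialMapper and PooledMapper are made up for illustration; worker, data_iter, num_cpus and input_filename are reused from the question's script):

    import multiprocessing

    class SerialMapper:
        '''Strategy: run the worker in the current process.'''
        def map(self, func, iterable):
            return map(func, iterable)

    class PooledMapper:
        '''Strategy: run the worker in a process pool.'''
        def __init__(self, processes):
            self._pool = multiprocessing.Pool(processes=processes)

        def map(self, func, iterable):
            # imap yields results lazily, like the generator version above
            return self._pool.imap(func, iterable)

    if __name__ == '__main__':
        mapper = PooledMapper(num_cpus - 1) if num_cpus >= 2 else SerialMapper()
        for result in mapper.map(worker, data_iter(input_filename)):
            print(' '.join(result))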