I'm trying to run a task in parallel using concurrent.futures. Please find below a piece of code for it:
import os
import time
from concurrent.futures import ProcessPoolExecutor as PE
import concurrent.futures
# num CPUs
cpu_num = len(os.sched_getaffinity(0))
print("Number of cpu available : ",cpu_num)
# max_Worker = cpu_num
max_Worker = 1
# A fake input array
n=1000000
array = list(range(n))
results = []
# A fake function being applied to each element of array
def task(i):
    return i**2
x = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=max_Worker) as executor:
    features = {executor.submit(task, j) for j in array}
    # the real function is heavy and we need to be sure of completeness of each run
    for future in concurrent.futures.as_completed(features):
        results.append(future.result())
    results = [future.result() for future in features]
y = time.time()
print('=========================================')
print(f"Train data preparation time (s): {(y-x)}")
print('=========================================')
And now my questions,
Thanks in advance,
See my comment to your question. On top of the overhead I mentioned in that comment, you also need to account for the overhead of just creating the process pool itself.
The following is a benchmark with several results. The first is the time taken to simply call the worker function task 100000 times, build a results list, and print the last element of that list. It will become apparent why I have reduced the number of calls to task from 1000000 to 100000.
The next attempt uses multiprocessing to accomplish the same thing: a ProcessPoolExecutor with the submit method, followed by processing the Future instances that are returned.
The next attempt instead uses the map method with its default chunksize argument of 1. It is important to understand this argument. With a chunksize value of 1, each element of the iterable passed to map is written individually to a queue of tasks as a chunk to be processed by the processes in the pool. When a pool process becomes idle, it pulls the next chunk of tasks from the queue, processes each task in that chunk, and then becomes idle again. When many tasks are submitted via map, a chunksize value of 1 is inefficient. You would expect its performance to be equivalent to repeatedly issuing submit calls for each element of the iterable.
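To make the effect of chunksize concrete, here is a minimal sketch of how an iterable can be batched into chunks. The helpers make_chunks and count_chunks are hypothetical names used only for illustration; this is not the library's internal code, just the same idea in a few lines.

```python
from itertools import islice


def make_chunks(iterable, chunksize):
    """Yield successive tuples holding at most `chunksize` items (illustrative helper)."""
    it = iter(iterable)
    while True:
        chunk = tuple(islice(it, chunksize))
        if not chunk:
            return
        yield chunk


def count_chunks(n_items, chunksize):
    """Number of chunks, i.e. queue messages, needed to cover n_items."""
    return -(-n_items // chunksize)  # integer ceiling division


if __name__ == '__main__':
    print(list(make_chunks(range(7), 3)))
    print(count_chunks(100_000, 1))     # one queue message per element
    print(count_chunks(100_000, 3125))  # far fewer, larger messages
```

With 100000 elements, a chunksize of 1 means 100000 separate items travel through the task queue, whereas a chunksize of 3125 needs only 32.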
The next attempt specifies a chunksize value that approximates the value that the map function of the Pool class in the multiprocessing package would have used by default. As you can see, the improvement is dramatic, but it is still not an improvement over the non-multiprocessing case.
The final attempt uses the multiprocessing facility provided by the multiprocessing package and its multiprocessing.pool.Pool class. The difference in this benchmark is that its map function uses a more intelligent default chunksize when no chunksize argument is specified.
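As a rough sketch of that default, the heuristic below divides the iterable length by four times the pool size and rounds up. The function name default_pool_chunksize is hypothetical, and the pool size of 8 is taken from the printed output further down; treat this as an approximation of what Pool.map does rather than its exact code.

```python
def default_pool_chunksize(iterable_size, pool_size):
    """Approximate the chunksize Pool.map picks when none is given (illustrative)."""
    chunksize, remainder = divmod(iterable_size, pool_size * 4)
    if remainder:
        chunksize += 1  # round up so every element is covered
    return chunksize


if __name__ == '__main__':
    # 100000 elements spread over 8 workers
    print(default_pool_chunksize(100_000, 8))
```

For 100000 elements and 8 workers this gives 3125, the same value the explicit chunksize benchmark uses.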
import os
import time
from concurrent.futures import ProcessPoolExecutor as PE
from multiprocessing import Pool
# A fake function being applied to each element of array
def task(i):
    return i**2
# required for Windows:
if __name__ == '__main__':
    n=100000
    t1 = time.time()
    results = [task(i) for i in range(n)]
    print('Non-multiprocessing time:', time.time() - t1, results[-1])
    # num CPUs
    cpu_num = os.cpu_count()
    print("Number of CPUs available: ",cpu_num)
    t1 = time.time()
    with PE(max_workers=cpu_num) as executor:
        futures = [executor.submit(task, i) for i in range(n)]
        results = [future.result() for future in futures]
    print('Multiprocessing time using submit:', time.time() - t1, results[-1])
    t1 = time.time()
    with PE(max_workers=cpu_num) as executor:
        results = list(executor.map(task, range(n)))
    print('Multiprocessing time using map:', time.time() - t1, results[-1])
    t1 = time.time()
    chunksize = n // (4 * cpu_num)
    with PE(max_workers=cpu_num) as executor:
        results = list(executor.map(task, range(n), chunksize=chunksize))
    print(f'Multiprocessing time using map: {time.time() - t1}, chunksize: {chunksize}', results[-1])
    t1 = time.time()
    with Pool(cpu_num) as executor:
        results = executor.map(task, range(n))
    print('Multiprocessing time using Pool.map:', time.time() - t1, results[-1])
Prints:
Non-multiprocessing time: 0.027019739151000977 9999800001
Number of CPUs available: 8
Multiprocessing time using submit: 77.34723353385925 9999800001
Multiprocessing time using map: 79.52981925010681 9999800001
Multiprocessing time using map: 0.30500149726867676, chunksize: 3125 9999800001
Multiprocessing time using Pool.map: 0.2799997329711914 9999800001
Update
The following benchmarks use a version of task that is very CPU-intensive and show the benefit of multiprocessing. It would also seem that for this small iterable size (100), forcing a chunksize value of 1 for the Pool.map case (by default it would compute a chunksize value of 4) is slightly more performant.
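One possible reading of the chunksize=1 result, which the benchmark itself does not measure, is load balancing: with few, long-running tasks, larger chunks can leave some workers idle at the end. The sketch below is an idealized estimate under loudly stated assumptions: every task takes the same time, scheduling overhead is zero, and there are 8 workers (as in the printed output). The function name idealized_task_times is hypothetical.

```python
def idealized_task_times(n_tasks, pool_size, chunksize):
    """Upper bound on task-durations the busiest worker runs.

    Assumes equal task durations and no scheduling overhead (illustrative only).
    """
    n_chunks = -(-n_tasks // chunksize)  # ceiling division
    rounds = -(-n_chunks // pool_size)   # chunks each worker handles at most
    return rounds * chunksize


if __name__ == '__main__':
    print(idealized_task_times(100, 8, 4))  # 25 chunks spread over 8 workers
    print(idealized_task_times(100, 8, 1))  # 100 chunks spread over 8 workers
```

Under these assumptions the busiest worker runs 16 task-durations with a chunksize of 4 but only 13 with a chunksize of 1, which points in the same direction as the timings, though real runs also include queueing and process start-up costs.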
import os
import time
from concurrent.futures import ProcessPoolExecutor as PE
from multiprocessing import Pool
# A fake function being applied to each element of array
def task(i):
    for _ in range(1_000_000):
        result = i ** 2
    return result
def compute_chunksize(iterable_size, pool_size):
    chunksize, remainder = divmod(iterable_size, pool_size * 4)
    if remainder:
        chunksize += 1
    return chunksize
# required for Windows:
if __name__ == '__main__':
    n = 100
    cpu_num = os.cpu_count()
    chunksize = compute_chunksize(n, cpu_num)
    t1 = time.time()
    results = [task(i) for i in range(n)]
    t2 = time.time()
    print('Non-multiprocessing time:', t2 - t1, results[-1])
    # num CPUs
    print("Number of CPUs available: ",cpu_num)
    t1 = time.time()
    with PE(max_workers=cpu_num) as executor:
        futures = [executor.submit(task, i) for i in range(n)]
        results = [future.result() for future in futures]
    t2 = time.time()
    print('Multiprocessing time using submit:', t2 - t1, results[-1])
    t1 = time.time()
    with PE(max_workers=cpu_num) as executor:
        results = list(executor.map(task, range(n)))
    t2 = time.time()
    print('Multiprocessing time using map:', t2 - t1, results[-1])
    t1 = time.time()
    with PE(max_workers=cpu_num) as executor:
        results = list(executor.map(task, range(n), chunksize=chunksize))
    t2 = time.time()
    print(f'Multiprocessing time using map: {t2 - t1}, chunksize: {chunksize}', results[-1])
    t1 = time.time()
    with Pool(cpu_num) as executor:
        results = executor.map(task, range(n))
    t2 = time.time()
    print('Multiprocessing time using Pool.map:', t2 - t1, results[-1])
    t1 = time.time()
    with Pool(cpu_num) as executor:
        results = executor.map(task, range(n), chunksize=1)
    t2 = time.time()
    print('Multiprocessing time using Pool.map (chunksize=1):', t2 - t1, results[-1])
Prints:
Non-multiprocessing time: 23.12758779525757 9801
Number of CPUs available: 8
Multiprocessing time using submit: 5.336004018783569 9801
Multiprocessing time using map: 5.364996671676636 9801
Multiprocessing time using map: 5.444890975952148, chunksize: 4 9801
Multiprocessing time using Pool.map: 5.400001287460327 9801
Multiprocessing time using Pool.map (chunksize=1): 4.698001146316528 9801