I'm running a GridSearchCV optimization inside a parallelized function. The pseudocode looks like this:
from tqdm.contrib.concurrent import process_map
from sklearn.model_selection import GridSearchCV

def main():
    results = process_map(func, it, max_workers=5)
    # We never reach here with n_jobs > 1 in GridSearchCV

def func(it):
    ...
    grid_search = GridSearchCV(..., n_jobs=5)
    ...
    return result

if __name__ == "__main__":
    main()
If n_jobs > 1, the script hangs indefinitely when returning results and never proceeds further (although all the func tasks have completed). If I set n_jobs=1, everything works fine.
I think (but I'm not sure) this is related to the fact that process_map uses a different spawn mechanism from GridSearchCV (which internally uses joblib, if I understand correctly).
Since the heaviest part of this algorithm is the grid search, is there any way to keep that parallelism together with the outer parallelism layer?
> I think (but I'm not sure) this is related to the fact that process_map uses a different spawn mechanism from GridSearchCV
Yes, both process_map and GridSearchCV spawn their own pools of worker processes, and nesting the two process-based mechanisms is what causes the collision (and hence the hang).
One way around this is to use threads for the outer layer, via Python's built-in ThreadPoolExecutor:
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm

def main():
    with ThreadPoolExecutor(max_workers=5) as executor:
        results = list(tqdm(executor.map(func, it), total=len(it)))
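For concreteness, here is a minimal end-to-end sketch of that pattern. The dataset, estimator, and parameter grids are toy stand-ins for your actual workload:

from concurrent.futures import ThreadPoolExecutor

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from tqdm import tqdm

# Toy stand-ins for the real data and search space.
X, y = make_classification(n_samples=500, random_state=0)
it = [{"n_estimators": [10, 50]}, {"max_depth": [2, 5]}]

def func(param_grid):
    # Inner layer: GridSearchCV parallelizes across processes via joblib.
    grid_search = GridSearchCV(RandomForestClassifier(random_state=0),
                               param_grid, n_jobs=5)
    grid_search.fit(X, y)
    return grid_search.best_params_, grid_search.best_score_

def main():
    # Outer layer: threads instead of processes, so the two parallelism
    # mechanisms no longer collide.
    with ThreadPoolExecutor(max_workers=5) as executor:
        results = list(tqdm(executor.map(func, it), total=len(it)))
    return results

if __name__ == "__main__":
    print(main())

Threads work well here because each outer worker spends nearly all of its time blocked waiting on GridSearchCV's child processes, so the GIL is not a bottleneck.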
Or configure joblib (which you already have, since it's a dependency of scikit-learn) to use threading. Note that importing parallel_backend from sklearn.utils is deprecated now; import it from joblib directly instead.
from joblib import parallel_backend

def func(it):
    with parallel_backend('threading', n_jobs=5):
        grid_search = GridSearchCV(...)
        # rest of your code
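On recent joblib versions (1.3+), parallel_config is the preferred replacement for parallel_backend. A minimal self-contained sketch, again with toy data and a hypothetical parameter grid standing in for yours:

from joblib import parallel_config
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)  # toy stand-in data

def func(param_grid):
    # Everything inside this block, including GridSearchCV's internal
    # joblib calls, runs on the threading backend with 5 workers.
    with parallel_config(backend="threading", n_jobs=5):
        grid_search = GridSearchCV(RandomForestClassifier(random_state=0),
                                   param_grid)
        grid_search.fit(X, y)
    return grid_search.best_params_

print(func({"n_estimators": [5, 10]}))

The threading backend avoids spawning nested processes entirely; how much speedup it buys depends on whether the underlying estimator releases the GIL during fitting (most NumPy-backed estimators do).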