python, scikit-learn, multiprocessing, joblib, gridsearchcv

Nested parallelism with GridSearchCV causes infinite hang


I'm running a GridSearchCV optimization inside a parallelized function. The pseudocode looks like this:

from tqdm.contrib.concurrent import process_map
from sklearn.model_selection import GridSearchCV

def main():
    results = process_map(func, it, max_workers=5)
    # We never reach here with n_jobs > 1 in GridSearch

def func(it):
    ...
    grid_search = GridSearchCV(..., n_jobs=5)
    ...
    return result

if __name__ == "__main__":
    main()

If n_jobs > 1, the script hangs indefinitely when returning the results and never proceeds further (even though all the func tasks have completed). If I set n_jobs=1, everything works fine.

I think (but I'm not sure) this is related to the fact that process_map uses a different spawn mechanism from GridSearchCV (which internally uses joblib, if I understand correctly).

As the heaviest part of this algorithm is the grid search, is there any way to keep that parallelism together with the outer parallelism layer?


Solution

  • I think (but I'm not sure) this is related to the fact that process_map uses a different spawn mechanism from GridSearchCV

    Yes. Both process_map and GridSearchCV start their own pool of worker processes (process_map through a multiprocessing-based pool, GridSearchCV through joblib), and nesting one process pool inside the other like this can deadlock.

    You can switch the outer layer to threads with Python's built-in ThreadPoolExecutor:

    from concurrent.futures import ThreadPoolExecutor
    from tqdm import tqdm
    
    def main():
        # Threads in the outer layer: only GridSearchCV starts worker processes
        with ThreadPoolExecutor(max_workers=5) as executor:
            results = list(tqdm(executor.map(func, it), total=len(it)))
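
    This keeps n_jobs=5 inside GridSearchCV: with threads in the outer layer, only the grid search spawns worker processes (through joblib), so the two process pools no longer collide. The outer threads mostly just wait on the grid searches, so the GIL should not be a bottleneck here.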
    

    Or configure joblib (which you already have, since it is a dependency of scikit-learn) to run the grid search itself on a threading backend. Note that the sklearn.utils.parallel_backend import shown below is deprecated in recent scikit-learn releases; a sketch using joblib.parallel_backend directly follows the snippet.

    from sklearn.utils import parallel_backend
    
    def func(it):
        with parallel_backend('threading', n_jobs=5):
            grid_search = GridSearchCV(...)
            grid_search.fit(...)  # the fit must run inside the context manager
            # rest of your code
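
    If you are on a newer scikit-learn, the same configuration can be done by importing parallel_backend from joblib directly. Below is a minimal, self-contained sketch of that variant; the RandomForestClassifier, the synthetic data from make_classification, and the param_grid are placeholder choices, not part of the original question:

    from joblib import parallel_backend
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    def func(it):
        # Placeholder data and estimator, only to make the sketch runnable
        X, y = make_classification(n_samples=200, n_features=10, random_state=0)
        param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

        # Run the grid search with a thread-based joblib backend;
        # n_jobs is inherited from the parallel_backend context
        with parallel_backend("threading", n_jobs=5):
            grid_search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid)
            grid_search.fit(X, y)
        return grid_search.best_params_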