I'm wondering which is better to use with GridSearchCV( ..., n_jobs = ... ) to pick the best parameter set for a model: n_jobs = -1, or n_jobs with a big number, like n_jobs = 30?
Based on the scikit-learn documentation, n_jobs = -1 means that the computation will be dispatched on all the CPUs of the computer.
On my PC I have an Intel i3 CPU, which has 2 cores and 4 threads. Does that mean that if I set n_jobs = -1, it will implicitly be equal to n_jobs = 2?
... does that mean if I set n_jobs = -1, implicitly it will be equal to n_jobs = 2?
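Python ( joblib, which scikit-learn uses inside GridSearchCV() ) detects the number of CPUs on which it is reasonable to schedule concurrent ( independent ) processes whenever a request is made with an n_jobs = -1 setting. The detected figure is normally the number of logical CPUs ( hardware threads ), so on a 2-core / 4-thread i3 one should expect 4 workers rather than 2.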
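How a negative n_jobs request maps onto a worker count can be sketched in a few lines. The helper below is hypothetical ( resolve_n_jobs is not a scikit-learn or joblib function ); it only mirrors the convention joblib documents, where -1 means all detected CPUs and -2 means all but one, and it assumes a positive request is taken as given.

```python
import os


def resolve_n_jobs(n_jobs, n_cpus=None):
    """Translate an n_jobs request into a worker count.

    Hypothetical helper for illustration only. It follows the convention
    documented by joblib: positive values are taken as given, -1 means
    all CPUs, -2 means all CPUs but one, and so on.
    """
    if n_jobs == 0:
        raise ValueError("n_jobs == 0 has no meaning")
    if n_cpus is None:
        # os.cpu_count() reports logical CPUs and may return None
        n_cpus = os.cpu_count() or 1
    if n_jobs < 0:
        # Never resolve to fewer than one worker
        return max(n_cpus + 1 + n_jobs, 1)
    return n_jobs


if __name__ == "__main__":
    # n_cpus=4 stands in for the 2-core / 4-thread i3 from the question
    for request in (-1, -2, 2, 30):
        workers = resolve_n_jobs(request, n_cpus=4)
        print("n_jobs = {0:3d} -> {1:d} worker(s)".format(request, workers))
```

Under these assumptions a request for n_jobs = 30 is not trimmed down to the CPU count, which is why a "big number" can end up oversubscribing a small machine.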
In some virtualised-machine cases, which can synthetically emulate CPUs / cores, the results are not as trivial as in your known Intel i3 case.
If in doubt, one can test this on a trivialised case ( on an indeed small data-set, not the full-blown model-space search ... ) and let the results prove it.
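import psutil  # third-party package: pip install psutil

# Physical cores only ( hyper-threading siblings are not counted ); may be None if undetermined
print( "{0:17s}{1:} CPUs PHYSICAL".format( "psutil:", psutil.cpu_count( logical = False ) ) )
# Logical CPUs ( hardware threads ) as seen by the operating system
print( "{0:17s}{1:} CPUs LOGICAL".format( "psutil:", psutil.cpu_count( logical = True ) ) )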
...
A similar host-platform "self-detection" may report more details for different systems / settings:
'''
sys: linux
3.6.1 (default, Jun 27 2017, 14:35:15) .. [GCC 7.1.1 20170622 (Red Hat 7.1.1-3)]
multiprocessing: 1 CPU(s)
psutil: 1 CPUs PHYSICAL
psutil: 1 CPUs LOGICAL
psutil: psutil.cpu_freq( per_cpu = True ) not able to report. ?( v5.1.0+ )
psutil: 5.0.1
psutil: psutil.cpu_times( per_cpu = True ) not able to report. ?( vX.Y.Z+ )
psutil: 5.0.1
psutil: svmem(total=1039192064, available=257290240, percent=75.2, used=641396736, free=190361600, active=581107712, inactive=140537856, buffers=12210176, cached=195223552, shared=32768)
numexpr: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ModuleNotFoundError: No module named 'numexpr'.
joblib: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ModuleNotFoundError: No module named 'joblib'.
sklearn/joblib: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ModuleNotFoundError: No module named 'sklearn.externals.joblib'
'''
Or
''' [i5]
>>> numexpr.print_versions()
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Numexpr version: 2.5
NumPy version: 1.10.4
Python version: 2.7.13 |Anaconda 4.0.0 (32-bit)| (default, May 11 2017, 14:07:41) [MSC v.1500 32 bit (Intel)]
AMD/Intel CPU? True
VML available? True
VML/MKL version: Intel(R) Math Kernel Library Version 11.3.1 Product Build 20151021 for 32-bit applications
Number of threads used by default: 4 (out of 4 detected cores)
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
'''
... which is better to use with GridSearchCV to pick the best parameter set for a model, n_jobs = -1 or n_jobs with a big number like n_jobs = 30?
The scikit-learn tools ( and many others that follow this practice ) spawn, whenever an n_jobs directive is used, the requested number of concurrent process-instances ( so as to escape from the shared GIL-lock stepping - read more on this elsewhere if interested in details ).
This process-instantiation is not cost-free ( both time-wise, i.e. spending a respectable amount of [TIME]-domain costs, and space-wise, i.e. spending at least n_jobs-times the RAM-allocation of a single python process-instance in the [SPACE]-domain ).
Given this, your fight is a battle against a double-edged sword.
An attempt to "underbook" the CPU will leave ( some ) CPU-cores possibly idling.
An attempt to "overbook" the RAM-space will make your performance worse than expected, as virtual memory will force the operating system into swapping, which turns your Machine Learning-scaled data-access times from ~ 10+ [ns] into ~ 10+ [ms], more than 100,000 x slower, which is hardly what one will be pleased with.
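The two failure modes above suggest a rough upper bound on n_jobs: no more workers than CPUs, and no more workers than fit into free RAM. The sketch below is a back-of-the-envelope estimate only; ram_bounded_n_jobs is a hypothetical helper, and the per-process footprint ( 100 MiB here ) is an assumed figure that one would have to measure for a real model and data-set.

```python
def ram_bounded_n_jobs(available_bytes, per_process_bytes, n_cpus):
    """Return a worker count that neither exceeds the CPU count nor the free RAM.

    Hypothetical helper for illustration: it assumes each worker process
    needs roughly per_process_bytes of RAM, which is a simplification.
    """
    if per_process_bytes <= 0:
        raise ValueError("per_process_bytes must be positive")
    # How many whole process footprints fit into the available RAM
    fits_in_ram = available_bytes // per_process_bytes
    # Cap by the CPU count, but always keep at least one worker
    return max(1, min(n_cpus, fits_in_ram))


if __name__ == "__main__":
    available = 257290240            # the "available" figure from the svmem report above
    per_process = 100 * 1024 ** 2    # assumed 100 MiB per worker ( hypothetical )
    print(ram_bounded_n_jobs(available, per_process, n_cpus=4))
```

With psutil installed, psutil.virtual_memory().available would be one way to obtain the first argument at run time.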
The overall effect of n_jobs = a_reasonable_amount_of_processes is subject to Amdahl's Law ( the re-formulated one, not the overhead-naive version ), so there will be a practical optimality peak ( a maximum ) in how many CPU-cores help to improve one's processing, beyond which the overhead costs ( sketched for both the [TIME]- and [SPACE]-domains above ) will actually erode any expected positive impact.
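The existence of such a peak can be illustrated with a toy model. The sketch below is not the exact re-formulated law referred to above; it simply assumes that each extra worker adds a fixed overhead ( expressed as a fraction of the single-process run time ), and both the function names and the numbers are hypothetical.

```python
def speedup(parallel_fraction, n_workers, overhead_per_worker=0.0):
    """Toy overhead-aware speedup: Amdahl's Law plus a cost that grows with n_workers.

    overhead_per_worker is an assumed add-on cost per worker, given as a
    fraction of the original single-process run time.
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction
                  + parallel_fraction / n_workers
                  + overhead_per_worker * n_workers)


def best_n_workers(parallel_fraction, overhead_per_worker, max_workers=64):
    """Return the worker count in 1..max_workers with the highest modelled speedup."""
    return max(range(1, max_workers + 1),
               key=lambda n: speedup(parallel_fraction, n, overhead_per_worker))


if __name__ == "__main__":
    p, o = 0.95, 0.005  # assumed: 95 % parallelisable, 0.5 % overhead per worker
    peak = best_n_workers(p, o)
    print("modelled peak at n_workers =", peak)
    print("speedup at peak:", round(speedup(p, peak, o), 2))
    print("speedup at 30  :", round(speedup(p, 30, o), 2))
```

In this model, asking for more workers than the peak makes the run slower again, which is the behaviour one should expect from an oversized n_jobs.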
Having used RandomForestRegressor() on indeed large data-sets in production, I can tell you the [SPACE]-domain is your worst enemy in trying to grow n_jobs any further, and no system-level tuning will ever overcome this boundary ( so more and more ultra-low-latency RAM and more and more ( real ) CPU-cores are the only practical recipe for going into any larger n_jobs computing plans ).