In the GNU Parallel tutorial (https://www.gnu.org/software/parallel/parallel_tutorial.html#number-of-simultaneous-jobs) there are three examples:

/usr/bin/time parallel -N0 sleep 1 :::: num128
(same number of jobs as cores)

/usr/bin/time parallel -N0 --jobs 200% sleep 1 :::: num128
(2 jobs per core)

/usr/bin/time parallel -N0 --jobs 0 sleep 1 :::: num128
(as many jobs as possible)

I assume the tutorial is referring to logical cores, not physical cores. My machine is a 13th Gen Intel i7-13700HX (16 cores, 24 logical processors).
num128 contains a newline-separated list of the numbers 1 to 128. The actual values don't matter; what matters is the number of lines, which determines the total number of jobs.
Question
How does option 2 put 2 jobs on 1 core and complete both in 1 second? (The total input requires 3 iterations of job allocation and sleep.)
Shouldn't the second job's sleep wait for the first job's sleep to complete, so that each of the 3 iterations takes 2 seconds, for a total of 6 seconds?
Is parallel running both sleeps in the background on 1 core concurrently, so that their timing overlaps?
What's going on generally when you specify multiple jobs per core with parallel? Does it do multithreading, or does it start all those jobs on the same logical core as background processes?
I wonder if this example of multiple jobs per core only works because multiple sleep timings can overlap when they are started in the background, and whether, if I changed to another command that cannot run concurrently, it would be pointless to start multiple jobs per core as in options 2 and 3.
In option 3, what constrains the upper limit to "as many jobs in parallel as possible"?
Based on these 3 examples, it seems I should always choose option 3 as the fastest. When does this stop being true?
I tried pushing the limits by doubling the input size with /usr/bin/time parallel -N0 --jobs 0 sleep 1 :::: <(cat num128 num128) and it still finished everything within 1 iteration of 1.83 seconds.
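My overlap hypothesis can be checked in plain shell, without parallel at all: background two sleeps and time the wait. This is a minimal sketch (GNU date's %N is assumed for millisecond timing):

```shell
#!/bin/sh
# Two sleeps started as background processes: the OS schedules both,
# and their 1-second waits overlap instead of running back to back.
start=$(date +%s%N)
sleep 1 &
sleep 1 &
wait                                    # block until both finish
end=$(date +%s%N)
elapsed_ms=$(( (end - start) / 1000000 ))
echo "elapsed: ${elapsed_ms} ms"        # ~1000 ms, not ~2000 ms
```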
Summary
I don't understand the work-allocation models of options 2 and 3: why option 2 is faster than I expected; why option 3's timing seems unaffected both by how many more jobs than logical cores there are and by the increased input size; and whether the tutorial's demonstrations are specific to the sleep command and so may not generalize.
If you want to count iterations
Tweak the example commands above by wrapping sleep with echo, and observe the printing cycles:
'echo start job number {#} job slot {%};sleep 1;echo finish job number {#} job slot {%}'
Possibly relevant quirks

parallel --number-of-cores
--> 12 (does not match 16 from system information?)
parallel --number-of-sockets
--> 1
parallel --number-of-threads
--> 24

There are also the options --use-cores-instead-of-threads and --use-sockets-instead-of-threads.
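For comparison, the counts the OS itself reports can be pulled directly. This sketch assumes a Linux system with /proc/cpuinfo (on other systems the fields may be absent):

```shell
#!/bin/sh
# Logical processors as the scheduler sees them
# (what parallel --number-of-threads should report).
logical=$(nproc)
# Distinct (physical id, core id) pairs, i.e. physical cores
# (what parallel --number-of-cores tries to report).
physical=$(awk -F: '/^physical id/ {p=$2} /^core id/ {print p","$2}' \
             /proc/cpuinfo | sort -u | wc -l)
echo "logical: $logical"
echo "physical: $physical"
```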
Answer

The CPU detection is a dark science: I really need more users to send output from lscpu, /sys/devices/system/cpu* and /proc/cpuinfo, together with the expected interpreted values. This explains why parallel --number-of-cores gives 12 instead of the correct 16.
The reason why it may make sense to run -j200% is that some computations do a lot of waiting. sleep 1 is an extreme example of this: there is no problem in running 2 sleep 1 on the same CPU thread in parallel. The same goes for jobs like wget, where you are dependent on network I/O.
If you give it jobs that are computationally heavy (e.g. bzip2 -9), then it will typically not make sense to run more jobs in parallel than you have CPU threads.
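The contrast with sleep can be sketched by pinning two CPU-bound loops to a single logical core (taskset is Linux-specific, and the counting loop is a hypothetical stand-in for bzip2 -9):

```shell
#!/bin/sh
# Two CPU-bound loops pinned to the same logical core must share it,
# so their wall time roughly doubles -- unlike two sleeps, which overlap.
loop='i=0; while [ "$i" -lt 100000 ]; do i=$((i + 1)); done'
start=$(date +%s%N)
taskset -c 0 sh -c "$loop" &
taskset -c 0 sh -c "$loop" &
wait
end=$(date +%s%N)
ms=$(( (end - start) / 1000000 ))
echo "pinned wall time: ${ms} ms"
```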
Technically, GNU Parallel does not start a job "on a core". It simply starts a job; the OS determines which CPU thread to schedule the job on.
You can run many sleep 1 on a single CPU thread because they take up very little CPU time. Thus the limit you are going to hit is not the computational power of the CPU but other limits (such as file handles: GNU Parallel uses 4 per concurrent job).
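Under the 4-handles-per-job figure above, a back-of-the-envelope ceiling can be sketched from the shell's open-file limit (parallel also reserves handles for itself, so treat this purely as an estimate):

```shell
#!/bin/sh
# Per-process open-file limit; note "ulimit -n" may print "unlimited"
# on some systems, in which case this arithmetic does not apply.
max_fds=$(ulimit -n)
# Rough job ceiling if each concurrent job costs ~4 file handles.
est_jobs=$(( max_fds / 4 ))
echo "open-file limit:       $max_fds"
echo "estimated job ceiling: $est_jobs"
```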
I believe the main point of confusion might be that running multiple instances of sleep 1 on the same CPU thread is feasible because these commands don't require much CPU power. However, if you try to run bzip2 -9 in the same manner, you'll find that running more than one job per CPU thread is not practical.