bash

Parallelize Bash script with maximum number of processes


Let's say I have a loop in Bash:

for foo in $(some-command)
do
   do-something "$foo"
done

do-something is CPU-bound and I have a nice shiny 4-core processor. I'd like to be able to run up to 4 do-somethings at once.

The naive approach seems to be:

for foo in $(some-command)
do
   do-something "$foo" &
done

This will run all the do-somethings at once, but there are a couple of downsides. The main one is that do-something may also do significant I/O, and running everything at once might slow that down. The other problem is that this code block returns immediately, so there's no way to do other work once all the do-somethings have finished.
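
(Bash's built-in wait at least fixes the second problem: with no arguments it blocks until every background job has exited, as in the sketch below. It does nothing to cap how many jobs run at once, though.)

for foo in $(some-command)
do
   do-something "$foo" &
done

# wait with no arguments blocks until all background jobs have exited
wait
echo "all do-somethings finished; other work can run now"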

How would you write this loop so there are always X do-somethings running at once?


Solution

  • Depending on what you want to do, xargs can also help (here: converting documents with pdf2ps; a sketch adapting the same idea to the question's loop follows the excerpt below):

    # count the CPUs exposed under sysfs (on GNU systems, nproc
    # is a simpler alternative)
    cpus=$( ls -d /sys/devices/system/cpu/cpu[[:digit:]]* | wc -w )

    find . -name \*.pdf | xargs --max-args=1 --max-procs="$cpus" pdf2ps
    

    From the docs:

    --max-procs=max-procs
    -P max-procs
           Run up to max-procs processes at a time; the default is 1.
           If max-procs is 0, xargs will run as many processes as
           possible at a time. Use the -n option with -P; otherwise
           chances are that only one exec will be done.
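
    Applied to the loop from the question, the same idea looks something
    like this sketch. It assumes some-command prints one item per line
    and that do-something takes a single argument; -d '\n' (GNU xargs)
    keeps items containing spaces intact.

    cpus=$(nproc)   # or the sysfs count from above

    # run at most $cpus copies of do-something at a time, one item each;
    # the pipeline doesn't return until every invocation has finished
    some-command | xargs -d '\n' --max-args=1 --max-procs="$cpus" do-something

    echo "all do-somethings finished"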