linux bash gnu-parallel

GNU parallel freezes


I have a bash script that applies different transformations/mappings to the columns of a TSV file. I am trying to parallelize the transformations using GNU parallel, but my code hangs.

For simplicity, consider cat, the identity mapper (i.e. input -> output), and a TSV file of three columns (generated on the fly using paste and seq):

n=1000000
map=cat    # identity: inp -> out

rm -f tmp.col{1,2}.fifo
mkfifo tmp.col{1,2}.fifo
paste <(seq $n) <(seq $n) <(seq $n) \
    | tee >(cut -f1 | $map > tmp.col1.fifo) \
    | tee >(cut -f2 | $map > tmp.col2.fifo) \
    | cut -f3- \
    | paste tmp.col{1,2}.fifo - \
    | python -m tqdm > /dev/null

The above code works fine.

NOTE: python -m tqdm > /dev/null prints the speed
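
As a sanity check on the tee/FIFO split itself, here is a minimal sketch that confirms the merged output is byte-identical to the input when the mapper is cat. The function name roundtrip_check and the use of a temporary directory are my own assumptions, not part of the script above, and it is only meant for small n.

```bash
#!/usr/bin/env bash
# Sketch: split three columns with tee + FIFOs, map columns 1 and 2 with cat,
# merge them again and compare the result with the original input.
# roundtrip_check is a hypothetical helper name.
roundtrip_check() {
    local n=$1 dir status
    dir=$(mktemp -d)
    mkfifo "$dir/col1.fifo" "$dir/col2.fifo"

    # Keep a copy of the input so the merged output can be compared with it.
    paste <(seq "$n") <(seq "$n") <(seq "$n") > "$dir/expected.tsv"

    tee >(cut -f1 | cat > "$dir/col1.fifo") < "$dir/expected.tsv" \
        | tee >(cut -f2 | cat > "$dir/col2.fifo") \
        | cut -f3- \
        | paste "$dir/col1.fifo" "$dir/col2.fifo" - \
        | cmp -s - "$dir/expected.tsv"
    status=$?    # exit status of cmp: 0 means output and input are identical

    rm -rf "$dir"
    return "$status"
}
```

Calling roundtrip_check 10000 returns 0 when the identity mapping reproduces the input exactly.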

Next, we can parallelize the mapping tasks using GNU parallel's --pipe and --keep-order (-k) options. Here is a minimal parallel example that works:

seq 100 | parallel --pipe -k -j4 -N10 'cat && sleep 1'
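
To check that -k really preserves the input order across chunks, a small sketch like the following compares the output of the same command (without the sleep) to the input. It assumes GNU parallel is the parallel found on PATH; the function name check_keep_order is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: verify that `parallel --pipe -k` emits chunks in input order.
# check_keep_order is a hypothetical helper name; requires GNU parallel.
check_keep_order() {
    local n=$1 expected status
    expected=$(mktemp)
    seq "$n" > "$expected"

    # Chunks of 10 lines are mapped by up to 4 concurrent cat jobs;
    # -k makes parallel print the chunks in the order they were read.
    seq "$n" | parallel --pipe -k -j4 -N10 cat | cmp -s - "$expected"
    status=$?    # exit status of cmp: 0 means the order was preserved

    rm -f "$expected"
    return "$status"
}
```

check_keep_order 100 returns 0 if the 100 lines come back in their original order.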

Putting all of this together, here is my code that maps the TSV columns in parallel:

n=1000000
map=cat   # identity map: inp -> out
rm -f tmp.col{1,2}.fifo
mkfifo tmp.col{1,2}.fifo
paste <(seq $n) <(seq $n) <(seq $n) \
  | tee >(cut -f1 | parallel --id jobA --pipe -k -j4 -N1000 "$map" > tmp.col1.fifo) \
  | tee >(cut -f2 | parallel --id jobB --pipe -k -j4 -N1000 "$map" > tmp.col2.fifo) \
  | cut -f3- \
  | paste tmp.col{1,2}.fifo - \
  | python -m tqdm > /dev/null

I expected this code to work, but it freezes. Why does it freeze, and how can I fix it?

Environment: Linux 5.15.0-116-generic, Ubuntu 22.04.4 LTS on x86_64


Solution

  • It is a race condition with the FIFOs (it ends in a deadlock), not a problem in GNU Parallel

    Assume this:

    | tee >(cut -f1 | $map1 > tmp.col1.fifo) \
    | tee >(cut -f2 | $map2 > tmp.col2.fifo) \
    | cut -f3- \
    | paste tmp.col{1,2}.fifo - \
    

    Assume that $map1 prints very little and $map2 prints a lot.

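    paste tries to read a line from tmp.col1.fifo, but there is nothing to read, so it blocks. $map2 prints a lot to tmp.col2.fifo and fills the FIFO, so it blocks, too. Once $map2 is blocked, the tee feeding it cannot write any further, which stalls the whole upstream pipeline, so $map1 never receives the input it needs to produce the line paste is waiting for.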

    You have just been lucky that the race condition did not hit you earlier.
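
To see the stall without GNU Parallel at all, here is a minimal sketch. The assumptions are mine, not from the question: sort -n stands in for a mapper that prints nothing until it has read all of its input, the function name deadlock_demo is hypothetical, and the pipeline runs under timeout (from coreutils) so that a hang ends after 3 seconds with exit status 124.

```bash
#!/usr/bin/env bash
# Sketch: reproduce the stall with coreutils only.
# Assumption: `sort -n` plays the role of a mapper that emits nothing until EOF,
# while `cat` plays the role of a mapper that emits output immediately.
# deadlock_demo is a hypothetical helper name.
deadlock_demo() {
    local n=$1 dir status
    dir=$(mktemp -d)
    mkfifo "$dir/col1.fifo" "$dir/col2.fifo"

    # timeout kills the whole pipeline if it has not finished after 3 seconds
    # and then exits with status 124.
    timeout 3 bash -c '
        n=$1; dir=$2
        paste <(seq "$n") <(seq "$n") <(seq "$n") \
          | tee >(cut -f1 | sort -n > "$dir/col1.fifo") \
          | tee >(cut -f2 | cat     > "$dir/col2.fifo") \
          | cut -f3- \
          | paste "$dir/col1.fifo" "$dir/col2.fifo" - \
          | wc -l
    ' _ "$n" "$dir"
    status=$?

    rm -rf "$dir"
    return "$status"
}
```

With a small input such as deadlock_demo 100, everything fits into the pipe buffers and the line count is printed. With deadlock_demo 1000000, the buffers for columns 2 and 3 fill up while paste is still waiting on column 1, so the pipeline should hang until timeout ends it.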

    You can of course use temporary files to solve this, but I have the feeling you are trying to avoid that.

    Maybe you can "increase" the size of the FIFO with a tool like mbuffer:

      | tee >(cut -f1 | parallel --pipe -k -j4 -N1000 "$map" | mbuffer -q -m6M -b5 > tmp.col1.fifo) \
      | tee >(cut -f2 | parallel --pipe -k -j4 -N1000 "$map" | mbuffer -q -m6M -b5 > tmp.col2.fifo) \
      | cut -f3- | mbuffer -q -m6M -b5 \
      | paste tmp.col{1,2}.fifo - \
      | python -m tqdm > /dev/null
    

    But unless you know that the nature of your data is not going to change, this is a fragile solution that just kicks the can a bit further down the road.

    How about this instead?

    n=1000000
    map=cat   # identity map: inp -> out
    rm -f tmp.col{1,2,3,4}.fifo
    mkfifo tmp.col{1,2,3,4}.fifo
    # Each producer regenerates the input on its own, so no writer can stall another.
    paste <(seq $n) <(seq $n) <(seq $n) | cut -f1 | parallel --pipe -k -j4 -N1000 "$map" > tmp.col1.fifo &
    paste <(seq $n) <(seq $n) <(seq $n) | cut -f2 | parallel --pipe -k -j4 -N1000 "$map" > tmp.col2.fifo &
    paste <(seq $n) <(seq $n) <(seq $n) | cut -f3 > tmp.col3.fifo &
    # tmp.col4.fifo carries the complete, unmodified rows.
    paste <(seq $n) <(seq $n) <(seq $n) > tmp.col4.fifo &
    paste tmp.col{1,2,3,4}.fifo | python -m tqdm > /dev/null
    

    You will run a few more pastes, but if CPU is not a bottleneck, this should give you no race conditions, because no producer depends on another producer's FIFO being drained.
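
If the input is a regular file rather than something generated on the fly, the same idea can be sketched as below. This is a variant under my own assumptions, not the exact commands above: it uses three FIFOs with cut -f3- for the remaining columns, the function name independent_merge is hypothetical, and sort -n again stands in for a mapper that buffers all of its input (it only acts as the identity here because the test data is already sorted numerically).

```bash
#!/usr/bin/env bash
# Sketch: every column stream reads the input file independently,
# so a slow or buffering mapper cannot block the other streams.
# independent_merge is a hypothetical helper name.
independent_merge() {
    local input=$1 dir status
    dir=$(mktemp -d)
    mkfifo "$dir/col1.fifo" "$dir/col2.fifo" "$dir/col3.fifo"

    # Assumption: `sort -n` emits nothing until EOF; `cat` streams immediately.
    cut -f1  "$input" | sort -n > "$dir/col1.fifo" &
    cut -f2  "$input" | cat     > "$dir/col2.fifo" &
    cut -f3- "$input"           > "$dir/col3.fifo" &

    # paste reads one line from each FIFO in turn; a writer whose FIFO is full
    # simply waits until paste catches up, without affecting the others.
    paste "$dir/col1.fifo" "$dir/col2.fifo" "$dir/col3.fifo"
    status=$?

    wait             # reap the background writers
    rm -rf "$dir"
    return "$status"
}
```

For example, independent_merge input.tsv | cmp - input.tsv (with input.tsv a placeholder for your file) should report no difference for numerically sorted columns, even with the buffering mapper that stalls the tee-based pipeline.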

    (Also: --id (a.k.a. --semaphore-name) is used only with --semaphore, not with --pipe. See https://www.gnu.org/software/parallel/parallel_options_map.pdf)

    (Also also: if you do not need exactly 1000 records per job (-N1000), then --block, which splits the input by size instead of by record count, is faster.)