I have a bash script that applies different transformations/mappings to the columns of a TSV file. I am trying to parallelize the transformations using GNU parallel, but my code hangs.
For simplicity, consider cat, the identity mapper (i.e. input -> output), and a TSV file of three columns (generated on the fly using paste and seqs):
n=1000000
map=cat # identity: inp -> out
rm -f tmp.col{1,2}.fifo
mkfifo tmp.col{1,2}.fifo
paste <(seq $n) <(seq $n) <(seq $n) \
| tee >(cut -f1 | $map > tmp.col1.fifo) \
| tee >(cut -f2 | $map > tmp.col2.fifo) \
| cut -f3- \
| paste tmp.col{1,2}.fifo - \
| python -m tqdm > /dev/null
The above code works fine.
NOTE: python -m tqdm > /dev/null just prints the throughput.
Next, we can parallelize the mapping tasks using GNU parallel's --pipe --keep-order arguments. Here is a minimal parallel example that works:
seq 100 | parallel --pipe -k -j4 -N10 'cat && sleep 1'
Now, putting all these together, here is my code that maps the TSV columns in parallel:
n=1000000
map=cat # identity map: inp -> out
rm -f tmp.col{1,2}.fifo
mkfifo tmp.col{1,2}.fifo
paste <(seq $n) <(seq $n) <(seq $n) \
| tee >(cut -f1 | parallel --id jobA --pipe -k -j4 -N1000 "$map" > tmp.col1.fifo) \
| tee >(cut -f2 | parallel --id jobB --pipe -k -j4 -N1000 "$map" > tmp.col2.fifo) \
| cut -f3- \
| paste tmp.col{1,2}.fifo - \
| python -m tqdm > /dev/null
This code was supposed to work; however, it freezes. Why does it freeze, and how can I unfreeze it?
Environment: Linux 5.15.0-116-generic, Ubuntu 22.04.4 LTS on x86_64
It is a race condition with the FIFOs - not GNU Parallel.
Assume this:
| tee >(cut -f1 | $map1 > tmp.col1.fifo) \
| tee >(cut -f2 | $map2 > tmp.col2.fifo) \
| cut -f3- \
| paste tmp.col{1,2}.fifo - \
Assume that $map1 prints very little and $map2 prints a lot.
paste tries to read a line from tmp.col1.fifo, but there is nothing to read, so it blocks. $map2 prints a lot to tmp.col2.fifo, fills the FIFO, and blocks, too. Once $map2 is blocked, the tee feeding it also blocks, so no more data flows down the main pipeline, nothing can ever unblock paste, and the whole thing deadlocks.
You have just been lucky that the race condition did not hit you earlier.
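If you want to see it happen without GNU Parallel involved at all, here is a minimal sketch that provokes the hang. grep -x "$n" is just a stand-in for a mapper that emits almost nothing; it is not your real $map:
n=1000000
rm -f tmp.col{1,2}.fifo
mkfifo tmp.col{1,2}.fifo
# Branch 1 emits (almost) nothing; branch 2 emits every line.
paste <(seq $n) <(seq $n) <(seq $n) \
| tee >(cut -f1 | grep -x "$n" > tmp.col1.fifo) \
| tee >(cut -f2 | cat > tmp.col2.fifo) \
| cut -f3- \
| paste tmp.col{1,2}.fifo - \
| wc -l
# Hangs: paste blocks reading tmp.col1.fifo, cat fills the ~64 KB
# tmp.col2.fifo buffer and blocks, the tees stall, and nothing moves.
# Interrupt with Ctrl-C.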
You can of course use temporary files to solve this, but I have the feeling you are trying to avoid that.
Maybe you can "increase" the size of the FIFO with a tool like mbuffer:
| tee >(cut -f1 | parallel --pipe -k -j4 -N1000 "$map" | mbuffer -q -m6M -b5 > tmp.col1.fifo) \
| tee >(cut -f2 | parallel --pipe -k -j4 -N1000 "$map" | mbuffer -q -m6M -b5 > tmp.col2.fifo) \
| cut -f3- | mbuffer -q -m6M -b5 \
| paste tmp.col{1,2}.fifo - \
| python -m tqdm > /dev/null
But unless you know the nature of your data is not going to change, then this is a fragile solution that just kicks the can a bit further down the road.
How about this instead?
n=1000000
map=cat # identity map: inp -> out
rm -f tmp.col{1,2,3,4}.fifo
mkfifo tmp.col{1,2,3,4}.fifo
paste <(seq $n) <(seq $n) <(seq $n) | cut -f1 | parallel --pipe -k -j4 -N1000 "$map" > tmp.col1.fifo &
paste <(seq $n) <(seq $n) <(seq $n) | cut -f2 | parallel --pipe -k -j4 -N1000 "$map" > tmp.col2.fifo &
paste <(seq $n) <(seq $n) <(seq $n) | cut -f3 > tmp.col3.fifo &
paste <(seq $n) <(seq $n) <(seq $n) > tmp.col4.fifo &
paste tmp.col{1,2,3,4}.fifo | python -m tqdm > /dev/null
You will run a few more pastes, but if CPU is not a problem, then this should give you no race conditions.
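If your real input is a TSV file on disk rather than something generated on the fly, the same idea works without the extra pastes: just read the file once per column. A sketch, assuming the file is called data.tsv (a name I made up, not one from your question):
map=cat   # your real mapper
rm -f tmp.col{1,2,3}.fifo
mkfifo tmp.col{1,2,3}.fifo
cut -f1 data.tsv | parallel --pipe -k -j4 -N1000 "$map" > tmp.col1.fifo &
cut -f2 data.tsv | parallel --pipe -k -j4 -N1000 "$map" > tmp.col2.fifo &
cut -f3- data.tsv > tmp.col3.fifo &
paste tmp.col{1,2,3}.fifo | python -m tqdm > /dev/null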
(Also: --id (aka. --semaphore-name) is not used with --pipe but only with --semaphore. See https://www.gnu.org/software/parallel/parallel_options_map.pdf)
(Also also: If you do not need exactly 1000 entries (-N1000) then --block is faster.)
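For example, one of the branches above could be chunked by size instead of by record count. A sketch, where 10M is just an illustrative block size, not a tuned value:
paste <(seq $n) <(seq $n) <(seq $n) | cut -f1 | parallel --pipe -k -j4 --block 10M "$map" > tmp.col1.fifo &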