Tags: bash, gnu-parallel, flock

Why is my counter that uses flock randomly resetting?


I'm trying to understand why I can't seem to successfully implement a counter that uses flock to avoid race conditions.

I copied the flock code from this SO answer.

The counter seems to randomly reset even though I don't see why it should. counter inc should atomically count up forever.

My code is meant to process multiple files at once using GNU parallel, while updating a status line, including an incrementing counter, for each file processed.

The counter is saved to a temp file in RAM created by mktemp.

I've cut it down here so it's just an infinite loop that should count forever, but it keeps resetting.

Can anyone explain to me why the counter is resetting sometimes?

#!/bin/bash

clear_line () {
    echo -ne "\r\033[K"
}

counter () {
  {
    flock -s 200
    read numfiles < "$countfile"
    if [ "$1" = "inc" ]; then
        ((numfiles++))
    fi
    if [ "$1" = "rst" ]; then
        numfiles=0
    fi
    if [ "$1" = "get" ]; then
        echo "$numfiles"
    fi
    echo "$numfiles" > "$countfile"
  } 200> "$countlockfile"
}

process () {
    counter inc
    currfileno=$(counter get)

    clear_line
    echo -n "$currfileno"
}

# from https://unix.stackexchange.com/a/29918/52247
tmpdir=
cleanup () {
  trap - EXIT
  if [ -n "$tmpdir" ] ; then rm -rf "$tmpdir"; fi
  if [ -n "$1" ]; then trap - $1; kill -$1 $$; fi
}
tmpdir=$(mktemp -dt "$(basename $0).XXXXXXXX" --tmpdir=/run/user/$(id -u)) || exit 1
countfile=$(mktemp -t "counter.XXXXXXXX" --tmpdir=$tmpdir) || exit 1
countlockfile=$(mktemp -t "countlockfile.XXXXXXXX" --tmpdir=$tmpdir) || exit 1

trap 'cleanup' EXIT
trap 'cleanup HUP' HUP
trap 'cleanup TERM' TERM
trap 'cleanup INT' INT

export -f clear_line
export -f process
export -f counter
export countfile
export countlockfile

counter rst
while :
do
    echo whatever
done | parallel process

Solution

  • The script has a data race. GNU parallel invokes process once per job, with many jobs running concurrently, so between the calls to counter inc and counter get inside your function, another job may increment or reset the count.

    process () {
        counter inc                # another job can run between these two calls...
        currfileno=$(counter get)  # ...so this may not be the value you just incremented to
        
        clear_line
        echo -n "$currfileno"
    }
    

    You need an exclusive lock (flock -x). While an exclusive lock is held, no other process can acquire the lock at all, shared or exclusive, so nobody else can enter the critical section to read or write the file.
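
    The fix is a one-line change inside counter, assuming everything else in the script stays as posted:

    flock -x 200   # was: flock -s 200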

    Your counter function both reads and writes the file inside the same critical section, but a shared lock (flock -s) lets any number of processes hold the lock at once. They all read the same value, increment their own copies, and write back, so increments get lost. Worse, the redirection echo "$numfiles" > "$countfile" truncates the file before the new value lands in it; a process that reads in that window sees an empty file, ((numfiles++)) treats the empty value as 0, and the counter appears to reset to 1.
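
    To also close the window between counter inc and counter get, you can print the new value from inside the same critical section. A minimal sketch of that idea, keeping your $countfile and $countlockfile variables; the incget mode is a name I'm introducing here, not something from your script:

    counter () {
      {
        flock -x 200                    # exclusive lock around the whole read-modify-write
        read numfiles < "$countfile"
        numfiles=${numfiles:-0}         # an empty or fresh file counts as 0
        case "$1" in
            inc)    ((numfiles++)) ;;
            rst)    numfiles=0 ;;
            get)    echo "$numfiles" ;;
            incget) ((numfiles++)); echo "$numfiles" ;;  # hypothetical mode: bump and report in one step
        esac
        echo "$numfiles" > "$countfile"
      } 200> "$countlockfile"
    }

    process () {
        currfileno=$(counter incget)    # one locked section, so no other job can slip in between
        clear_line
        echo -n "$currfileno"
    }

    With this, the truncate-then-write of $countfile can no longer interleave with another job's read, and process only takes the lock once per file.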