performancebashshellhpcsupercomputers

Efficiently computing floating-point arithmetic hundreds of thousands of times in Bash


Background

I work for a research institute that studies storm surges computationally, and am attempting to automate some of the HPC commands using Bash. Currently, the process is we download the data from NOAA and create the command file manually, line-by-line, inputting the location of each file along with a time for the program to read the data from that file and a wind magnification factor. There are hundreds of these data files in each download NOAA produces, which come out every 6 hours or so when a storm is in progress. This means that much of our time during a storm is spent making these command files.

Problem

I am limited in the tools I can use to automate this process because I simply have a user account and a monthly allotment of time on the supercomputers; I do not have the privilege to install new software on them. Plus, some of them are Crays, some are IBMs, some are HPs, and so forth. There isn't a consistent operating system between them; the only similarity is they are all Unix-based. So I have at my disposal tools like Bash, Perl, awk, and Python, but not necessarily tools like csh, ksh, zsh, bc, et cetera:

$ bc
-bash: bc: command not found

Further, my lead scientist has requested that all of the code I write for him be in Bash because he understands it, with minimal calls to external programs for things Bash cannot do. For example, it cannot do floating point arithmetic, and I need to be able to add floats. I can call Perl from within Bash, but that's slow:

$ time perl -E 'printf("%.2f", 360.00 + 0.25)'
360.25
real    0m0.052s
user    0m0.015s
sys     0m0.015s

1/20th of a second doesn't seem like a long time, but when I have to make this call 100 times in a single file, that equates to about 5 seconds to process one file. That isn't so bad when we are only making one of these every 6 hours. However, if this work is abstracted to a larger assignment, one where we point 1,000 synthetic storms at the Atlantic basin at one time in order to study what could have happened had the storm been stronger or taken a different path, 5 seconds quickly grows to more than an hour just to process text files. When you are billed by the hour, this poses a problem.

Question

What is a good way to speed this up? I currently have this for loop in the script (the one that takes 5 seconds to run):

for FORECAST in $DIRNAME; do
    echo $HOURCOUNT"  "$WINDMAG"  "${FORECAST##*/} >> $FILENAME;
    HOURCOUNT=$(echo "$HOURCOUNT $INCREMENT" | awk '{printf "%.2f", $1 + $2}');
done

I know a single call to awk or Perl to loop through the data files would be a hundred times faster than calling either once for each file in the directory, and that these languages can easily open a file and write to it, but the problem I am having is getting data back and forth. I have found a lot of resources on these three languages alone (awk, Perl, Python), but haven't been able to find as much on embedding them in a Bash script. The closest I have been able to come is to make this shell of an awk command:

awk -v HOURCOUNT="$HOURCOUNT" -v INCREMENT="$INCREMENT" -v WINDMAG="$WINDMAG" -v DIRNAME="$DIRNAME" -v FILENAME="$FILENAME" 'BEGIN{ for (FORECAST in DIRNAME) do
    ...
}'

But I am not certain that this is correct syntax, and if it is, if it's the best way to go about this, or if it will even work at all. I have been hitting my head against the wall for a few days now and decided to ask the internet before I plug on.


Solution

  • Bash is very capable as long as you have the ability you need. For floating point, you basically have two options, either bc (which at least on the box you show isn't installed [which is kind of hard to believe]) or calc. calc-2.12.4.13.tar.bz2

    Either package is flexible and very capable floating-point programs that integrate well with bash. Since the powers that be have a preference for bash, I would investigate installing either bc or calc. (job security is a good thing)

    If your superiors can be convinced to allow either perl or python, then either will do. If you have never programmed in either, both will have a learning curve, python slightly more so than perl. If you superiors there can read bash, then translating perl would be much easier to digest for them than python.

    This is a fair outline of the options you have given your situation as you've explained it. Regardless of your choice, the task for you should not be that daunting in any of the languages. Just drop a line back when you get stuck.