Efficiently computing floating-point arithmetic hundreds of thousands of times in Bash

Background

I work for a research institute that studies storm surges computationally, and I am attempting to automate some of the HPC commands using Bash. Currently, we download the data from NOAA and create the command file manually, line by line, entering the location of each file along with a time for the program to read the data from that file and a wind magnification factor. Each download NOAA produces contains hundreds of these data files, and a new download comes out every 6 hours or so when a storm is in progress. This means that much of our time during a storm is spent making these command files.

Problem

I am limited in the tools I can use to automate this process because I simply have a user account and a monthly allotment of time on the supercomputers; I do not have the privilege to install new software on them. Plus, some of them are Crays, some are IBMs, some are HPs, and so forth. There isn't a consistent operating system between them; the only similarity is they are all Unix-based. So I have at my disposal tools like Bash, Perl, awk, and Python, but not necessarily tools like csh, ksh, zsh, bc, et cetera:

$ bc
-bash: bc: command not found

Further, my lead scientist has requested that all of the code I write for him be in Bash because he understands it, with minimal calls to external programs for things Bash cannot do. For example, it cannot do floating point arithmetic, and I need to be able to add floats. I can call Perl from within Bash, but that's slow:

$ time perl -E 'printf("%.2f", 360.00 + 0.25)'
360.25
real    0m0.052s
user    0m0.015s
sys     0m0.015s

1/20th of a second doesn't seem like a long time, but when I have to make this call 100 times in a single file, that equates to about 5 seconds to process one file. That isn't so bad when we are only making one of these every 6 hours. However, if this work is abstracted to a larger assignment, one where we point 1,000 synthetic storms at the Atlantic basin at one time in order to study what could have happened had the storm been stronger or taken a different path, 5 seconds quickly grows to more than an hour just to process text files. When you are billed by the hour, this poses a problem.

Question

What is a good way to speed this up? I currently have this for loop in the script (the one that takes 5 seconds to run):

for FORECAST in $DIRNAME; do
    echo $HOURCOUNT"  "$WINDMAG"  "${FORECAST##*/} >> $FILENAME;
    HOURCOUNT=$(echo "$HOURCOUNT $INCREMENT" | awk '{printf "%.2f", $1 + $2}');
done

I know a single call to awk or Perl to loop through the data files would be a hundred times faster than calling either once for each file in the directory, and that these languages can easily open a file and write to it, but the problem I am having is getting data back and forth. I have found a lot of resources on these three languages alone (awk, Perl, Python), but haven't been able to find as much on embedding them in a Bash script. The closest I have been able to come is to make this shell of an awk command:

awk -v HOURCOUNT="$HOURCOUNT" -v INCREMENT="$INCREMENT" -v WINDMAG="$WINDMAG" -v DIRNAME="$DIRNAME" -v FILENAME="$FILENAME" 'BEGIN{ for (FORECAST in DIRNAME) do
    ...
}'

But I am not certain that this is correct syntax and, if it is, whether it's the best way to go about this, or whether it will even work at all. I have been hitting my head against the wall for a few days now and decided to ask the internet before I plug on.
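A minimal sketch of the single-awk-call idea described above, not a tested drop-in for the real script. The function name write_command_file, its argument order, and the sample file names are hypothetical. It assumes at least one file is passed and that file names contain no newlines. Bash only builds the list of base names; awk does all of the floating-point addition in one process.

```bash
#!/usr/bin/env bash
# write_command_file HOURCOUNT INCREMENT WINDMAG OUTFILE FILE...
# (hypothetical helper) Appends one line per FILE to OUTFILE using a single awk process.
write_command_file() {
    local hourcount=$1 increment=$2 windmag=$3 outfile=$4
    shift 4
    # "${@##*/}" strips the directory part from every remaining argument.
    printf '%s\n' "${@##*/}" |
        awk -v hour="$hourcount" -v inc="$increment" -v mag="$windmag" '
            { printf "%.2f  %s  %s\n", hour, mag, $0; hour += inc }
        ' >> "$outfile"
}

# Demonstration with made-up file names
outfile=$(mktemp)
write_command_file 360.00 0.25 1.2 "$outfile" /data/fort.221 /data/fort.222
cat "$outfile"
```

In the original script the call would presumably be something like write_command_file "$HOURCOUNT" "$INCREMENT" "$WINDMAG" "$FILENAME" $DIRNAME, with $DIRNAME left unquoted so that it expands to the file list as it does in the for loop.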

Foodstuff answered 2/7, 2014 at 18:54 Comment(11)
If you have Perl and Python available, why don't you write your scripts entirely in them? The inefficiency you saw comes from having to start up the entire Perl interpreter just for one statement. If you have a Perl script with 50-100 lines, it will be very efficient because the startup and parsing cost is amortized.Daffi
Because the work is already done, besides the inefficiency. I would have to start over. Further, my PI prefers I write this in Bash. I will edit the question to include that information.Foodstuff
One possibility is to start up a Perl coprocess. Then you can feed floating point expressions to it and it will send back the result.Daffi
@Daffi wow, that actually looks like a great idea. I'd never heard of that before. I will try that and comment back, but it sounds like exactly what I need.Foodstuff
Can you use a template to aggregate many/all of the data in these files into fewer external calls? for example, create an array of the contents of multiple files in some structured format, then call perl with the contents of the array.Christos
bash allows loadable modules -- if you want to add a new builtin that does the floating-point math you need, you can write that in C and load it at runtime.Bohman
"When you are billed by the hour, this poses a problem." I think you may be able to make a good business case to your PI for using Perl or Python.Gyneco
@halfer, yes, PI is Principal Investigator, the lead scientist on a project.Foodstuff
I wonder if you can write Perl scripts that look enough like bash that your PI will be able to understand them. In fact it might be easier to understand than a bash script littered with stuff like perl -E 'printf("%.2f", 360.00 + 0.25)'.Croon
Using the time data I gathered, I was able to convince my PI to let me rewrite the script in Python. The time to execute went from about 5 seconds down to about a third of a second. Thank you all for nudging me enough to rewrite it using the correct tool for the job.Foodstuff
Has the PI ever looked at Python code? Writing numerical computation in Bash syntax makes about as much sense as writing an operating system in Fortran.Demagogy
V
3

Bash is very capable as long as the supporting tools you need are available. For floating point, you basically have two options: either bc (which, at least on the box you show, isn't installed [which is kind of hard to believe]) or calc (calc-2.12.4.13.tar.bz2).

Both packages are flexible, very capable floating-point programs that integrate well with bash. Since the powers that be have a preference for bash, I would investigate installing either bc or calc. (job security is a good thing)

If your superiors can be convinced to allow either perl or python, then either will do. If you have never programmed in either, both will have a learning curve, python slightly more so than perl. If your superiors there can read bash, then translating perl would be much easier to digest for them than python.

This is a fair outline of the options you have given your situation as you've explained it. Regardless of your choice, the task for you should not be that daunting in any of the languages. Just drop a line back when you get stuck.

Vaules answered 2/7, 2014 at 20:9 Comment(4)
Yes, I wonder if searching each box for bc might be worthwhile - could this just have dropped out of the path?Pharmacopsychosis
I would have much preferred to write all of this in csh; that is the shell I am most comfortable in. I will ask to see if we can get bc installed on the machine I tested.Foodstuff
It is usually installed in /usr/bin so unless you have completely lost your executable path typing bc should work. If for some strange reason, the permissions on bc are mucked up, it would not show as executable, but a ls -al of your executable path would find it. Check your path with set | grep ^PATH and go from there.Vaules
Using the time data I gathered, I was able to convince my PI to let me rewrite the script in Python. The time to execute went from about 5 seconds down to about a third of a second. Thank you for nudging me enough to rewrite it using the correct tool for the job.Foodstuff
A
1

Starting awk or another command just to do a single addition is never going to be efficient. Bash can't handle floats, so you need to shift your perspective. You say you only need to add floats, and I gather these floats represent a duration in hours. So use seconds instead.

for FORECAST in $DIRNAME; do
    # Whole hours, then hundredths of an hour, using integer arithmetic only
    printf "%d.%02d  %s  %s\n" \
        $((SECONDCOUNT / 3600)) \
        $(((SECONDCOUNT % 3600) * 100 / 3600)) \
        "$WINDMAG" \
        "${FORECAST##*/}" >> "$FILENAME"

    SECONDCOUNT=$((SECONDCOUNT + SECONDS_INCREMENT))
done

(printf is standard and much nicer than echo for formatted output)

EDIT: Abstracted as a function and with a bit of demonstration code:

function format_as_hours {
    local seconds=$1
    local hours=$((seconds / 3600))
    local fraction=$(((seconds % 3600) * 100 / 3600))
    printf '%d.%02d' $hours $fraction
}

# loop for 0 to 2 hours in 5 minute steps
for ((i = 0; i <= 7200; i += 300)); do
    format_as_hours $i
    printf "\n"
done
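A variation on the same integer-only idea, for the case where the inputs are already decimal hours such as 360.00 and 0.25. Instead of seconds, this sketch uses hundredths of an hour as the base unit. The helper names to_hundredths and from_hundredths are hypothetical, and it assumes non-negative values with at most two meaningful decimal places (extra digits are truncated). Results are stored with printf -v rather than captured with $(...), so no subshell is forked.

```bash
#!/usr/bin/env bash
# to_hundredths DECIMAL VARNAME: store DECIMAL * 100 as an integer in VARNAME.
# e.g. "360.25" -> 36025, "0.5" -> 50, "360" -> 36000
to_hundredths() {
    local whole=${1%%.*} frac=00
    [[ $1 == *.* ]] && frac=${1#*.}00
    # 10# forces base 10 so digits like "08" are not parsed as octal.
    printf -v "$2" '%d' $(( 10#${whole:-0} * 100 + 10#${frac:0:2} ))
}

# from_hundredths INTEGER VARNAME: store INTEGER / 100 with two decimals in VARNAME.
from_hundredths() {
    printf -v "$2" '%d.%02d' $(( $1 / 100 )) $(( $1 % 100 ))
}

to_hundredths 360.00 count
to_hundredths 0.25 step
for _ in 1 2 3; do
    count=$(( count + step ))
done
from_hundredths "$count" HOURCOUNT
echo "$HOURCOUNT"   # prints 360.75
```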
Ankylosis answered 2/7, 2014 at 19:37 Comment(7)
Would this not pose a problem if $((SECONDCOUNT / 3600)) was fractional?Foodstuff
Bash will discard any fractional part, just like integer division in C.Ankylosis
Then that doesn't do me any favors. I have to maintain the fractional component.Foodstuff
That was what I was trying to show -- if you choose your base unit small enough (seconds, milliseconds, whatever), you can do fractional calculations using only integers, and still output the results as a proper floating point number.Ankylosis
I can certainly see that, and seconds works perfectly. However, when converting back to hours (which the software we use depends on; this isn't our design) won't it discard the fractional component?Foodstuff
Doesn't my example demonstrate that it works? Basically, first I calculate the whole part, then in a separate calculation the fractional part, and finally I use printf to print it with proper formatting. // Though I see from your other comments that you've already rewritten the program in Python. That's certainly the technically better solution. I just wanted to show that, if it was unavoidable, the task could be done using only integer arithmetic.Ankylosis
Yes, after reading your edit I can see that you were right all along. My bad. Bash isn't my strong suit.Foodstuff
O
-2

If all these computers are unices, and they are expected to perform floating point computations, then each of them must have some fp-capable app available. So use a compound command along the lines of bc -l some-comp || dc some-comp || ... || perl some-comp
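One way the fallback chain above could be sketched is to probe for a tool once with command -v and define a wrapper accordingly, rather than retrying the whole chain on every call. The wrapper name fp_add and the particular order (bc, then awk, then perl) are assumptions for illustration. This still starts one external process per addition, so it addresses portability only, not speed.

```bash
#!/usr/bin/env bash
# Define fp_add once, based on the first calculator found on this machine.
if command -v bc >/dev/null 2>&1; then
    # bc prints e.g. ".50" without a leading zero, so normalise with printf.
    fp_add() { printf '%.2f\n' "$(printf '%s + %s\n' "$1" "$2" | bc -l)"; }
elif command -v awk >/dev/null 2>&1; then
    fp_add() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a + b }'; }
elif command -v perl >/dev/null 2>&1; then
    fp_add() { perl -e 'printf "%.2f\n", $ARGV[0] + $ARGV[1]' "$1" "$2"; }
else
    fp_add() { echo "no floating-point tool found" >&2; return 1; }
fi

fp_add 360.00 0.25   # prints 360.25
```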

Oquendo answered 2/7, 2014 at 19:32 Comment(1)
Or perhaps even echo "$HOURCOUNT $INCREMENT" | awk '{printf "%.2f", $1 + $2}'. The problem is time. I can pipe output from awk into Bash, but it takes forever.Foodstuff
