## ARSC HPC Users' Newsletter 348, September 15, 2006

### Happy 50th Birthday

A dear friend to all computer users celebrated its 50th birthday this week. For some, its occasional flakiness has tested our resolve. However, we wouldn't be where we are today without its contributions.

On September 13th, 1956, IBM shipped the first magnetic disk drive, a huge 4.4 MB disk, to Zellerbach Paper in San Francisco.

"Huge" is the operative term. The unit was five feet tall, 5'8" long and 29" deep. It had 50 24" platters for storage but only a single pair of read/write heads, one for the 50 top sides of the disks, one for the 50 bottoms. Its maximum seek time was 0.7 seconds and its transfer rate was about that of a dial-up modem download today.

In honor of the event, Betty Higbie baked and decorated a cake resembling a disk pack. (Thanks, Betty!) Said cake was consumed by ARSC staff and student assistants before lunch. Here's a silent movie of the RAMAC 305 in action .

### How to Beat Moore's Law: An Optimization Story in 6 Acts - Part II

[ Many thanks to Tom Logan of ARSC for the conclusion of this thriller! ]

 ACT V Directives Save The Day ACT VI Conclusions of the Super-Linear Kind

As you'll recall from > ACTs I - IV of this story, I was tasked with optimizing a tsunami code to reduce the 49 hour estimated run time to something practical. After compiler optimizations and hardware upgrades failed to help, I profiled the code to discover that 91% of the run time was spent in one routine. We pick up the story here...

ACT V: Directives Save The Day

Upon closer examination of the source code, I see 2 doubly nested do loops:

```
Loops on lines: 709/711 - 730/731

The variables being set here, ip1, jp1, jm1, tot_n, xm are all
temporary, only used inside this loop.  The real array being
calculated is l%m(i,j,2) which is NEVER referenced in these loops.
What is referenced is l%m(i,j,1).  This is what the compiler sees
as a dependency.  I can see that it is not, so I can insert some
directives to the compiler that tells it so.

Loops on lines: 733/735 - 754/755

This is pretty much the same story as the previous loop set.
Variables being set here are jp1, im1, ip1, tot_m, and xn, all of
which are temporaries used only in this loop set.  The real array
being calculated is l%n(i,j,2), which is never referenced in the
loops.  Again, l%n(i,j,1) is referenced and this is the compiler's
problem.  Inserting a couple of directives should fix the problem.
```

Of course, as usual for me, I couldn't remember the exact syntax of the directive that I needed. I just knew that it had something to do with 'independent' loops. After a brief search for that word in the IBM Fortran Language Reference guide, I found what I was looking for. The new routine looks exactly as before, except 1 directive has been added 4 times:

```
!-----------------------------------------------------------------------
subroutine momt_s (l,layid)
! ....Solve momentum equation (linear) in spherical coord.
!     layid = 1, outest layer
!     otherwise, inner layer
!-----------------------------------------------------------------------
use layer_params
type (layer)      :: l
integer       :: layid
!      real z(ix,jy,2),m(ix,jy,2),n(ix,jy,2),h(ix,jy)
!      real r2(ix,jy),r3(ix,jy),r4(ix,jy),r5(ix,jy)
data eps/1.0e-6/, zero/0.0/, twlvth/0.08333333333333/
!
ixm1 = l%ix-1
jym1 = l%jy-1
is = 2
js = 2
if (layid .eq. 1) then
is = 1
js = 1
end if
!IBM*  INDEPENDENT
do i=is,ixm1
ip1 = i+1
!IBM*  INDEPENDENT
do j=js,l%jy
if ((l%h(i,j).gt.zero) .and. (l%h(ip1,j).gt.zero)) then
if (j .le. js) then
jm1 = js
else
jm1 = j-1
endif
if (j .ge. l%jy) then
jp1 = l%jy
else
jp1 = j+1
endif
tot_n = l%n(i,j,1)+l%n(ip1,j,1)+l%n(i,jm1,1)+l%n(ip1,jm1,1)
xm = l%m(i,j,1)-l%r2(i,j)*(l%z(ip1,j,2)-l%z(i,j,2))+l%r3(i,j)* &
tot_n-l%r2(i,j)*twlvth*((l%z(ip1,jp1,2)-2*l%z(ip1,j,2)+    &
l%z(ip1,jm1,2))-(l%z(i,jp1,2)-2*l%z(i,j,2)+l%z(i,jm1,2)))
if (abs(xm) .lt. eps) xm = zero
l%m(i,j,2) = xm
else
l%m(i,j,2) = 0.0
end if
end do
end do
!
!IBM*  INDEPENDENT
do j=js,jym1
jp1 = j+1
!IBM*  INDEPENDENT
do i=is,l%ix
if ((l%h(i,j).gt.zero) .and. (l%h(i,jp1).gt.zero)) then
if (i .le. is) then
im1 = is
else
im1 = i-1
endif
if (i .ge. l%ix) then
ip1 = l%ix
else
ip1 = i+1
endif
tot_m = l%m(im1,j,1)+l%m(im1,jp1,1)+l%m(i,j,1)+l%m(i,jp1,1)
xn = l%n(i,j,1)-l%r4(i,j)*(l%z(i,jp1,2)-l%z(i,j,2))-l%r5(i,j)* &
tot_m-l%r5(i,j)*twlvth*((l%z(ip1,jp1,2)-2*l%z(i,jp1,2)+    &
l%z(im1,jp1,2))-(l%z(ip1,j,2)-2*l%z(i,j,2)+l%z(im1,j,2)))
if (abs(xn) .lt. eps) xn = zero
l%n(i,j,2) = xn
else
l%n(i,j,2) = 0.0
end if
end do
end do
!
return
end
```

When I recompile and look at the .lst file now, I see (output modified to fit in 80 columns):

```
Source    Loop Id    Action / Information
Line
------    -------    ----------------------------------------------------------
736             1    Private variables recognized in this loop nest:
"@CIV2", "xn", "xn", "@CIV2", "xn", "xn", "@NewIV1",
"@NewIV0", and "j".
736             1    Loop has been automatically parallelized.

Source    Loop Id    Action / Information
Line
------    -------    ----------------------------------------------------------
736             1    Outer loop has been unrolled 2 time(s).
736             1    Private variables recognized in this loop nest:
"@DCIV1", "xn", "xn", "@CIV2", "xn", "xn", "@CIV2",
"xn", "xn", "@NewIV1", "@NewIV0", and "j".
736             1    Loop has been automatically parallelized.

Source    Loop Id    Action / Information
Line
------    -------    ----------------------------------------------------------
739             1    Private variables recognized in this loop nest: "xn",
"xn", "xn", "i.rnn2FF", "@NewIV0", and "i".
739             1    Loop has been automatically parallelized.

Source    Loop Id    Action / Information
Line
-------   --------   ----------------------------------------------------------
710             1    Private variables recognized in this loop nest:
"@CIV0", "xm", "xm", "@CIV0", "xm", "xm", "i.rnn2FF",
"@NewIV3", "@NewIV2", and "i".
710             1    Loop has been automatically parallelized.

Source    Loop Id    Action / Information
Line
-------   -------    ----------------------------------------------------------
713             1    Private variables recognized in this loop nest: "xm",
"xm", "xm", "@NewIV2", and "j".
713             1    Loop has been automatically parallelized.

Source    Loop Id    Action / Information
Line
------    -------    ----------------------------------------------------------
710             1    Loop interchanging applied to loop nest.
710             1    Outer loop has been unrolled 2 time(s).
710             1    Private variables recognized in this loop nest:
"@DCIV4", "xm", "xm", "@CIV0", "xm", "xm", "@CIV0",
"xm", "xm", "i.rnn2FF", "@NewIV3", "@NewIV2", and "i".
710             1    Loop has been automatically parallelized.
```

I don't know why the file has redundant entries, or exactly what these new variables are (@CIV2, @NewIV1, ...). But, I do see that the loops have been parallelized and that in each loop private variables have been recognized by the compiler for those loops.

Upon re-running the modified code, my fears were finally abated. The new code ran 100 iterations per minute with just 4 threads. Using 8 or 16 threads only made it better, with a best run giving nearly 143 iterations per minute when using 16 threads on the power5 node with AIX 5.3.

As a final experiment, I back ported my modifed routine onto a power4 node of Iceberg. It worked beautifully, giving me just under 91 iterations per minute - plenty fast enough to get my full 24780 iterations in the standard 8-hour queue on that system.

Before reaching any amazing conclusions, I had one more thing to do. Verify that the directives I'd introduced to the code did not alter the outputs. I'll admit that after my incredible success I was very nervous to proceed along these lines. However, correct science is certainly more important than quick science and I eventually choked back my trepidity. Comparison of the intial (serial) outputs with the final (auto parallelized with directives) outputs showed no difference. I was in the clear!

ACT VI: Conclusions of the Super-Linear Kind

To summarize the story, I needed to run 24780 times steps of my code. Using 1 iceberg p655 node, I was only able to complete 4051 iterations in the 8 hours allowed by our standard queue. By moving to the a power5 node on iceflyer, I was able to get about 15,000 time steps in the 16 hours allowed on those queues. Finally, by profiling my code and adding only one directive 4 times into a single routine, I was able to get all 24780 iterations in just 167 minutes on a power5 node and 235 minutes on a power4 node. The following graph illustrates my optimization progess.

Speed-up analysis gives:

```
power4 -> power5                        ~ 2 times
power4 -> power4 w/ directives          ~ 11 times
power5 -> power5 w/ directives          ~ 9 times
power4 -> power5 w/ directives          ~ 17 times
```

Since each of the power nodes has just 8 processors, and adding in the directives gave me 11 and 9 times speed-up respectively, this optimization exercise has given the always desired and often envied super-linear speed-up. All by adding one directive 4 times.

For me, the main moral of the story is that Moore's law is nice, but a whole lot more can be accomplished by a bit of profiling and, in some cases, a couple of simple directives. In addition, with a little help, auto parallelization on our IBM systems can be very effective. Finally, last but certainly not least, when seeking optimization, ALWAYS PROFILE YOUR CODE FIRST!

### Java for Fortran Programmers: Part II

[[ Thanks to Lee Higbie of ARSC for this tutorial. ]]

This is the second in a series of articles, presented as a tutorial, for scientists and engineers. First off, thank you to those who communicated an interest in this series. We appreciate your input and feedback!

The topic today is:

• How the OOP Mindset Differs from that Usual for Fortran Programmers

In the imperative languages common in supercomputing, the programmer generally thinks primarily about the control flow of the program and the layout of arrays of data. For example, the temperature in a weather program will be one (or several) 3D array(s) of floating point temperatures. The Fortran compiler will spend extensive resources trying to produce efficient code operating on these arrays. It may analyze blocks of hundreds of source code statements to improve the way it uses cache, registers and arithmetic units.

For making weather maps let me assume that various prognostic variables are passed to a mapping routine. This separate routine will create the charts we see in weather forecasts. The code would be something like:

```
call weather_map (map_time, temperature, pressure, wind_velocity)
```

Notice that the syntactic emphasis is on the verb, "call," the imperative.

For object-oriented programming, the programmer generally focuses on the classes, the data structures and the things they do. Often the program is event-driven, so the control flow is complex and not generally knowable by the programmer. My impression is that the Java compiler spends its time ensuring that the operation is correct. Java compilers, for example, will not compile code with uninitialized variables or mismatched argument/parameter types.[*]

A Java weather forecasting program, which is not currently realistic because Java codes run so much slower than Fortran, is more likely to have a Weather class that includes data structures for, say, temperature, pressure, and wind velocity. One function of this class would be to generate a weather map. The call to produce the weather map would be passed only the time because the class instance itself would contain all the weather data and the class would contain a method to create the map. The Java might be:

```
weather.createMap (mapTime);
```

Here the weather object includes the temperature, pressure and wind velocity as well as whatever other types of data it may need. In this case the syntactic emphasis is on the object weather. The programmer thinks of the method createMap as operating on the object weather, which is an instance of the class Weather (Java is case sensitive).

This syntactic distinction deserves more explanation. The createMap method (function) requires the name of the object it operates on, weather. In Fortran and other imperative languages, the data structure is just another parameter for the procedure call. In those languages there is no ownership of the function by the data structure. The map generator will merrily generate a nonsense map from

```
call weather_map (map_time, pressure, temperature, wind_velocity)
```
where the pressure and temperature are interchanged.

In object oriented languages, the methods belong to the class for the object and generally work only on objects of that class. This means that the map generation method is syntactically in the same file as the data structure for the plot. Generic methods in Java, such as those for printing or computing the square root, use an abnormal syntax because they are action driven (conceptually), not object driven.

The example I plan to use in later chapters of this tutorial is the creation of Fortran compile statements from a GUI. For this I will eventually build up to using a CompilerOption, which includes Strings for the compile statement, the GUI Checkbox, the GUI popup balloon with an extended explanation, and the default setting of the compiler switch. This class will extend Checkbox so it will display as desired and will monitor itself so that when the mouse pointer hovers over it, it will popup its extended explanation.

One thing these CompilerOption objects cannot do is set other options or detect incompatibilities between them. That must be done by an object or class that is aware of multiple options. Each CompilerOption object, that is, each object that is an instance of the CompilerOption class, has all the information that it needs to behave, but knows essentially nothing about any of the other instances of the CompilerOption class. An object can know the information to set or clear other objects, but some other, higher level in the calling tree, object will have to do the work.

I mentioned the "calling tree" in the paragraph above. When creating a Java program there are typically three hierarchies that need to be kept in mind.

1. The "flow chart" hierarchy that shows which classes are referenced from each class.
2. The calling tree or control flow hierarchy that shows how the thread of control reached a point in the program. This is given by the traceback when a program aborts, for example. Every event, mouse action, button click, ..., starts a thread of control. Most Java programs react more to external events than by executing on their own.
3. The class hierarchy, which describes the syntactic relationship of the classes to their parents. For Java classes, this forms a tree (an acyclic graph) rooted at Object.

If you check the documentation or use a book on one of the Java packages, such as the abstract windowing toolkit (java.awt.* and java.awt.event.*), you will see the position in the class hierarchy for each class. For example, Checkbox is a subclass of Component which is a subclass of Object. Syntactically, Checkbox extends Component and any class that does not explicitly extend another class, implicitly extends Object. Because CompilerOption extends Checkbox, it inherits all its functionality and, by default, will be able to display itself as a Checkbox.

The primary use that I see for Java in scientific/engineering programming is for GUIs and control programs, programs that are event driven. Such programs are inherently more complex than control-driven programs--you cannot determine the execution thread through the code from the code itself, nor even from the code and data.

In this chapter I have described some of the major differences in the way you need to look at and think about these event-driven and object-oriented programs.

In Chapter 1 I promised a program. Here's HelloWorld in Java. Because HelloWorld does not explicitly extend any other class it is only a subclass of Object.

```
class HelloWord { //  Must be in file named HelloWorld.java
public static void main (String [] args) {
// required for main entry point
System.out.println ("Hello World");
/* This is atypical and rare syntax:  out = field
of System class that represents the always open
output stream println uses methods from PrintStream
class */
} // end of method main
}            // end of class HelloWorld
```

The Java loader starts application execution at a method called main that is public, static (not an object method), has void return type and whose argument is an Array of Strings

In the next chapter I will describe the important syntactic differences between Java and Fortran/C. Among other things this will elaborate on the class hierarchies, Java's keywords, and the syntax <object name> . <method name> (parameters)

--

[*] Footnote: A routine calls a subprogram with "parameters," but all the subprogram gets is "arguments." At least that's the way I use the terms.

### Quick-Tip Q & A

```
A:[[Here's a conditional statement grabbed from the (/bin/sh)
[[ configure script for mysql.  There are many like this:
[[
[[     if test X"\$mysql_cv_compress" != Xyes; then
[[         # ...do stuff...
[[     fi
[[
[[ For my scripts, the following style has always worked:
[[
[[     if [[ \$mysql_cv_compress != yes ]]; then
[[         # ...do stuff...
[[     fi
[[
[[ So, two questions:
[[   1) Why would the experts use "test" rather than the square
[[      bracket syntax?
[[   2) Why bother with that "X" ???
[[

#
# Thanks once again to Greg Newby:
#

>
> 1) Why would the experts use "test" rather than the square bracket
>    syntax?
>

I think this is just personal preference.  On systems I tested,
"test" and "[" are the exact same program:

% ls -li /bin/test /bin/[
448252 -r-xr-xr-x   2 root  wheel  18104 Mar 20  2005 /bin/[*
448252 -r-xr-xr-x   2 root  wheel  18104 Mar 20  2005 /bin/test*

There might be some shells with built-in "test" commands, which
could speed things up a bit.

>
>     2) Why bother with that "X" ???
>

Error handling.  If a variable is accidentally not set, then the syntax
would be bad.  For example, if \$mysql_cv_compress was not set in your
example, the shell would see:

if [[ != yes ]]; then

which would be bad syntax.  By putting the X there, the shell gets
this if the variable isn't set:

if [[ X != yes ]]; then

which is better.  Even better still would be to surround the variable
with quotes (the X can go inside or outside):

if [[ X"\$mysql_cv_compress" != Xyes ]]; then

so that values with spaces won't cause an error.  What if the value
for \$mysql_cv_compress were, "yes, it is" ?  If the shell sees this,
it won't know what to do with the extra words (since spaces are word
delimiters):

if [[ Xyes it is != Xyes ]]; then

Basically, the quotes and the extra character help to prevent the
script from bombing. Can this handle all problems?  Not with this
level of checking, but it's an improvement.

#
# Editor's Note:
#

Regarding the syntax, "[ condition ]" versus "[[ condition ]]":

My O'Reilly korn shell book claims that "[ condition ]"
is "old style" and equivalent to "test," as noted by Greg.  The bash
and ksh man pages basically agree on this: "Field splitting and file
name generation are not performed on the words between [[ and ]]."
The book suggests using the "[[ condition ]]" syntax to "make quoting
less necessary."

Q: I'm curious what other people use for their shell prompt setting.
(Your prompt is set via the "PS1" environment variable under
sh/ksh/bash and via the "prompt" variable under csh/tcsh.)```

[[ Answers, Questions, and Tips Graciously Accepted ]]

Current Editors:
 Ed Kornkven ARSC HPC Specialist ph: 907-450-8669 Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678 Arctic Region Supercomputing Center University of Alaska Fairbanks PO Box 756020 Fairbanks AK 99775-6020
E-mail Subscriptions:
Archives: