## ARSC HPC Users' Newsletter 250, July 22, 2002

### Bicenquinquagennial Issue

As you may have noticed, this is the 250th issue of the ARSC Users' Newsletter! We dispensed our first issue nearly 8 years ago, on August 25, 1994. All past issues are on-line, at:

http://www.arsc.edu/support/news/HPCnews.shtml

Past "Quick-Tips" are all indexed at:

http://www.arsc.edu/support/news/qtindex.xml

However... This might be YOUR first issue, since I just updated our mailing list with new ARSC users. (If you'd rather not receive this newsletter, unsubscribing is described at the bottom.)

### Optimizing with IBM Vector Intrinsics and "xlf -qhot"

#### Intro

IBM's massv library is available on icehawk for hand-tuning of vectorizable math intrinsics. XLF (with the options -O3 -qhot) does a great job of detecting vectorizable operations and may actually do all the work for you. In either case, the vector intrinsics can really speed things up.

As a quick example, this loop:
```
      do ni = 1,nint
        loghaz(ni) = dlog(haz(ni))
      enddo
```
can (and should) be replaced by:
```
      call vlog (loghaz, haz, nint)
```
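
For readers who want to try the call sequence before linking against massv, here is a minimal sketch in free-form Fortran 90. The subroutine `vlog_standin` is hypothetical: it only mimics the argument order used in the call above (output array, input array, length). The array length and the test data are arbitrary assumptions.

```fortran
! Sketch: replace a scalar dlog loop with one vector-style call.
! vlog_standin is a HYPOTHETICAL stand-in that copies the argument order of
! the newsletter's call to vlog (output, input, length), so this program
! runs without massv. With massv linked, call vlog instead.
program vlog_demo
  implicit none
  integer, parameter :: nint = 352            ! ASSUMED array length
  double precision :: haz(nint), loghaz_loop(nint), loghaz_vec(nint)
  integer :: ni

  ! ASSUMED test data: positive values, since log needs positive arguments.
  do ni = 1, nint
     haz(ni) = 0.5d0 + dble(ni)
  enddo

  ! Original form: one scalar intrinsic call per iteration.
  do ni = 1, nint
     loghaz_loop(ni) = dlog(haz(ni))
  enddo

  ! Vector form: a single call processes the whole array.
  call vlog_standin(loghaz_vec, haz, nint)

  print *, 'max abs difference:', maxval(abs(loghaz_vec - loghaz_loop))

contains

  ! Stand-in only: applies the intrinsic log to every element.
  subroutine vlog_standin(y, x, n)
    integer, intent(in) :: n
    double precision, intent(out) :: y(n)
    double precision, intent(in) :: x(n)
    y = log(x)
  end subroutine vlog_standin

end program vlog_demo
```

With the stand-in, the two results should agree to within rounding. After swapping in the real `vlog`, small differences may appear, so it is safer to compare against a tolerance than to expect identical bits.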

The vector intrinsics library is called "massv". ("V" for "vector" and "MASS" for "Mathematical Acceleration SubSystem".) There's also "massvp3," tuned for the Power3, and the basic "mass" library, for scalar operations.

On icehawk, these libraries are available in:

/usr/local/pkg/mass/current/lib

/usr/local/lib

/usr/local/pkg/mass/current/wrk

Here are excerpts from the README file: the list of vector math intrinsics, performance data, and explanations/caveats.
```
----------------------------------------------------------------------

Math Library Performance
(cycles per evaluation, length 1000 loop)

                       604E                   630                    P2SC
function   range  libm  mass massv    libm  mass massv   vp3    libm  mass massv
vrec         D     32*          10      9*           6     4      8*           4
vsrec        D     18*           8      7*           5     3      8*           4
vdiv         D     32*          12      9*           7     5      9*           5
vsdiv        D     18*          10      7*           6     3      9*           4
vsqrt        C      67    48    16     11*           9     6     13*           7
vssqrt       C      70    48    10      7*           8     5     13*           5
vrsqrt       C      79    49    16     22*           9     6     22*           7
vsrsqrt      C      83    51     9     16*           7     4     22*           5
vexp         D      83    45    16      64    33     6            53    21     7
vsexp        E      85    44    13      68    36     5            58    21     6
vlog         C      99    56    20      83    53     8            67    35     8
vslog        C     102    56    17      86    57     7            66    37     7
vsin         B      50    29    11      36    16     5            34    17     5
vsin         D      79    59    27      60    43    12            50    37    12
vssin        B      51    26     8      39    18     4            40    16     4
vssin        D      79    58    20      62    46     9            56    38     9
vcos         B      51    26     9      37    16     4            34    17     4
vcos         D      75    59    27      58    43    12            51    36    11
vscos        B      52    26     7      39    18     3            40    16     3
vscos        D      76    59    20      61    46     9            56    37     9
vsincos      B     100    53    19      80    33     8            80    38     8
vsincos      D     151   116    29     123    92    12           111    81    12
vssincos     B     107    55    15      79    38     6            78    36     7
vssincos     D     159   118    24     125    98    10           110    80    10
vcosisin     B     104    55    19      78    34     8            79    37     8
vcosisin     D     156   118    29     123    93    12           111    81    12
vscosisin    B     108    55    15      79    36     6            78    36     6
vscosisin    D     160   119    23     125    95     9           110    79    10
vtan         D     136    74    32     111    52    19            90    38    13
vstan        D     136    74    32     113    56    19            95    39    12
vatan2       D     545   104    40     413    87    25           555    73    17
vsatan2      D     545   104    40     418    89    25           558    71    17
vdnint       D      37    22     7      24    12   3.4            23    13   2.7
vdint        D      36           6      22         2.8            21         2.6
massvp2
vidint    D                                          4.0      2.7
vasin    B                                           48       17
vacos    B                                           49       17
vdfloat    D                                          3.0      1.8
vdsign    D                                            9      3.5

* indicates inline instructions timed (not a subroutine call)

Range Key     Processor     Cycle time         Dcache size
A =    0,  1     604E    3.0 nanoseconds      32 kilobytes
B =   -1,  1      630    5.0 nanoseconds      64 kilobytes
C =    0,100     P2SC    7.4 nanoseconds     128 kilobytes
D = -100,100
E =  -10, 10

"[This] data should be considered approximate. It was obtained by
timing many repetitions of a loop over 1,000 random arguments and
includes all overhead. Timing in this way will bring the input and
output vectors into the on-chip cache (the loop is short enough for
them to fit in cache). Performance may deteriorate seriously when the
input and output vectors are not in cache. Performance may also
deteriorate for arguments at or near the end-points of the valid
argument ranges."
----------------------------------------------------------------------
```

#### Example

For a more informative example, here's a routine from a simulation code under evaluation at ARSC. (A driver for this routine was constructed from the original code, and used for the timings given below.) The array length of 2000 is a do-not-exceed size... in the current data set, "nint" is only 352.

```

      integer ni,nint
      double precision beta,cumhaz,haps,haz,prit,rate,
     &  total
      double precision time(2000)

      prit = 0d0
      total = 0d0

      do ni = 1,nint
        haps = dble(nint)-dble(ni)+2d0
        rate = haps*(haps-1d0)/2d0
        haz = rate*dexp(beta*(total+time(ni)))
        cumhaz = rate*dexp(beta*total)*
     &      (dexp(beta*time(ni))-1d0)/beta
        prit = prit+dlog(haz)-cumhaz
        total = total+time(ni)
      enddo
```

Basically, this loop iterates over the array "time", performing multiple `dexp`s and a `dlog` operation, accumulating and storing intermediate results to the scalar variables.

The loop can be restructured by hand to replace `dexp` with `vexp`. (`dexp` is double precision and `vexp` works on doubles, so this is appropriate.)

In the rewritten code, temporary arrays store entire sequences of scalar values, allowing these sequences to be processed all at once by calls to `vexp`. Obviously, the temporary arrays consume memory, and if the default size in this example were 2,000,000 rather than 2,000, we'd have a problem. On the other hand, I've given them self-documenting names and not attempted to reuse them, so this could be tightened up.

Here's the hand-optimized version:

```

      integer ni,nint
      double precision beta,cumhaz,haps,haz,prit,rate,
     &  total
      double precision time(2000)

      double precision :: v_haz(2000)
      double precision :: v_cumhaz(2000)
      double precision :: v_beta_total_time(2000)
      double precision :: v_dexp_beta_total_time(2000)
      double precision :: v_total(2000)
      double precision :: v_beta_time(2000)
      double precision :: v_dexp_beta_time(2000)
      double precision :: v_dlog_haz(2000)
      double precision :: v_beta_total(2000)
      double precision :: v_dexp_beta_total(2000)

      prit = 0d0
      v_total(1) = 0d0

      do ni = 2,nint
        v_total(ni) = v_total(ni-1) + time(ni-1)
      enddo

      do ni = 1,nint
        v_beta_total_time(ni) = beta * (v_total(ni) + time(ni))
        v_beta_time(ni) = beta * (time(ni))
        v_beta_total(ni) = beta * v_total(ni)
      enddo

      call vexp (v_dexp_beta_total_time, v_beta_total_time, nint)
      call vexp (v_dexp_beta_time, v_beta_time, nint)
      call vexp (v_dexp_beta_total, v_beta_total, nint)

      do ni = 1,nint
        haps = dble(nint)-dble(ni)+2d0
        rate = haps*(haps-1d0)/2d0
        v_haz(ni) = rate*v_dexp_beta_total_time(ni)
        v_cumhaz(ni) = rate*v_dexp_beta_total(ni)*
     &                  (v_dexp_beta_time(ni)-1d0)/beta
      enddo

      call vlog (v_dlog_haz, v_haz, nint)

      do ni = 1,nint
        prit = prit+v_dlog_haz(ni)-v_cumhaz(ni)
      enddo
```

This is a tedious procedure, but it yielded a speedup of over 3x (3.39 seconds down to 0.92 seconds at the default optimization level). Here are wallclock times (average of four runs) for the two versions of the routine, as run out of the driver:

```

                   Version of Code
               =======================
               Original       Hand
                           vectorized
xlf options     (secs)       (secs)
-----------    --------     --------
<default>        3.39         0.92
-O3              3.18         0.83
-O3 -qhot        0.93         0.86

```
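
The driver used for these timings isn't reproduced here, but a driver of this general kind can be sketched in free-form Fortran 90 as follows. Everything specific in it is an assumption: the repetition count, the test data, and the placeholder routine `kernel`, which stands where the routine being timed would go. It uses the standard `system_clock` intrinsic and averages over four runs, as the table does.

```fortran
! Sketch of a wallclock timing driver. kernel is a HYPOTHETICAL placeholder
! for the routine under test; nreps and the test data are ASSUMED values.
program time_driver
  implicit none
  integer, parameter :: nruns = 4         ! the table averages four runs
  integer, parameter :: nreps = 20000     ! ASSUMED repetitions per run
  integer, parameter :: nint  = 352       ! data-set size quoted in the text
  double precision :: time(nint), prit, checksum, elapsed, total_elapsed
  integer :: count0, count1, count_rate, irun, irep, ni

  do ni = 1, nint
     time(ni) = 0.001d0 * dble(1 + mod(ni, 7))
  enddo

  total_elapsed = 0d0
  checksum = 0d0
  do irun = 1, nruns
     call system_clock(count0, count_rate)
     do irep = 1, nreps
        call kernel(time, nint, prit)
        ! Accumulate the result so the compiler is less likely to
        ! discard the repeated calls.
        checksum = checksum + prit
     enddo
     call system_clock(count1)
     ! Assumes the clock counter does not wrap during a run.
     elapsed = dble(count1 - count0) / dble(count_rate)
     total_elapsed = total_elapsed + elapsed
  enddo

  print *, 'average wallclock seconds:', total_elapsed / dble(nruns)
  print *, 'checksum:', checksum

contains

  ! Placeholder work: substitute the routine being evaluated.
  subroutine kernel(t, n, res)
    integer, intent(in) :: n
    double precision, intent(in) :: t(n)
    double precision, intent(out) :: res
    integer :: i
    res = 0d0
    do i = 1, n
       res = res + dlog(1d0 + t(i))
    enddo
  end subroutine kernel

end program time_driver
```

Building the same driver once per set of compiler options, and once per version of the routine, gives a table like the one above.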

It's somewhat reassuring to see that the hand-coded version managed to beat the best compiler time in all cases. Even better, since it's a lot of work to do this by hand, is the observation that with the "high order transformation" option, -qhot, the compiler does almost as well (0.93 versus 0.86 seconds). (Different codes will, of course, respond differently.)

#### Combining manual and compiler optimization

Here's an approach to optimizing a code with massv:

1. Compile with -O3 -qhot.
2. Verify that program output is unchanged or acceptable; reordering execution can change results.
3. Profile the code.
4. Identify math intrinsics (if any) which take significant time.
5. Find the routine(s) where such intrinsics are called.
6. Determine if XLF has transformed them into vector intrinsics.
7. If not, attempt to hand-optimize.
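
Step 2 in the list above can be sketched with a small free-form Fortran 90 program that runs the original scalar loop and the restructured version from the example on the same data and compares the two results. The value of `beta`, the test data, and the tolerance are assumptions chosen for illustration, and the whole-array `exp` and `log` intrinsics stand in for `vexp` and `vlog` so that the sketch runs without massv.

```fortran
! Sketch: check that the restructured routine reproduces the original loop.
! Whole-array exp() and log() are STAND-INS for vexp and vlog.
program check_restructure
  implicit none
  integer, parameter :: nint = 352                ! size quoted in the text
  double precision, parameter :: beta = 0.01d0    ! ASSUMED value
  double precision, parameter :: tol  = 1d-10     ! ASSUMED relative tolerance
  double precision :: time(nint), v_total(nint)
  double precision :: v_exp_btt(nint), v_exp_bt(nint), v_exp_btot(nint)
  double precision :: v_haz(nint), v_cumhaz(nint), v_log_haz(nint)
  double precision :: haps, rate, haz, cumhaz, total
  double precision :: prit_ref, prit_vec, relerr
  integer :: ni

  ! ASSUMED test data: small positive intervals.
  do ni = 1, nint
     time(ni) = 0.001d0 * dble(1 + mod(ni, 7))
  enddo

  ! Reference result: the original scalar loop.
  prit_ref = 0d0
  total = 0d0
  do ni = 1, nint
     haps = dble(nint) - dble(ni) + 2d0
     rate = haps*(haps - 1d0)/2d0
     haz = rate*dexp(beta*(total + time(ni)))
     cumhaz = rate*dexp(beta*total)*(dexp(beta*time(ni)) - 1d0)/beta
     prit_ref = prit_ref + dlog(haz) - cumhaz
     total = total + time(ni)
  enddo

  ! Restructured version: running totals first, then whole-array math.
  v_total(1) = 0d0
  do ni = 2, nint
     v_total(ni) = v_total(ni-1) + time(ni-1)
  enddo
  v_exp_btt  = exp(beta*(v_total + time))
  v_exp_bt   = exp(beta*time)
  v_exp_btot = exp(beta*v_total)
  do ni = 1, nint
     haps = dble(nint) - dble(ni) + 2d0
     rate = haps*(haps - 1d0)/2d0
     v_haz(ni) = rate*v_exp_btt(ni)
     v_cumhaz(ni) = rate*v_exp_btot(ni)*(v_exp_bt(ni) - 1d0)/beta
  enddo
  v_log_haz = log(v_haz)
  prit_vec = 0d0
  do ni = 1, nint
     prit_vec = prit_vec + v_log_haz(ni) - v_cumhaz(ni)
  enddo

  ! Compare with a relative tolerance rather than expecting identical bits.
  relerr = abs(prit_vec - prit_ref)/abs(prit_ref)
  print *, 'reference:', prit_ref, ' restructured:', prit_vec
  print *, 'relative difference:', relerr
  if (relerr > tol) then
     print *, 'FAIL: results differ by more than the tolerance'
     stop 1
  endif
  print *, 'PASS'
end program check_restructure
```

What counts as "acceptable" depends on the application, so the tolerance here is only a placeholder.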

Steps 3 and 6 require additional tools, as follows.

#### Profiling codes on the SP

1. Recompile with the -pg option: xlf -pg -O3 -qhot ...
2. Run the executable as usual.
3. This produces the trace file: gmon.out
4. Use gprof to view the profile: gprof executable_name gmon.out

[ Remember to recompile without -pg before doing any production runs! ]

Here's a snippet of `gprof` output, from the "call graph profile":

```
                                  called/total       parents
index  %time    self descendents  called+self    name           index
                                  called/total       children

                0.05        0.00  515084/4435146     .endpolpr [17]
                0.05        0.00  553902/4435146     .endpolpl [15]
                0.26        0.00 2836611/4435146     .endpolp2 [3]
[13]     4.7    0.40        0.00 4435146         ._log [13]
                0.00        0.00      12/23          ._Errno [193]
```

This calling tree tells us that `log` is called by `endpolpr`, `endpolpl`, and `endpolp2`, and that `log` takes 4.7% of the code's total run time. (4.7% may not be worth worrying about.) It also tells us that `log` calls `Errno`.

From this, we know to examine the potential of subroutine `endpolp2`, which makes 2,836,611 of the 4,435,146 calls, for the replacement of `log` with `vlog`. To avoid wasting time, we must first determine if XLF has already done this, as described next.

#### Locate vector intrinsics added/missed by the compiler

1. Obtain a compiler report by passing XLF the "-qreport" option: xlf -qreport -O3 -qhot ...
2. View the ".lst" file produced by the compiler.

Here's a snippet from a report. It ain't perty, but you can find occurrences of both `_exp` and `CALL __vexp` in this transformed code. Remember, in the source code, `vexp` didn't exist... the compiler has rearranged the loops to replace `exp` with `vexp`.

```

@CSE23 = dr(1)
@CSE24 = _exp(-(rrate * %VAL(@CSE23)))
prob[].off0 = p1 * ( 1.0000000000000000E+000 - @CSE24)
@CSE25 = _exp(-(rrate2 * %VAL(@CSE23)))
prob2[].off0 = p1 * ( 1.0000000000000000E+000 - @CSE25)
temp2 = @CSE24
temp4 = @CSE25
@MARKSTK0 = __getstack()
GOTO lab_83
2913
lab_83
2886
IF ((@ICM6 > 0)) THEN
2893
@NumElements0 = int(int(@ICM6))
&      + (8)*(0)),@NumElements0)
2894
@NumElements1 = int(int(@ICM6))
&      + (8)*(0)),@NumElements1)
2886
@CIV4 = 0
Id=11        DO @CIV4 = @CIV4, int(@ICM6)-1
2898
temp2 = temp2 * @CSE31

```

The compiler report also provides some annotation in English, as shown in the following snippet:

```

Source  Source   Loop Id  Action / Information
File    Line
------  -------  -------  -----------------------------------------
   1     2893             Vectorization applied to statement.
   1     2894             Vectorization applied to statement.
   1     2886      11     The loop on line 2886 was created by the
                          distribution of the loop on line 2886.
```

Given this information, the programmer's goal is to find occurrences of `_exp` within the (transformed) loops. If found, we know that XLF was unable to "vectorize" those loops, and thus the corresponding loops in the original source might possibly be hand-optimized. Loops are identified by the source code line numbers given in column 3 of the transformed code in the report.

In this example, and, in fact, for the complete application code, XLF "vectorized" every occurrence and left nothing to do by hand.


### BLUI in SIGGRAPH Studio

ARSC/UAF's "Body Language User Interface", or BLUI, project will be featured in the "Studio" at SIGGRAPH, next week.

Here's the blurb from the Studio web page (under the new category, "VR," at the bottom):

http://www.siggraph.org/s2002/conference/studio/index.html

VR "New for SIGGRAPH 2002, this area features a system for immersive display configured for 3D solid modeling. Bill Brody of the University of Alaska at Fairbanks demonstrates his "BLUIsculpt" system, in which fully 3D objects can be created and output as .stl files for rapid prototyping."

### DOE Benchmarking

Interesting work on DOE's benchmarking of early systems:

http://www.csm.ornl.gov/evaluation/index.html

```

Evaluation of Early Systems

Computational requirements for many large-scale simulations and ensemble
studies of vital interest to the Department of Energy (DOE) exceed what
is currently offered by any U.S. computer vendor. Examples are numerous,
ranging from global change research to combustion to informatics. It is
incumbent on DOE to be aware of the performance of new or beta systems
from high performance computing vendors that will determine the
performance of future production-class offerings. It is equally
important that DOE work with vendors in finding solutions that will
fulfill DOE's computational requirements.

In support of this mission, Oak Ridge National Laboratory (ORNL) is
currently performing in-depth evaluations of a number of high
performance computer systems,

```

### Fortran Information

Learn about all things Fortran from Michael Metcalf's Fortran 90/95/HPF Information File, at:

http://www.fortran.com/metcalf.htm

• WHERE CAN I OBTAIN A FORTRAN 95 COMPILER?
• OTHER USEFUL PRODUCTS
• WHAT BOOKS ARE AVAILABLE? In these languages:
  • Chinese
  • Danish
  • Dutch
  • English
  • Finnish
  • French
  • German
  • Italian
  • Japanese
  • Russian
  • Swedish
• WHERE CAN I OBTAIN COURSES, COURSE MATERIAL OR CONSULTANCY?
• WHERE CAN I FIND THE FORTRAN AND HPF STANDARDS?

### Quick-Tip Q & A

```
A:[[ June is mosquito month in Fairbanks, and 2002 has been impressive by
  [[ all accounts.  Send us your favorite (short) mosquito story, remedy,
  [[ or advice.  Any luck with mosquito traps, DEET-free dope, or personal
  [[ concoctions?

# The Gadget Award goes to Kate Hedstrom:

We got a "SonicWeb" trap. It has a heartbeat sound that is loud enough
to hear. It also uses heat and Octenol to attract critters. Ours has
trapped all sorts of insects, mostly flies, wasps and itty-bitty things.
We also caught a dragonfly and a few mosquitos.
```
```

My favorite mosquito story was when I was camping in college with a
Buddhist friend of mine.  Angrily slapping mosquitos left and right, she
implored me, "Don't kill them, just brush them away.  They just want a
drop of your blood, and you want to take their life."  I pondered this a
few days and then asked her if she was reincarnated as a mosquito,
whether she might appreciate being sent on to her next life all the
sooner.  She had to agree that that sounded attractive... :)

But in spite of this funny exchange, she truly believed that if you
brush mosquitos away rather than swat at them, then they will leave you
alone.  And ever since then, I have brushed mosquitos away, and don't
remember the last time I was plagued with as many bites as I was when I
was a kid.

Interesting fact that I did not know:  According to Webster on-line, the
plural of mosquito is either -os or -oes.  Lucky Dan Quayle.

# From one of the editors:
The EPA says DEET products are safe ("when used as directed"), but a
quick search on "DEET" && "Gulf War Syndrome" may give you pause.  I do
my best to avoid it. Head nets and long sleeves are the best.  On the
other hand, I'd *never* go hiking, fishing, etc., without some strong bug
dope in my pack.  Swarming mosquitos can drive a person crazy, and make
you do things more dangerous than just wearing DEET.  This summer, I've
used DEET just to work in the yard, wanting to avoid smacking myself in
the head with a shovel in an attempt to swat some bug.

# Tom Logan deserves some award for this...

You probably know the mosquitos in Alaska are big, but the other night I
overheard this from two that were buzzing around my bed:

Mosquito 1:
"I'm tired of eating out, let's pick him up and take
him back to the swamp"

Mosquito 2:
"NO! When we get back, the big ones will take him away
from us!"

Q: I received a "not enough memory" error when trying to compile a large
subroutine (part of a big code) on the T3E with -O3,unroll2 . Is
there any way to increase the memory allocation to f90 or am I stuck
compiling that subroutine with -O2 ? Thanks!
```
```
[[ Answers, Questions, and Tips Graciously Accepted ]]
```

Current Editors:

Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669

Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678

Arctic Region Supercomputing Center, University of Alaska Fairbanks, PO Box 756020, Fairbanks AK 99775-6020

E-mail Subscriptions:
Archives: