## Contents

### Floating-point Fun in Fortran

[ Thanks to Jeff McAllister of ARSC for submitting this article. ]

I recently started working on a ray-tracing code which is quite sensitive to how floating-point numbers are represented in memory. If an angle off by even .05 degrees, it could mean missing by kilometers at the end of the ray.

Aside from anecdotes like "floating-point math is all approximations" and "using very large things and very small things together can be problematic," the details about exactly when difficulties would occur seemed fuzzy. Without some hard numbers to work with I could only guess what was causing the code to misbehave.

This article reports on some handy Fortran 90 inquiry functions, numeric attributes of ARSC systems, lessons learned, and ideas for debugging and testing.

### Range and Precision

Two Fortran 90 functions turned out to be quite useful for debugging:

range(X):

Gives a quick estimate of the highest order of magnitude (plus or minus) available. Thus it's possible to know the values beyond which X will become infinity or zero (and suddenly start propagating NaN's where they didn't before, but more on that later).

precision(X):

Tells you,

1. the number of meaningful digits after the decimal point
2. how far apart two numbers can be (in orders of magnitude) and still have a meaningful result produced when one of the numbers is being used to increment or decrement the other.

### Kind

I'd considered the usual advice that if I can't change the math, I need to increase the precision of the calculation: that if the numbers keep "acting funny," crank precision up some more. Here's a summary of the size of Fortran reals available at ARSC:

```
SGIs:           (default 4),8,16.  double precision=r8
icehawk:        (default 4),8,16.  double precision=r8
yukon:          4,(default 8)      double precision=r8
chilkoot:       (default 8),16     double precision=r16
(r4 on the SV1 appears to be a "crippled" r8, using the same
hardware with lower precision for compatibility.)
```

The above is a list of "kind" values available. That is, a real can be declared real(kind=K)::X in F90 or real*K in F77. Kind corresponds to bytes per variable, so these declarations could be referred to as 32, 64, or 128-bit reals. "Double precision" is also available, but as you see it's just "mapped" to one of the real types and isn't necessarily the highest precision.

### ARSC Systems and Performance

Unfortunately, the current set of hardware is built to handle only 32 or 64-bit words of memory. The highest level of precision can be had only with a huge performance hit, so cranking up the precision isn't always a reasonable option, and must be done with performance in mind.

Here's a table showing the range, precision, and relative performance of different real sizes as run on the ARSC systems. The performance measurements were obtained on runs of a simple integration code (given below), but agree with timings of the ray-tracing code that motivated this work.

```
Kind  =  4               8               16
====            ====            ====
SGI octane
range          37              307             275
precision      6               15              31
runtime        1x              1x              40x

icehawk
range          37              307             291
precision      6               15              31
runtime        1x              1x              10x

chilkoot
range          2465             2465           2465
precision      6                13              28
runtime        1x               1x              10x

yukon
range          37               307    mapped to 8
precision      6                15
runtime        1x               1x
```

The precision and range available on the various systems is due to use of different floating-point formats (also the source of binary file compatibility woes).

### Portability

Fortran 90 provides a way to write portable code in spite of the variations in range and precision. The 'kind' value can be a Fortran 90 parameter, set at runtime by the function,

```
selected_real_kind (Precision,Range)
```

Say I need to do some "adding really small things to really big things", separated by at most 14 orders of magnitude. At the beginning of the code I would declare,

```
integer,parameter::k=selected_real_kind(p=14)
```

On every ARSC platform but the SV1, the code would compile with k=8, while k=16 would be selected automatically on the SV1 because the SV1's r8 format only allows a precision of 13 digits/orders of magnitude. To declare the variables, I could then use, for instance,

```
real (kind=k) :: X
```

### Real consequences, NaN

When numbers fall off the edge of a floating point format strange things start happening. Often this will lead to a crash or hang (which at least indicates that something is wrong). More dangerous is case when normal execution continues, but produces incorrect results.

The IEEE format (used by the SGIs and IBM) deals with things like the inevitable undefined situation more "gently" than, say, the Cray format which demands a crash when a floating point exception occurs. Under IEEE, undefined numbers become "NaN", or "Not a Number". These don't crash the code, instead propagating NaN status to every calculation they are involved in.

As it can be quite difficult to figure out where the first NaN came from, it's useful to test whether a number is NaN or not. Here's a method which at least works on the ARSC systems that generate NaNs:

```
if (int(nan)==int(nan-2.0)) then
print *,"nan detected"
end if
```

This uses 2 features of the IEEE format:

1. The result of any operation involving NaN is NaN
2. NaN, when converted to an integer, is no longer NaN (and uncomparable) but the maximum integer.

These conditions can be tested in a more intuitive fashion -- if they don't cause a crash. Under IEEE, if X grows larger than (+/-) huge(X) (another nifty query function) it becomes (+/-) infinity. This is similar to NaN in that infinity +,-,/,* anything is infinity. A number smaller than tiny(X) becomes zero.

### Precision

There's another twist: a "big" number, like 10^30 can become effectively zero.

It may seem that if abs(X) can be anywhere from tiny(X) to huge(X), there isn't a lot to worry about. Especially if epsilon(X) -- the smallest testable difference between two numbers is pretty small.

If all this is true, then why would this code

```
X=huge(X)
X=X-10**N
```

where N is something big -- say, 30 -- still equal huge(X) on one of the SGIs where a real word is 32 bits?

Looking at the table above, you can see that the SGI 4-byte real only offers six digits of precision. So, it would follow that if log10(huge(X)) is 38, N would need to be around 32 to make a measurable difference when added to such a large number. This is the situation in which 10^30 acts like zero.

Now the situation has clarified into something which can be tested. In general (though there's still a bit of fudging due to binary vs decimal differences):

```
If log10(biggest)-log10(smallest)>precision
```

the result won't be what you intended.

### Putting it all together:

The actual code I was having problems with is far too big for this newsletter, but with the techniques above I was at least able to debug the thing and make it able to provide meaningful error messages.

Here's an example code which I've used to play with some of the concepts I've described. It does simple numerical integration, which can stress the numeric model if the interval width is much, much smaller than the integration interval, and more importantly, it dumps information about the numbers involved.

```
!
! Module mod1
!
!   - set real kind parameter, rkind.
!   - define the function to integrate.
!
module mod1
integer,parameter::rkind=selected_real_kind(p=6)

contains

function F(X) result(Y)
real(kind=rkind) X,Y

Y=X
end function F

end module mod1
!
! program integral
!
!   - set integration interval and interval width
!   - perform the integration
!   - show properties of the numbers
!
program integral
use mod1

implicit none

integer:: prec
real(kind=rkind):: begin,end,X,Y,area,delx,oldX

area=0.0

begin=0.0
end=1.0
delx=.000001

print *,"kind:",kind(delx)
print *,"range:",range(delx)

prec=precision(delx)

print *,"precision:",prec
print *,"log10 delx=",log10(delx)
print *,"log10 end=",log10(end)

if (prec<(int(log10(end)-log10(delx)))) then
end if

print *,"Integrating f(x) from ",begin," to ",end," delx:",delx

X=begin
DO
area=area+(F(X)*delx)
oldX=X
X=X+delx
if (.not.X>oldX) then
print *,"X not incrementing!  Aborted at X=",
&                X, "area aborted at=", area

stop
end if
if (X>end) exit
END DO

print *,"Area under F(X):",area
end
```

Here are results from three runs on an SGI workstation in which I tweaked delx to show a good run (result produced with no warning), incorrect run (result and warning produced), and aborted run. The second, incorrect run, is obviously the most dangerous situation:

```
---
Run with delx=.000001
---
kind: 4
range: 37
precision: 6
log10 delx= -6.
log10 end= 0.E+0
Integrating f(x) from  0.E+0  to  1.  delx: 9.999999975E-7
Area under F(X): 0.493828773

---
Run with delx=.0000001
---
kind: 4
range: 37
precision: 6
log10 delx= -7.
log10 end= 0.E+0
Integrating f(x) from  0.E+0  to  1.  delx: 1.000000012E-7
Area under F(X): 0.451817274

---
Run with delx=.00000001
---
kind: 4
range: 37
precision: 6
log10 delx= -8.
log10 end= 0.E+0
Integrating f(x) from  0.E+0  to  1.  delx: 9.999999939E-9
X not incrementing!  Aborted at X= 0.25 area aborted at= 2.371826395E-2
```

### SP Processors and Nodes... Where is my Job Running?

OpenMP parallel sections spawn a number of threads. The hope is to distribute work across the multiple processors on a shared memory node so your job will finish sooner.

How do you know, however, that the scheduler isn't assigning all your threads to one processor, while the others just spin?

This is a particularly important question on icehawk because nodes, each holding 4 processors, can't be shared by multiple users. Thus, when your loadleveler job is allocated a node, it is actually allocated 4 CPUs, and if YOUR job doesn't use them, they're going to be wasted.

Fortunately, a thread can easily tell which CPU it's on using the system call,

```
int mycpu()
```

This returns an integer from 0 to the number of CPUs per node minus one.

For completeness, you can also print the node id (essentially the node IP address) using,

```
int gethostid()
```

These calls are immediately available to C/C++ codes and an interface for Fortran is easy to build.

### EXAMPLE:

Here's a simple Fortran callable interface to these routines:

```
/* File: util.c */

#include <unistd.h>

int getmyhostid ( ) {
return gethostid ( );
}

int getmycpu () {
return mycpu();
}

/* End File: util.c */
```
Compile with:
```
mpcc_r  -qsmp=omp -c util.c
```
And (if desired) create a library archive file with:
```
ar -v -q libutil.a util.o
```

Here's an updated version of the mixed-mode hello world program given in newsletter 228 . It reports the node and CPU number of each MPI task and OpenMP thread:

```
!
! Program: hello_mixed
!
! Hybrid MPI-OpenMP hello world.  Each MPI task spawns a number of
!   - MPI rank
!   - Which host it's running on
!   - Which CPU it's running on
!

program hello_mixed
implicit none

include       'mpif.h'
integer       ierr
integer my_pe,npes,iamt

integer GETMYCPU
integer GETMYHOSTID

! Initialize MPI and get my MPI rank

call mpi_init( ierr )
call mpi_comm_rank(MPI_COMM_WORLD, my_pe, ierr)
call mpi_comm_size(MPI_COMM_WORLD, npes, ierr)

! Report on myself, before starting OpenMP threads

write(6,*) 'MPI rank ', my_pe, ' BEFORE OMP PARALLEL  ',
&            'myhost', getmyhostid (), ' mycpu ', getmycpu ()

!\$omp parallel private(iamt)

write(6,*) 'MPI rank ',my_pe,' OMP thread ',iamt,
&            'myhost', getmyhostid (), ' mycpu ', getmycpu ()

!\$omp end parallel

call mpi_finalize(ierr)

stop
end
```

Compile this as follows, assuming that the util archive created above has been copied into ~/lib :

```
mpxlf_r  -qsmp=omp -ohello.mixed hello.mixed.f -l util -L ~/lib
```

A LoadLeveler script ("hello.mixed.ll") to run this on icehawk follows. It runs 1 MPI task per node on each of 4 nodes. Due to the "omp_set_num_threads(4)" call in the program, each MPI task will in turn create 4 OpenMP threads, leading to a total of 16 threads distributed across 4 nodes (one might expect one thread per processor):

```
#!/bin/ksh
#
# @ output = \$(Executable).\$(Cluster).\$(Process).out
# @ error = \$(Executable).\$(Cluster).\$(Process).err
# @ wall_clock_limit=2100
# @ job_type = parallel
# @ node = 4
# @ network.mpi = css0,not_shared,US
# @ class = Qlarge
# @ node_usage = not_shared
# @ queue

cd ~/Progs/Hello
/bin/time ./hello.mixed
```

The output of one run follows. I've sorted it, which groups the information according to MPI task:

```
icehawk\$ cat hello.mixed.ll.2886.0.out
sort

MPI rank  0  BEFORE OMP PARALLEL  myhost -1062731510  mycpu  2
MPI rank  0  OMP thread  0 myhost -1062731510  mycpu  1
MPI rank  0  OMP thread  1 myhost -1062731510  mycpu  0
MPI rank  0  OMP thread  2 myhost -1062731510  mycpu  3
MPI rank  0  OMP thread  3 myhost -1062731510  mycpu  0
MPI rank  1  BEFORE OMP PARALLEL  myhost -1062731493  mycpu  2
MPI rank  1  OMP thread  0 myhost -1062731493  mycpu  2
MPI rank  1  OMP thread  1 myhost -1062731493  mycpu  3
MPI rank  1  OMP thread  2 myhost -1062731493  mycpu  1
MPI rank  1  OMP thread  3 myhost -1062731493  mycpu  1
MPI rank  2  BEFORE OMP PARALLEL  myhost -1062731491  mycpu  1
MPI rank  2  OMP thread  0 myhost -1062731491  mycpu  1
MPI rank  2  OMP thread  1 myhost -1062731491  mycpu  0
MPI rank  2  OMP thread  2 myhost -1062731491  mycpu  3
MPI rank  2  OMP thread  3 myhost -1062731491  mycpu  0
MPI rank  3  BEFORE OMP PARALLEL  myhost -1062731489  mycpu  3
MPI rank  3  OMP thread  0 myhost -1062731489  mycpu  3
MPI rank  3  OMP thread  1 myhost -1062731489  mycpu  1
MPI rank  3  OMP thread  2 myhost -1062731489  mycpu  1
MPI rank  3  OMP thread  3 myhost -1062731489  mycpu  2
```

Each MPI task indeed ran on a different node, but the OpenMP threads were not distributed perfectly across the CPUs. In fact, there was an idle CPU on each node.

Out of curiosity, I tried a run with 2 MPI tasks on each of 2 nodes, with each MPI task spawning 5 OpenMP threads, for a total of 20 OpenMP threads across 2 nodes:

```
MPI rank  0  BEFORE OMP PARALLEL  myhost -1062731510  mycpu  0
MPI rank  0  OMP thread  0 myhost -1062731510  mycpu  0
MPI rank  0  OMP thread  1 myhost -1062731510  mycpu  2
MPI rank  0  OMP thread  2 myhost -1062731510  mycpu  1
MPI rank  0  OMP thread  3 myhost -1062731510  mycpu  2
MPI rank  0  OMP thread  4 myhost -1062731510  mycpu  3
MPI rank  1  BEFORE OMP PARALLEL  myhost -1062731510  mycpu  1
MPI rank  1  OMP thread  0 myhost -1062731510  mycpu  1
MPI rank  1  OMP thread  1 myhost -1062731510  mycpu  2
MPI rank  1  OMP thread  2 myhost -1062731510  mycpu  1
MPI rank  1  OMP thread  3 myhost -1062731510  mycpu  2
MPI rank  1  OMP thread  4 myhost -1062731510  mycpu  3
MPI rank  2  BEFORE OMP PARALLEL  myhost -1062731505  mycpu  2
MPI rank  2  OMP thread  0 myhost -1062731505  mycpu  0
MPI rank  2  OMP thread  1 myhost -1062731505  mycpu  3
MPI rank  2  OMP thread  2 myhost -1062731505  mycpu  2
MPI rank  2  OMP thread  3 myhost -1062731505  mycpu  2
MPI rank  2  OMP thread  4 myhost -1062731505  mycpu  2
MPI rank  3  BEFORE OMP PARALLEL  myhost -1062731505  mycpu  3
MPI rank  3  OMP thread  0 myhost -1062731505  mycpu  1
MPI rank  3  OMP thread  1 myhost -1062731505  mycpu  1
MPI rank  3  OMP thread  2 myhost -1062731505  mycpu  3
MPI rank  3  OMP thread  3 myhost -1062731505  mycpu  3
MPI rank  3  OMP thread  4 myhost -1062731505  mycpu  2
```

The MPI tasks were assigned 2 per node, as expected, but again, for this toy program, the OpenMP threads were not distributed optimally.

If any users have mixed-mode or OpenMP-only programs and if you have any concerns that the system isn't distributing the work effectively, you might insert a few calls to mycpu().

### SCICOMP

Three ARSC staff members attended SCICOMP in Oak Ridge a couple weeks ago. The world of IBM systems is new to many ARSC users, and as we're all coming up to speed, we thought an introduction to the organization and conference was in order. Icehawk users might consider attending one of the upcoming events:

From the SCICOMP web site:

The IBM SP Scientific Computing User Group, SCICOMP, is an international organization of scientific/technical users of IBM SP systems. The purpose of SCICOMP is to share information on software tools and techniques for developing scientific applications that achieve maximum performance and scalability on SP systems, and to gather and provide feedback to IBM to influence the evolution of SP systems. To further these goals, SCICOMP will hold periodic meetings which will include technical presentations by both IBM staff and SP users, focusing on recent results and advanced techniques. Discussions of problems with SP systems will be held and aimed at providing advisory notifications to IBM staff detailing the problems and potential solutions. Mailing lists will be maintained to further open discussions and provide the sharing of information, expertise, and experience that scientific and technical applications developers need but may not easily find anywhere else.

The schedule of upcoming meeting is:

There will be a SCICOMP Birds-of-a-Feather session, at SC 2001, in Denver, on Thursday, November 15, from 5:30 - 7:30 in room A102/104/106. All members and interested parties are invited to attend.

SCICOMP 5 will be held at Daresbury Laboratory, near Manchester, England, in the spring of 2002.

SCICOMP 6 will be hosted by the National Energy Research Scientific Computing Center (NERSC) of Lawrence Berkeley Laboratory, in Berkeley, California, in the fall of 2002.

Jim Long provided these notes from a DPCL tutorial:

DPCL is a C++ class library whose API enables a program to dynamically insert instrumentation code patches, or "probes", into an executing program. The program that uses DPCL calls to insert probes is called the "analysis tool", while the program that accepts and runs the probes is called the "target application." In addition to its API, the DPCL system consists of daemon processes that attach themselves to the target application process(es) to perform most of the actual work, and an asynchronous communication and callback facility that connects the class library to the daemon processes.

Advantages of dynamic probes include the ability to examine long running numerical codes before they complete, or codes that never end, like database servers. Because a tool built with DPCL can visualize a target application's behavior as it runs, it can avoid storing large amounts of trace or other collected data.

Complete information & documentation on DPCL can be found at http://oss.software.ibm.com/developerworks/opensource/dpcl/index.html

### Quick-Tip Q & A

```
A:[[ I want to duplicate a file, _including_ its modification time, which
[[ is actually important to me.  But I'm stuck:
[[
[[    -  "mv"  retains the mod time, but doesn't duplicate the file
[[    -  "cp"  duplicates the file, but changes the time
[[
[[ For example:
[[
[[   sgi> ll my.old.file
[[     -rw-r-----    1 myname   mygrp    2112 Nov  8  1981 my.old.file
[[
[[   sgi> cp my.old.file junk
[[   sgi> ll my.old.file junk
[[     -rw-------    1 myname   mygrp    2112 Oct 11 06:46 junk
[[     -rw-r-----    1 myname   mygrp    2112 Nov  8  1981 my.old.file
[[
[[   sgi> mv my.old.file junk
[[   sgi> ll junk
[[     -rw-r-----    1 myname   mygrp    2112 Nov  8  1981 junk
[[
[[ What's a guy supposed to do?

#
# Thanks to John Metzner, Richard Griswold, and Brad Chamberlain
#

cp -p my.old.file junk

Example:

chilkoot\$ ls -l test*
-rw-------   1 usr      grp        2136 Aug 29 19:56 test
chilkoot\$ cp -p test test.sav
chilkoot\$ ls -l test*
-rw-------   1 usr      grp        2136 Aug 29 19:56 test
-rw-------   1 usr      grp        2136 Aug 29 19:56 test.sav

# Richard's:

The 'touch' command has a nifty option that tells it to set one file's
access and modify times to the same value as another file's access and
modify times.

cp my.old.file junk
touch -r my.old.file junk

Note that cp will modify my.old.file's access time, so after you run
the commands the access time for both files will be the time at which
you ran the cp command.  The modify time for both files will be the
original modify time for my.old.file.  The change time for my.old.file
will be the original value, and the change time for junk will be the
time that touch was run.

Q: I'm writing a few specialized MPI functions for eventual distribution
as a small library.  How can I keep the underlying MPI sends/recvs
from conflicting with MPI calls elsewhere in the calling program?
Any other gotchas to worry about?

I'd appreciate it if you could just point me in the right
direction...
```

[[ Answers, Questions, and Tips Graciously Accepted ]]

Current Editors:
 Ed Kornkven ARSC HPC Specialist ph: 907-450-8669 Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678 Arctic Region Supercomputing Center University of Alaska Fairbanks PO Box 756020 Fairbanks AK 99775-6020
E-mail Subscriptions:
Archives: