ARSC T3D Users' Newsletter 30, April 7, 1995

The Switch to the 1.2 Programming Environment

On Monday, April 10th, we will switch from the 1.1 Programming Environment (PE) to the 1.2 PE. For the past two weeks the 1.2 products have been available to users through a special path. Starting April 10th the roles reverse: the 1.2 products will be in the default locations and the 1.1 PE products will be reached through a special pathname. After April 10th, if you run into problems that you believe are due to the new 1.2 PE, please contact Mike Ess.

Besides being generally faster and having fewer bugs, the 1.2 PE also has new features and some differences from the 1.1 PE. Here is a brief list of these new features and differences:

  1. New Fortran 90
    1. Limited implementation of the Fortran 90 Standard
    2. Doesn't Support Craft Fortran Extensions
    3. Does implement a 32 bit float (REAL*4)
    4. Implements many Fortran 90 intrinsics for shared and private data... (I'll list the entire Fortran 90 feature list in the next newsletter)
  2. New assembler CAM and new C++ compiler
  3. New libm routines
    1. New 32 and 64 bit routines for SQRT, SIN, COS, EXP, ALOG, TAN, COT, ACOS, ASIN and ATAN.
    2. New error handling policy (see below)
  4. New libsci routines
    1. New BLAS routines, CGERC, CGERU, SSYR, SSPR, CHER, CHPR, CSYR, CSPR
    2. New mixed radix FFT routines, CCFFT, CSFFT, and SCFFT
    3. New PBLAS routines
  5. New SHMEM 32 bit routines (libsma.a)
  6. Totalview can now start up both the debugger and the program in one step
  7. Apprentice can now save results to a file
  8. New versions of "network" PVM and the T3D PVM that follow PVM 3.3 from ORNL

Using the new Fortran 90 Compiler on the T3D

With the new 1.2 Programming Environment there are two new products for the T3D, a Fortran 90 compiler and a C++ compiler. I believe that in the future CRI will use these two products as their vehicles for introducing new features into the T3D/T3E world. One of the features in the Fortran 90 compiler that will not be added to Craft Fortran is the 32 bit float implemented with the REAL*4 declaration. On ARSC's 8MW node this allows the user to have a single array of 14.5 million 32 bit reals.
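
As a rough check on that array size, here is a minimal C sketch of the arithmetic. It is an illustration only: the function names are hypothetical, and it assumes that 1 MW is 2^20 words and that a word is 8 bytes.

```c
/* Bytes of memory on a node, given its size in megawords.
 * Assumption: 1 MW = 2^20 words and one word = 8 bytes. */
unsigned long node_bytes(unsigned long megawords)
{
    return megawords * 1024UL * 1024UL * 8UL;
}

/* Bytes needed by an array of n elements, each elem_size bytes long. */
unsigned long array_bytes(unsigned long n, unsigned long elem_size)
{
    return n * elem_size;
}
```

Under those assumptions an 8 MW node holds 67,108,864 bytes. An array of 14.5 million 32 bit reals needs 58,000,000 bytes and fits; the same array as 64 bit reals would need 116,000,000 bytes and would not.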

One of our users, Dr. Alan Wallcraft, a scientist with the Naval Research Center in Stennis, Mississippi, sends us this description of using the 32-bit float with Fortran 90:


  > Mike,
  > 
  > REAL*4 seems to work ok, but it would be much cleaner to have a -r4
  > switch (or something) that set the default REAL to REAL*4 (and DOUBLE
  > PRECISION to REAL*8).  I had no problems using the REAL4 data type from
  > PVM to transfer 32-bit numbers (just declare data to be transferred as
  > REAL*4 and use REAL4 in place of REAL8).
  > 
  > In general, you can convert to 32-bit by:
  > 
  > a) Unless using IMPLICIT NONE, add  IMPLICIT REAL*4 (A-H,O-Z) to all
  > programs and routines.
  > 
  > b) Replace REAL with REAL*4 throughout.
  > 
  > c) Either use the -d p switch to f90, or replace DOUBLE PRECISION with
  > REAL*8 (I have only tested -d p).
  > 
  > d) You may get mixed-type errors from INTRINSICs, e.g.  replace:
  >      REAL*4 R4
  >      R4 = MAX( 1.0, R4 )
  > with
  >      REAL*4 R4
  >      R4 = MAX( 1.0, REAL(R4) )
  > or with
  >      REAL*4 R4,ONE
  >      ONE = 1.0
  >      R4  = MAX( ONE, R4 )
  > because 1.0 is a REAL*8 constant.
  > 
  > e) Mixing REAL*4 and REAL*8 or INTEGER in COMMON is probably a bad idea.
  > You may already have encountered this on other 32-bit machines with
  > REAL and DOUBLE PRECISION in the same COMMON, but note that here INTEGER's
  > are 64-bit so they should also not be mixed with REAL*4's.  On a typical
  > 32-bit machine mixing REAL*4 and INTEGER in the same COMMON would be ok,
  > but mixing REAL*8 and INTEGER would be dangerous.  The only safe strategy
  > is: ONE DATA TYPE PER COMMON.
  > 
  > f) Replace REAL8 with REAL4 in PVM calls.
  > 
  > g) Use a compile line such as:
  > env TARGET=cray-t3d /mpp/bin/f90 -V -f fixed -O 2 -d p file.f
  > I discovered the "env" command today, it is particularly useful when
  > compiling via make (e.g. FORTRAN = env TARGET=cray-t3d /mpp/bin/f90)
  > 
  > I think the resulting program will probably still work with cf77, except
  > that REAL*4's will be treated as REAL (i.e. REAL*8) so the PVM REAL4's
  > would need to be put back as REAL8's.
  > 
  > Alan.

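Alan's point (e) comes down to alignment. The following is a minimal C sketch of the bookkeeping, not anything taken from the compiler: it assumes that the members of a COMMON block are laid out back to back with no padding and that an item should start at an offset that is a multiple of its own size. The function names are hypothetical.

```c
/* Walk a list of member sizes (in bytes) laid out back to back, as assumed
 * for a COMMON block, and return the index of the first member whose offset
 * is not a multiple of its own size. Returns -1 if every member is aligned. */
int first_misaligned(const int *sizes, int n)
{
    int offset = 0;
    int i;
    for (i = 0; i < n; i++) {
        if (offset % sizes[i] != 0)
            return i;
        offset += sizes[i];
    }
    return -1;
}

/* Two-member case: returns 1 if the second member ends up misaligned. */
int pair_misaligned(int first_size, int second_size)
{
    int sizes[2];
    sizes[0] = first_size;
    sizes[1] = second_size;
    return first_misaligned(sizes, 2) != -1;
}
```

With a 4-byte REAL*4 followed by an 8-byte INTEGER or REAL*8, the second item lands at offset 4 and pair_misaligned(4, 8) returns 1. Keeping one data type per COMMON means the question never comes up.
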
Using the new C++ Compiler on the T3D

There is now a C++ compiler available for the T3D. It is derived from AT&T's cfront product, so it produces a C program that then goes through the C compiler. Like any new product it has shortcomings when compared to the mature existing compiler. The makefile I used to access the compiler is something like:

  CC = /mpp/bin/CC
  CFLAGS =  -O2 -X 1 -c
  CLD = mppldr 
  CLDFLAGS = -lC -lfi
  
  .c.o:
          $(CC) -c $(CFLAGS) $<
  
  main:        $(SUBS)
          $(CLD) $(CLDFLAGS) -o main $(SUBS)
The manpage for CC on denali gives some of the command line options available. To get the C programs running using the C++ compiler, I usually need to add the following include files:

  #include <stdlib.h>    /* get exit(), ... */
  #include <malloc.h>
  #include <stream.h>    /* get printf( ... ), ... */
This is the first release of the C++ compiler for the T3D and it probably has some bugs to be found and corrected. One of my C regression tests which runs OK on the C compiler fails on the C++ compiler with the message:

  /mpp/bin/CC -c -O2 -X 1 -c main.c
  CRI C++: compiler limit exceeded: "main.c", line 152: blocks too deeply nested
  1 error
  Make: "/mpp/bin/CC -c -O2 -X 1 -c main.c": Error code 1
As the product matures these problems will be fixed. To do some rudimentary timings, I add the following line to a C++ program:

  extern "C" double second();
and link in the object file made from this C function:

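  /* Return the current time in seconds, read from the real-time clock. */
  double second()
  {
    double junk;
    fortran IRTC();                  /* Fortran routine returning clock ticks */
    junk = IRTC( ) / 150000000.0;    /* ticks / (150 million ticks per second) */
    return( junk );
  }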
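
For readers who want to try the same timing idea away from the T3D, here is a portable C sketch. It is not the routine above: it swaps IRTC for the standard clock() function, so it reports processor time rather than the real-time clock, and the function names are hypothetical.

```c
#include <time.h>

/* Convert a tick count to seconds for a clock with the given tick rate.
 * With ticks_per_second = 150000000.0 this is the division used in second(). */
double ticks_to_seconds(double ticks, double ticks_per_second)
{
    return ticks / ticks_per_second;
}

/* Portable stand-in for second(): processor time used so far, in seconds. */
double second_portable(void)
{
    return ticks_to_seconds((double)clock(), (double)CLOCKS_PER_SEC);
}
```

The usual pattern is to call the timer before and after the code of interest and subtract the two values.
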
I have a C version of the old whetstone program from Netlib at Oak Ridge National Labs and I use it as part of my regression tests for the C compiler. With a few changes, it goes through the C++ compiler and I tested the effect of some of the optimization switches on both compilers.

  Performance of the Whetstone Benchmark
  (in whetstones, bigger is better):

                          Compilers on the T3D
                   -----------------------------------
  compiler option   C++   /mpp/bin/ccnew   /mpp/bin/cc
  ---------------  -----  --------------   -----------
      none         31779       36146          32095
       -O          31779       36145          32095
      -O1          52447       70004          53028
      -O2          54314       68959          53578
      -O3          52091       69021          53951
C++ has many more features than just being used as a C compiler but we'll cover those later.
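
To read the whetstone table as speedups rather than raw scores, the arithmetic can be sketched in C as follows. The helper names are hypothetical and the only inputs are the numbers in the table.

```c
/* Ratio of an optimized score to the unoptimized score for one compiler.
 * Whetstone scores are "bigger is better", so a ratio above 1 is a gain. */
double speedup(double optimized, double baseline)
{
    return optimized / baseline;
}

/* Best of the three optimized scores (-O1, -O2, -O3) for one compiler. */
double best_of3(double o1, double o2, double o3)
{
    double best = o1;
    if (o2 > best)
        best = o2;
    if (o3 > best)
        best = o3;
    return best;
}
```

With the table's numbers, the best optimized score is roughly 1.7 times the unoptimized score for C++ and for /mpp/bin/cc, and roughly 1.9 times for /mpp/bin/ccnew. The -O switch by itself makes no measurable difference for any of the three.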

Using the new Math Intrinsics (libm.a) on the 1.2 PE

One of the tests I've run on the new 1.2 PE is the ELEFUNT package of Cody and Waite, distributed by ORNL through Netlib. This package tests the accuracy and behavior of the Fortran intrinsics (SQRT, SIN, COS, EXP, ALOG, TAN, COT, ACOS, ASIN and ATAN). The new 1.2 PE intrinsics are generally more accurate as measured by the ELEFUNT package. ELEFUNT also calls the intrinsics with values that should produce a floating point error, either because the intrinsic is not defined for that input or because a reasonably accurate result cannot be produced for it. For such inputs the 1.1 PE intrinsics would return an IEEE NaN or an Infinity; in the 1.2 PE the intrinsics abort with a floating point interrupt message. There are proponents of both strategies, but I just want users to know that the strategy CRI implements has changed.
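
The difference between the two strategies can be sketched in C. This is an illustration only and not the libm code: the square root is a simple Newton iteration standing in for a library SQRT, the function names are hypothetical, and the "strict" version reports the bad argument with a return code where the 1.2 PE library would abort the program.

```c
#include <math.h>   /* for the NAN macro only; no libm functions are called */

/* Square root by Newton's method for finite v >= 0 (stand-in for SQRT). */
static double newton_sqrt(double v)
{
    double x, next;
    int i;
    if (v == 0.0)
        return 0.0;
    x = (v > 1.0) ? v : 1.0;            /* start at or above the root */
    for (i = 0; i < 2000; i++) {
        next = 0.5 * (x + v / x);
        if (next == x)                  /* no further change: converged */
            break;
        x = next;
    }
    return x;
}

/* "Quiet" strategy (as in the 1.1 PE): an invalid argument yields a NaN. */
double sqrt_quiet(double v)
{
    if (v < 0.0 || v != v)              /* negative, or already a NaN */
        return NAN;
    return newton_sqrt(v);
}

/* "Strict" strategy (as in the 1.2 PE): an invalid argument is refused.
 * Here the refusal is a return value of -1; the real library aborts. */
int sqrt_strict(double v, double *result)
{
    if (v < 0.0 || v != v)
        return -1;
    *result = newton_sqrt(v);
    return 0;
}

/* Helper: 1 if the strict version accepts v, 0 if it refuses. */
int strict_accepts(double v)
{
    double r;
    return sqrt_strict(v, &r) == 0;
}
```

A program that relied on testing results for NaN after the fact fits the quiet strategy; under the strict strategy the bad argument has to be caught before the call.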

A Description of What's in BENCHLIB for the T3D, the Utility pref (lib_util.a)

At the Denver CUG, Jeff Brooks gave a 2 hour description of Single PE Optimization on the CRAY T3D that generated a lot of interest in BENCHLIB. Basically BENCHLIB is a collection of very fast unsupported routines for the T3D. Jeff has let ARSC distribute the BENCHLIB source through the ARSC ftp server. Using the anonymous login id you can find the BENCHLIB source and PostScript files of his paper and slides from the Denver CUG. The ftp site is ftp.arsc.edu and the files are in the directory pub/submissions:

  -rw-r--r-- 1 ftp  other  1634893 Mar 24 13:33 cug_slides.ps.Z
  -rw-r--r-- 1 ftp  other   144153 Mar 24 13:17 libbnch.tar.Z
  -rw-r--r-- 1 ftp  other    69077 Mar 24 13:33 t3d_opt.ps.Z
BENCHLIB consists of 6 libraries of optimized routines:

  lib_32.a
  lib_scalar.a
  lib_util.a
  lib_random.a
  lib_tri.a
  lib_vect.a
These libraries are available on Denali in the directory:

  /usr/local/examples/mpp/lib
and sources in:

  /usr/local/examples/mpp/src
The library lib_util.a contains a utility called pref that can "prefetch" values into cache before they are used, which gives the programmer a way to manage cache from software. The utility is described in Jeff's paper. Below is the example given in Jeff's paper for multiplying "bars" of 3 by 3 matrices.


  call mm3v4p( a, b, c, 1024 )

  .
  .
  .

  subroutine mm3v4p( a, b, c, n )
  real a( 3, 3, n ), b( 3, 3, n ), c( 3, 3, n )
  do l1 = 1, n, 32
    l2 = min( l1+31, n )
    call pref( 72, a( 1, 1, l1 ), b( 1, 1, l1 ) )
    do l = l1, l2
    c(1,1,l)=a(1,1,l)*b(1,1,l)+a(1,2,l)*b(2,1,l)+a(1,3,l)*b(3,1,l)
    c(2,1,l)=a(2,1,l)*b(1,1,l)+a(2,2,l)*b(2,1,l)+a(2,3,l)*b(3,1,l)
    c(3,1,l)=a(3,1,l)*b(1,1,l)+a(3,2,l)*b(2,1,l)+a(3,3,l)*b(3,1,l)
    c(1,2,l)=a(1,1,l)*b(1,2,l)+a(1,2,l)*b(2,2,l)+a(1,3,l)*b(3,2,l)
    c(2,2,l)=a(2,1,l)*b(1,2,l)+a(2,2,l)*b(2,2,l)+a(2,3,l)*b(3,2,l)
    c(3,2,l)=a(3,1,l)*b(1,2,l)+a(3,2,l)*b(2,2,l)+a(3,3,l)*b(3,2,l)
    c(1,3,l)=a(1,1,l)*b(1,3,l)+a(1,2,l)*b(2,3,l)+a(1,3,l)*b(3,3,l)
    c(2,3,l)=a(2,1,l)*b(1,3,l)+a(2,2,l)*b(2,3,l)+a(2,3,l)*b(3,3,l)
    c(3,3,l)=a(3,1,l)*b(1,3,l)+a(3,2,l)*b(2,3,l)+a(3,3,l)*b(3,3,l)
    enddo
  enddo
  end
Without the call to pref, the subroutine call takes .00162 seconds; with the call it takes .00127 seconds, about 22% less time, which is a good speedup for a one line change.
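
The loop structure of the example, a block of 32 matrices with one prefetch per block, can be sketched in C for readers who want to experiment off the T3D. This is only a sketch under stated assumptions: prefetch_stub is a hypothetical do-nothing stand-in for the BENCHLIB routine, the other names are hypothetical too, and each matrix is stored row by row rather than in Fortran's column order.

```c
#define BAR 32   /* matrices per block, as in "do l1 = 1, n, 32" */

/* Hypothetical stand-in for the BENCHLIB prefetch call: does nothing here.
 * On the T3D the real routine would bring the block into cache. */
static void prefetch_stub(const double *a, const double *b, int count)
{
    (void)a; (void)b; (void)count;
}

/* c[l] = a[l] * b[l] for n 3x3 matrices stored one after another,
 * 9 doubles per matrix, row-major within each matrix. */
void mm3_blocked(const double *a, const double *b, double *c, int n)
{
    int l1, l, i, j, k;
    for (l1 = 0; l1 < n; l1 += BAR) {
        int l2 = (l1 + BAR < n) ? l1 + BAR : n;   /* one past block end */
        prefetch_stub(a + 9 * l1, b + 9 * l1, l2 - l1);
        for (l = l1; l < l2; l++)
            for (i = 0; i < 3; i++)
                for (j = 0; j < 3; j++) {
                    double sum = 0.0;
                    for (k = 0; k < 3; k++)
                        sum += a[9*l + 3*i + k] * b[9*l + 3*k + j];
                    c[9*l + 3*i + j] = sum;
                }
    }
}

/* Self-check: multiply identity matrices by b and confirm c equals b.
 * Returns 1 on success, 0 on a mismatch or if n is outside 1..100. */
int mm3_selfcheck(int n)
{
    static double a[9 * 100], b[9 * 100], c[9 * 100];
    int l, i;
    if (n < 1 || n > 100)
        return 0;
    for (l = 0; l < n; l++)
        for (i = 0; i < 9; i++) {
            a[9*l + i] = (i % 4 == 0) ? 1.0 : 0.0;  /* diagonal: 0, 4, 8 */
            b[9*l + i] = (double)(l + i);
            c[9*l + i] = -1.0;
        }
    mm3_blocked(a, b, c, n);
    for (i = 0; i < 9 * n; i++)
        if (c[i] != b[i])
            return 0;
    return 1;
}
```

Counting 45 floating point operations per 3 by 3 product, the 1024-matrix call performs 46,080 operations, so the two timings quoted above correspond to roughly 28 and 36 Mflops.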

Additional Information from the Denver CUG

I have copies of the following papers that I can mail to anyone who contacts me:
  1. Operation of the CRAY T3D as a National Facility, Michael Brown, University of Edinburgh, (paper and slides)
  2. Steve Johnson's Hardware Report (slides only)
  3. T3D at ECMWF (paper only)

ARSC T3D Future Upgrades

We are testing the upgrade to the T3D 1.2 Programming Environment (libraries, tools and compilers.) The new compilers and libraries are available for users to try out. The command on denali:

  news compilers | more
will provide the details of the names and paths to these new versions. This upgrade includes the new Fortran 90 and C++ compilers for the T3D. If you have any problems or find any differences please contact Mike Ess.

List of Differences Between T3D and Y-MP

The current list of differences between the T3D and the Y-MP is:
  1. Data type sizes are not the same (Newsletter #5)
  2. Uninitialized variables are different (Newsletter #6)
  3. The effect of the -a static compiler switch (Newsletter #7)
  4. There is no GETENV on the T3D (Newsletter #8)
  5. Missing routine SMACH on T3D (Newsletter #9)
  6. Different Arithmetics (Newsletter #9)
  7. Different clock granularities for gettimeofday (Newsletter #11)
  8. Restrictions on record length for direct I/O files (Newsletter #19)
  9. Implied DO loop is not "vectorized" on the T3D (Newsletter #20)
  10. Missing Linpack and Eispack routines in libsci (Newsletter #25)
I encourage users to e-mail in differences that they have found, so we all can benefit from each other's experience.
Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.