ARSC T3D Users' Newsletter 24, February 24, 1995

ARSC T3D Upgrades

In the next month we will be upgrading the T3D Programming Environment (libraries, tools and compilers) from P.E. 1.1 to P.E. 1.2.

We are also planning to install CF90 and C++ for the T3D in the next few months. A description of CF90 was given in newsletter #23 and a very complete description of C++ for the T3D is given on the CRI World-Wide Web page:


  http://www.cray.com/PUBLIC/product-info/sw/C++/C++.html
I am interested in hearing from users who want to use the CF90 and C++ products as soon as they are available.

Upgrade to the T3D Memory

On February 7th, ARSC upgraded the memory on each PE from 2MW to 8MW. If any users have questions about this, please contact Mike Ess. We have run into problems with the user limit on mppcore size, which is now set too low for the 8MW nodes. We will be changing to a larger default size in the future, but if you run into the message:

  mppexec: user UDB core limit reached, mppcore dump terminated
then call us and we will increase your mppcore limit. The T3D can create tremendously large mppcore files, and the storage for these files is charged against your service unit allocation. If you're not going to use an mppcore file, delete it as soon as you no longer need it. These files appear, under the name "mppcore", in the directory of the executable that aborted. Y-MP jobs that abort produce a core file called "core".

New Shmem Manuals from CRI

I finally got in copies of the new SHMEM manuals:

  SN-2516 SHMEM Technical Guide for Fortran Users
  SN-2517 SHMEM Technical Guide for C Users
For those of you who are ARSC users (i.e., you have a userid on denali), I will send you a hardcopy if you e-mail me your U.S. Mail address.

FMlib Available on ARSC's T3D

From at least four places I received the following notice:

  >                 FM - Fast Messaging on the Cray T3D
  >                 -----------------------------------
  > 
  >   The FM library contains fast messaging primitives which exploit
  > special features of the Cray T3D hardware to provide very low latency
  > for short messages.  FM provides an order of magnitude lower latency
  > than Cray's PVM and achieves performance comparable to SHMEM get while
  > providing a message-passing interface.
  > 
  >   The FM library provides two distinct sets of primitives which make
  > use of the T3D fetch-and-increment and atomic swap hardware
  > respectively.  The fetch-and-increment primitives are optimized for
  > the lowest possible latency and are suitable for situations with light
  > communication traffic.  The atomic swap primitives eliminate output
  > contention at the cost of slightly higher latency, but by doing so can
  > deliver robust performance even for heavy and unbalanced traffic loads.
  > 
  > 
  >   Release 1.0 of the library is now available from our WWW server:
  > 
  >     http://www-csag.cs.uiuc.edu/projects/communication/t3d-fm.html
  > 
  >   The library can also be accessed on the T3D at Pittsburgh
  > Supercomputing Center (mario.psc.edu) from the directory: 
  > 
  >     /usr/users/9/karamche/FM-1.0
  > 
  >   The release contains the source files (C and Assembly), the library
  > (libFM.a), and an include file which provides the function prototypes.
  > The release directory also contains the usage manual and a copy of a
  > paper analyzing the performance of the two sets of FM primitives.  The
  > latter is a preliminary version of the paper which will appear in the
  > Proceedings of the 22nd International Symposium on Computer
  > Architecture (ISCA'95). 
  > 
  > 
  >   Please contact me if you have any questions, comments or problems.
  > 
  >                         Vijay Karamcheti
  >                         vijayk@cs.uiuc.edu
  >                         (217) 244-7116
  > 
  >                         Concurrent Systems Architecture Group
  >                         Department of Computer Science
  >                         University of Illinois at Urbana-Champaign        
  >                         1304 W. Springfield Avenue
  >                         Urbana, IL 61801
  > 
I downloaded the files from the WWW server and installed the newest version (1.1) on denali. The required include file, "fm.h", is in the directory:

  /usr/local/examples/mpp/include
and the library itself is:

  /usr/local/examples/mpp/lib/libFM.a
I have gotten some FM test cases running and will describe some of the library routines next week. On the WWW server:

  http://www-csag.cs.uiuc.edu
there are several interesting papers:

FM: Fast Messaging on the Cray T3D, by Vijay Karamcheti and Andrew Chien

This paper has a nice figure comparing latencies for PVM, SHMEM, and the FM library routines.

A Comparison of Architectural Support for Messaging on the TMC CM-5 and the Cray T3D, by Vijay Karamcheti and Andrew Chien

This paper has a good description of T3D hardware support for message passing.

Libsci Routines and Improved Speed on the T3D

At ARSC we use Linpack in various forms as part of our regression tests: we solve the Linpack problem for various sizes, in C and Fortran, on a single PE and on multiple PEs. These tests take a while to run, especially for the large matrices that now fit in an 8MW node, so I've looked into speeding up the public domain version of Linpack with calls to the libsci versions of the BLAS1 routines. All of the results below are for the single-PE case.

The standard Linpack source as distributed by netlib comes with all of the Linpack subroutines, BLAS1 subroutines, and auxiliary subroutines needed to run the benchmark. This means the unmodified benchmark calls the Fortran versions of the BLAS1 routines rather than the optimized versions in the library /mpp/lib/libsci.a. If we delete the BLAS1 routines from the Fortran source, the loader resolves the benchmark's BLAS1 calls with the libsci routines instead. The BLAS1 routines called by the Linpack benchmark are:


  SDOT   - computes the dot product of two vectors
  ISAMAX - finds the position of the maximum absolute value
           of a vector
  SAXPY  - computes the vector sum of a vector and a scalar
           multiple of another vector
  SSCAL  - scales a vector by a scalar
Using the same technique as in Newsletter #22, we can time both the provided Fortran versions of these routines and the libsci versions. The results are in the table below, with one pair of Fortran/libsci columns per routine. For each routine the libsci version eventually becomes faster than the straight Fortran version, and eventually more than twice as fast. The crossover points, where the libsci routine becomes faster, are marked in the table with asterisks.
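
For readers who want to try this themselves, here is a minimal sketch of the kind of timing loop involved, in the spirit of the Newsletter #22 technique. It is not the exact harness used for the table below: it assumes the CRI real-time clock function IRTC() and the 150 MHz (6.67 ns) T3D clock period, and the repetition count and the Mflop formula (2*N operations per SDOT call) are illustrative choices:

      PROGRAM TIMDOT
C     Hedged sketch: time repeated SDOT calls on one PE and report an
C     approximate Mflop rate.  Link against the Fortran BLAS1 source
C     or against /mpp/lib/libsci.a to compare the two versions.
      INTEGER NMAX, NREP
      PARAMETER (NMAX = 2600, NREP = 100)
      REAL X(NMAX), Y(NMAX), S, SDOT
      REAL TSEC, RATE
      INTEGER N, I, IREP, IT0, IT1, IRTC
C     Fill the operand vectors.
      DO 10 I = 1, NMAX
         X(I) = 1.0
         Y(I) = 2.0
   10 CONTINUE
C     Time SDOT for a few vector lengths.
      DO 30 N = 100, NMAX, 500
         IT0 = IRTC()
         DO 20 IREP = 1, NREP
            S = SDOT(N, X, 1, Y, 1)
   20    CONTINUE
         IT1 = IRTC()
C        Convert 6.67 ns clock periods to seconds, then to Mflops
C        (2*N floating point operations per SDOT call).
         TSEC = REAL(IT1 - IT0) * 6.67E-9
         RATE = (2.0 * REAL(N) * REAL(NREP)) / TSEC / 1.0E6
         WRITE (6,*) N, S, RATE
   30 CONTINUE
      END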

  length Fortran libsci  Fortran libsci  Fortran libsci  Fortran libsci

     0  0.00   0.00    0.00   0.00    0.00   0.00    0.00   0.00
     1  0.17   0.10    0.28   0.20    0.23   0.13    0.23   0.10
     2  0.43   0.27    0.45   0.36    0.50   0.26    0.44   0.17
     3  0.51   0.39    0.64   0.53    0.75   0.36    0.60   0.25
     4  0.73   0.50    0.86   0.70    0.87   0.49    0.83   0.35
     5  0.80   0.63    1.02   0.34    0.70   0.59    1.04   0.47
    10  1.59   1.22    1.69   0.82    1.29   1.12    1.88   0.87
    20  2.51   1.94    2.59   1.55    2.07   1.80    3.17   1.53
    40 *3.63***3.63*   3.59   3.01   *3.03***3.08*   4.72   2.70
    50  4.04   4.58    3.86   3.50    3.34   3.94    4.93   3.34
    60  4.17   5.30   *4.07***4.15*   3.57   4.24    5.35   3.71
    70  4.68   3.43    3.21   4.90    3.83   3.88    5.02   4.28
    80  4.66   3.70    3.32   5.49    3.92   4.14    5.24   4.65
    90  4.83   4.25    3.41   5.95    4.14   4.92    5.51   5.08
   100  4.90   4.46    3.52   6.42    4.18   4.99    5.81   5.56
   200  5.31   6.82    3.93   9.67    4.74   7.12   *7.17***8.98*
   300  5.73   8.51    4.04  11.49    5.01   8.72    7.75  11.36
   400  5.84   9.62    4.16  12.98    5.16   9.56    8.05  12.95
   500  5.91  10.70    4.21  13.98    5.22  10.49    8.31  14.22
   600  5.96  11.22    4.26  14.70    5.29  10.96    8.45  15.30
   700  6.02  11.89    4.28  15.05    5.33  11.46    8.56  16.23
   800  6.05  12.28    4.31  15.58    5.37  11.76    8.66  16.82
   900  6.05  12.74    4.33  15.94    5.38  12.03    8.74  17.29
  1000  6.08  12.82    4.35  16.21    5.40  12.18    8.81  17.88
  1500  6.15  14.06    4.39  17.23    5.46  13.00    8.96  19.43
  2000  6.20  14.59    4.42  17.82    5.50  13.35    9.05  20.44
  2500  6.23  15.06    4.43  18.13    5.52   8.92    9.11  21.12
  2600  6.24  15.09    4.43  18.17    5.53  13.68    9.13  21.21
For large Linpack problems it looks like substituting the libsci routines is a good way to speed up the regression tests. However, we must remember that in Gaussian elimination, which is what the Linpack benchmark does, the algorithm updates a progressively smaller submatrix, so the vectors handed to the BLAS1 routines become progressively shorter as the algorithm executes; the sketch below illustrates this, and the table that follows it gives the timings of the Linpack benchmark for increasing problem sizes on one PE.
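
To make this concrete, here is a simplified sketch (mine, not taken from the benchmark source) of the elimination loop at the heart of the Linpack SGEFA routine, with the zero-pivot check and pivot-index bookkeeping omitted:

      SUBROUTINE FACTOR(A, LDA, N)
C     Simplified sketch of the Linpack SGEFA elimination loop.  The
C     ISAMAX, SSCAL and SAXPY calls at step K all operate on vectors
C     of length N-K (or N-K+1), so the average vector length seen by
C     the BLAS1 routines is only about half the problem size.
      INTEGER LDA, N, K, J, L, ISAMAX
      REAL A(LDA,N), T
      DO 30 K = 1, N-1
C        Find the pivot row in column K (vector of length N-K+1).
         L = ISAMAX(N-K+1, A(K,K), 1) + K - 1
C        Swap the pivot element up to the diagonal.
         T = A(L,K)
         A(L,K) = A(K,K)
         A(K,K) = T
C        Compute the multipliers (vector of length N-K).
         T = -1.0/A(K,K)
         CALL SSCAL(N-K, T, A(K+1,K), 1)
C        Update each remaining column (vectors of length N-K).
         DO 20 J = K+1, N
            T = A(L,J)
            A(L,J) = A(K,J)
            A(K,J) = T
            CALL SAXPY(N-K, T, A(K+1,K), 1, A(K+1,J), 1)
   20    CONTINUE
   30 CONTINUE
      END

Because the vector lengths shrink as the elimination proceeds, the full benchmark cannot speed up as much as the individual routine timings above might suggest.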

  Problem     Linpack Mflop rates    
    size
            Unmodified  Version using
            Version     Libsci BLAS1 routines

     1           .19      .13       
     2           .45      .30      
     3           .90      .30     
     4          1.50     1.04    
     5          2.21     1.28   
    10          5.04     2.92  
    20          8.72     6.07 
    40         11.04    10.72              
    50        *10.93****12.29*            
    60         11.10    13.49            
    70         11.21    13.38           
    80         11.43    13.31          
    90         11.55    13.46         
   100         11.64    13.75        
   200         12.08    17.30       
   300         12.03    19.66      
   400         11.84    21.20     
   500         11.60    22.26    
   600         11.36    23.03   
   700         11.21    23.61  
   800         11.14    24.14 
   900         10.97    24.40              
  1000         10.83    24.69             
  1500         10.15    25.62            
  2000          9.83    26.18           
  2500          9.70    24.61          
  2600          9.70    26.49         
These optimizations are fine, but we are not done yet; in the next newsletter we'll solve the same problems using the new LAPACK routines.

List of Differences Between T3D and Y-MP

The current list of differences between the T3D and the Y-MP is:
  1. Data type sizes are not the same (Newsletter #5)
  2. Uninitialized variables are different (Newsletter #6)
  3. The effect of the -a static compiler switch (Newsletter #7)
  4. There is no GETENV on the T3D (Newsletter #8)
  5. Missing routine SMACH on T3D (Newsletter #9)
  6. Different Arithmetics (Newsletter #9)
  7. Different clock granularities for gettimeofday (Newsletter #11)
  8. Restrictions on record length for direct I/O files (Newsletter #19)
  9. Implied DO loop is not "vectorized" on the T3D (Newsletter #20)
I encourage users to e-mail in differences that they have found, so we all can benefit from each other's experience.
Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.