ARSC T3E Users' Newsletter 128, October 24, 1997

Retirement Date for ARSC's T3D, November 18

With this newsletter, we have produced one issue per node of the ARSC T3D. I wonder if we will retire yukon after another 104 issues.

At any rate, it's true. The ARSC T3D, which, on February 28, 1994, was the sixth T3D brought on-line, is to be disconnected. This will occur during test time on November 18th.

Users are welcome to continue using the T3D through the 18th, but if you haven't done so yet, please port to yukon immediately, verify that your code compiles and runs correctly, and make cross-comparisons. ARSC is providing special training to help you (next article), and consultants are available at: consult@arsc.edu.

Classes Scheduled: "Migration from T3D to T3E" and "T3E Tools"

Migration from T3D to T3E

November 5th, (1/2 day 1300-1600), in Fairbanks

This course is targeted at current users of the Cray T3D system who will be migrating codes to the new Cray T3E system. The course will discuss the following:

  • Architectural similarities/differences.
  • Software similarities/differences.
  • How to exploit Cray T3E performance features.
  • Case studies from users who have already completed the migration process.

At the end of this course attendees will be in a better position to migrate software from the Cray T3D to the Cray T3E.

T3E Tools: How to Investigate Your Code's Behaviour on the T3E

December 10th, (1/2 day + 1/2 day hands-on session), in Fairbanks

The purpose of this course is to describe the range of tools currently available on the ARSC T3E system. A number of installed tools can help users develop efficient, portable, parallel programs. These include:

  • Totalview
  • PAT
  • Apprentice
  • VAMPIR

The course will cover the following aspects of behaviour:

  • debugging code when problems are encountered,
  • investigating and improving single-node performance,
  • parallel performance and how to achieve scalability.

Several case studies from current users will illustrate these tools and the potential benefits in terms of reduced programming effort and improved performance.

This course is aimed at existing users of the ARSC parallel systems. Users are encouraged to bring problems for discussion and investigation during the afternoon hands-on session, when ARSC consultants will be available to help with individual problems.

To register for these and other ARSC courses, please follow the instructions at our training page:

http://www.arsc.edu/user/Classes.html

Case Study: Streams and 450MHz Clock

[ Don Morton of the University of Montana contributes this article. It shows the performance of one code measured across ARSC's recent T3E upgrade and with streams on and off. ]

Application: Adaptive finite element code for modeling two-phase oil/water flow in porous media.

Using a "heterogeneous" SPMD approach, one processor is responsible for dynamically modifying the mesh at specified intervals (typically every time step) to place smaller elements in regions of current activity and larger elements elsewhere. The mesh modification step also includes a load-balancing algorithm for processor assignments.

The mesh is then partitioned out to other processors for a parallel solution of the next time step. Because the equations are nonlinear, a single time step may involve several iterations for convergence. When the parallel solution has been obtained, it's shipped back to the "mesh modification processor" for another round of mesh modification and solution of the next time step.

Results, below, show the time (in seconds) required for two of the time steps. The code was run on 16 PEs (1 PE for mesh modification, 15 for the parallel solution). PVM is used for message-passing. In the first time step there were 5320 unknowns (degrees of freedom); in the second, 5336.


               Mesh Mod Time     Mesh Transfer Time     Total Time Step
 Time Step        (secs)              (secs)               (secs)
 =========     =============     ==================     ==============
[A]
     1             1.33                 0.27                9.08
     2             1.32                 0.26                7.08
  
[B]
     1             2.52                 0.26                11.80
     2             2.36                 0.27                10.07
  
[C]
     1             3.02                 0.34                14.92
     2             2.86                 0.34                11.36


[A] ARSC 450MHz T3E with streams enabled,

[B] ARSC 450MHz T3E without streams enabled,

[C] ARSC T3E before upgrade: (300MHz, without streams enabled).

------------------------------------------------------------------------

The speed-ups for configurations [A] and [B] above, relative to [C], are:

 Time Step    Relative Run-Time    Speed-Up
 =========    =================    ========
[A]
     1             61%              1.64x
     2             62%              1.61x

[B]
     1             79%              1.27x
     2             89%              1.12x
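If you'd like to reproduce the relative run-times above from the raw timings, a quick awk sketch will do it. The figures below are the total time-step seconds from the first table; the speed-up is just the reciprocal of the relative run-time.

```shell
#!/bin/sh
# Recompute the relative run-times from the total time-step seconds
# in the first table.  [A] streams on, [B] streams off, [C] 300MHz.
awk 'BEGIN {
  a1 = 9.08;  b1 = 11.80; c1 = 14.92   # time step 1
  a2 = 7.08;  b2 = 10.07; c2 = 11.36   # time step 2
  printf "[A] step 1: %.0f%%   [A] step 2: %.0f%%\n", 100*a1/c1, 100*a2/c2
  printf "[B] step 1: %.0f%%   [B] step 2: %.0f%%\n", 100*b1/c1, 100*b2/c2
}'
# prints 61% / 62% for [A] and 79% / 89% for [B]
```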

[ Editors' request: If you have similar data for your application, please send it in! ]

PE3.0 Now Default on Yukon

Programming Environment 3.0 (PE3.0) for the T3E was made the default during test time on October 14, 1997. By simply recompiling all objects and re-linking using the default, you will obtain the latest compilers and libraries, which correct some known system stability problems.


<< All yukon users are requested to recompile and re-link all code. >>

For the purpose of comparison, you may restore the previous PE by executing the command:


  module switch PrgEnv PrgEnv.old

Having done so, you can switch back to the PE3.0 default by executing:


  module switch PrgEnv.old PrgEnv

PAT: PE3.0's Low-Overhead MPP Performance Analysis Tool

Many T3D/T3E users will already be familiar with the features of Apprentice, a tool which provides comprehensive information about the performance of programs through a powerful graphical interface. Apprentice presents low-level information, reporting on each line of the source code and providing a great deal of information on particular aspects of performance, such as memory loads and stores, and calling trees.

The down-side of Apprentice is the cost of this information. The run-time of programs increases by a factor of 3-4 when compiled with Apprentice enabled, and the graphical interface requires a good connection to the host system(*).

PAT is a simpler, less intrusive, text-based alternative which provides information on basic aspects of performance, such as:

  • time spent in subroutines and functions,
  • load balance across processors at a subroutine and function level,
  • simple call sequence and tracing.

PAT can be used with C, C++, and Fortran 90 programs. Users simply re-link with the PAT run-time library and pass the pat.cld loader directive as shown below.


       yukon% cc *.o -l pat pat.cld -o a.out

       yukon% CC *.o -l pat pat.cld -o a.out

       yukon% f90 *.o -l pat pat.cld -o a.out

After the resulting executable is run, a pdf.<nnnn> file will have been generated, from which the user can extract specific information using either PAT's command-line mode or its interactive mode.

The command line interface allows users to employ PAT within batch jobs to obtain performance information on long runs with realistic data sets. The interactive mode allows general investigation of program performance.

Our experience with a few applications shows little impact on overall program run-time compared against normal execution. It should be noted that PAT periodically samples the program counter so results are a statistical estimate of the time spent in each subroutine/function. This is the same approach taken by HPM, the hardware performance monitor, on Cray vector machines, and in most cases is perfectly adequate.

The following output shows two of PAT's capabilities on a 19 PE run of a locally parallelised seismic hazard code.

CODE PROFILING

pat -p gives a profile of the code showing the percent of time spent in each subroutine. The final column gives a measure of the statistical confidence of the sampled measurements.


    yukon% pat -p ./a.out pdf.75028
    
    Profile Information:
                                            Percent         90% Conf.
                                                            Interval
    
    PSTEP                                     55%           0.1
    GTRAN                                     9%            0.1
    MPI_Bcast                                 8%            0.1
    IGTRAN                                    3%            0.0
    RADFG                                     3%            0.0
    RADBG                                     3%            0.0
    RADF3                                     2%            0.0
    RADB3                                     2%            0.0
    RFFTF1                                    1%            0.0
    D3FFT                                     1%            0.0
    _T3EMPI_unbalanced_tree                   1%            0.0
    D2FFT                                     1%            0.0

LOAD BALANCE

pat -h <subroutine name> gives information on the load balance across the processors for the named subroutine. The following output shows that processors 16-18 have relatively little work to do for PSTEP.


    yukon% pat -h PSTEP ./a.out pdf.74322
    
    Load Balance Histogram for PSTEP
         --------------------------------------------------
       0 ****************************************
       1 ****************************************
       2 ****************************************
       3 ****************************************
       4 ****************************************
       5 ****************************************
       6 ****************************************
       7 ****************************************
       8 ****************************************
       9 ****************************************
      10 ****************************************
      11 ****************************************
      12 ****************************************
      13 ****************************************
      14 ****************************************
      15 ****************************************
      16 *
      17 *
      18 *
         --------------------------------------------------
A number of routines also exist with which users can create trace files of specific activity, to satisfy individual reporting requirements. For details: man pat .


  * While the graphical interface of Apprentice requires a good 
    connection to the host system, it does have a -r option which
    prints a textual report to standard output (stdout) that summarizes
    program performance information.  This report gives the total time
    of inclusive and exclusive subroutines and breaks down the time
    into time spent in overhead, in parallel work, and in I/O.

Quick-Tip Q & A



A: {{ In Unix, how can you list the contents of the current directory,
      and the contents of every subdirectory within the current
      directory, but not descend any deeper into the depths of the tree
      of subdirectories? }}

   # This was a tricky one!  You might use "find . -prune ..." or
   # "ls `ls`", but to be concise, try this:

   ls *
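To see what this does without risking a real directory, here is a throwaway demonstration (the file and directory names are made up):

```shell
#!/bin/sh
# Build a scratch tree, run "ls *" in it, then clean up.
mkdir -p /tmp/lsdemo/sub/deeper
touch /tmp/lsdemo/top.txt /tmp/lsdemo/sub/inner.txt /tmp/lsdemo/sub/deeper/hidden.txt
( cd /tmp/lsdemo && ls * )
# "ls *" names top.txt, then lists sub's contents (deeper, inner.txt),
# but never shows sub/deeper/hidden.txt -- it stops one level down.
rm -rf /tmp/lsdemo
```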


Q: Fortran 90 programmers: what does this do, and why?

cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
       program my_other_car_is_a_dog_sled
       integer j(10)

       do 100 i = 1, 10
         j(i) = i
 100   continue

       do 200 i = 1. 10
         j(i) = j(i) + i * 100
 200   continue

       print*, j
       end
cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc

[ Answers, questions, and tips graciously accepted. ]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.