ARSC HPC Users' Newsletter 213, February 9, 2001

ARSC Advanced Technology Panel, Abstracts

ARSC is hosting several prominent members of the HPC community for discussions and talks concerning directions in advanced technology. The following talks are open to interested parties:

HPCMP Requirements
Cray Henry
Tuesday, Feb. 13, 2 pm, 401 IARC

Mr. Cray Henry will provide a 45-minute overview of the High Performance Computing Modernization Program (HPCMP), a description of HPCMP requirements, and a summary of the benchmark results submitted by vendors. He will also present a high-level discussion of the Programming Environment and Training (PET) activity, which is intended to gather the best ideas, algorithms, and software tools emerging from the national high performance computing infrastructure and deploy them into the DoD user community.

Strategic Planning at NCAR
Steve Hammond
Wednesday, Feb. 14, 9:15 am, 401 IARC

During 2000, Dr. Steve Hammond chaired a committee at the National Center for Atmospheric Research (NCAR) that developed a strategic plan for high performance scientific simulation. This comprehensive plan addresses the way high performance computers are deployed at NCAR, the management of code development projects, the need for more formalized software engineering processes, algorithmic research, data management, processing, and visualization, and the need for infrastructure to facilitate research among geographically distributed scientists and resources. This presentation will discuss the strategic plan and the new efforts underway at NCAR to implement it.

NERSC: A Supercomputer Facility for the Next Millennium
Bill Kramer
Wednesday, Feb. 14, 10:15 am, 401 IARC

The National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory is one of the nation's most powerful unclassified computing resources and a world leader in accelerating scientific discovery through computation. NERSC's vision is to combine leading-edge facilities with intellectual services to accomplish breakthroughs in computational science. This presentation will give an overview of how this balance is struck and how NERSC uses the concept of a "Service Architecture" to complement the traditional "System Architecture" approach to supercomputing facilities. Bill Kramer, Deputy Director of NERSC, will describe the computational resources at NERSC, in particular the new 3.8 Tflop/s IBM SP, as well as storage systems with a 1 petabyte capacity and Linux clusters.

The Accelerated Strategic Computing Initiative (ASCI) Program
Jim McGraw
Thursday, Feb. 15, 9:00 am, 401 IARC

ASCI is the computational portion of the Stockpile Stewardship Program (SSP). The SSP is responsible for creating new plans that will ensure the safety and reliability of the US nuclear stockpile as it ages. The SSP relies heavily on the development of sophisticated computational models to help evaluate this safety and reliability. This talk will give a broad overview of the ASCI objectives and describe how those objectives are translated into specific needs. Issues Dr. Jim McGraw will address include: achieving adequate performance on specific application problems, transition plans between system generations, remote access capabilities for other DP labs, post-processing capabilities, power consumption, and floor space management.

Toward Predictions of Arctic Environmental Change
Wieslaw Maslowski
Thursday, Feb. 15, 9:00 am, 401 IARC

Understanding the short- to long-term variability of ice extent and mass in the Arctic Ocean, the fresh water budget, and deep water formation and export is crucial to assessing the sensitivity of this region to global climate change and its role in driving climate variability. Coupled ice-ocean modeling of the pan-Arctic region constitutes an integral part of multi-agency supported efforts to advance arctic science. A major part of the Naval Postgraduate School's Arctic Modeling effort involves high resolution modeling of the Arctic Ocean and sea ice with prescribed, realistic atmospheric forcing. Dr. Wieslaw Maslowski will discuss recent results of model comparisons at different spatial resolutions and the role of increased resolution in the representation of ocean and ice physics and thermodynamics from small to large scales.

SV1 GFLOP Contest Winner

It ain't easy to beat a GFLOP on a single SV1 processor!

This made judging the contest easy. In the end, I got only one successful entry, and that was from our illustrious co-editor, Guy Robinson. Thus, I'll be treating Guy to a cool draught of his choosing, in some local pub of my choosing, most likely overlooking the frozen, yet still mighty, Chena River.

Back to GFLOPS.

From the Benchmarker's Guide to CRAY SV1 Systems: "Each [SV1] CPU has 2 add and 2 multiply functional units, allowing each CPU to deliver 4 floating point results per CPU clock cycle. With the 300 MHz CPU clock the peak floating point rate per CPU is 1.2 Gflops/s."

Here are Guy's comments on the program:

  • The dataset in the inner loop is long enough to get a vector going.

  • The datasets referenced in the outer iterative loop all fit in cache.

  • There are lots of floating point operations on the same data items, which are all in cache: a high-order polynomial. (Remember, the goal was flops!) The lower the order of the polynomial, the lower the flops rating.

  • If we'd made the array dimension and the loop sizes the same, i.e., 120, we'd get a clash on memory access. Sized to 120, the code gets only 1019.14 M floating ops/CPU second. (See the sketch of this padding trick below.)
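
Here's a minimal sketch of the padding idea in that last item, under the
usual Cray PVP rule of thumb that an odd leading dimension keeps strided
accesses from landing on the same memory banks. The program and names
below are ours, for illustration only:

      program pad_sketch
      implicit none
      integer, parameter :: n  = 120      ! extent of the working loops
      integer, parameter :: ld = n + 1    ! padded (odd) leading dimension
      real, dimension(ld,ld) :: a
      integer :: i, j

      ! Consecutive j iterations touch elements ld apart in memory.
      ! With ld odd they spread across the memory banks; with ld = n
      ! (120) they keep revisiting the same few banks.
      do i = 1, n
        do j = 1, n
          a(i,j) = real(i + j)
        enddo
      enddo

      print *, a(n,n)
      end program pad_sketch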

Here's the timing, using hpm:


CHILKOOT$ f90 -O3 cache_test.f
CHILKOOT$ export NCPUS=1
CHILKOOT$ hpm ./a.out
 STOP   executed at line 252 in Fortran routine 'CACHE_TEST'
 CPU: 810.644s,  Wallclock: 812.186s,  3.1% of 32-CPU Machine
 Memory HWM: 734439, Stack HWM: 49799, Stack segment expansions: 0
Group 0:  CPU seconds   :  810.64491   CP executing     :   243193473792

Million inst/sec (MIPS) :   20.87      Instructions     :    16915248327
Avg. clock periods/inst :   14.38
% CP holding issue      :   92.60      CP holding issue :   225187690857
Inst.buffer fetches/sec :    0.00M     Inst.buf. fetches:         500539
Floating adds/sec       :  520.57M     F.P. adds        :   421997857775
Floating multiplies/sec :  513.70M     F.P. multiplies  :   416429414904
Floating reciprocal/sec :    3.49M     F.P. reciprocals :     2832200008
Cache hits/sec          :   15.43M     Cache hits       :    12504952732
CPU mem. references/sec :   28.19M     CPU references   :    22850939545

Floating ops/CPU second : 1037.77M
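
A quick check on these numbers: the "Floating ops/CPU second" figure is
evidently the sum of the add, multiply, and reciprocal rates, 520.57M +
513.70M + 3.49M = 1037.76M, which matches the reported 1037.77M up to
rounding. Against the 1.2 Gflops/s single-CPU peak quoted above, that
is roughly 86% of peak.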

And Guy's program:

      program cache_test

      implicit none


      integer :: idim,jdim
      parameter (idim=121,jdim=121)

      real, dimension(idim,jdim) :: a,c

      integer i,j

      integer iblock,jblock


      integer iter,niter

      niter=20000

      iblock=idim-1
      jblock=jdim-1

      do i=1,iblock
       do j=1,jblock
         a(i,j)=0.0
         c(i,j)=(i+j*idim)/jdim

       enddo
      enddo

      do iter=1,niter

      do i=2,idim-1
       do j=2,jdim-1
         a(i,j)=
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(
     !   c(i,j)*(c(i,j)+1)-1)+1)+1)+1)+1)
     !   +1)+1)+1)-1)+1)+1)
     !   +1)-1)+1)
     !   +1)-1)+1)
     !   +1)-1)+1)
     !   +1)-1)+1)
     !   +1)-1)+1)
     !   +1)-1)+1)
     !   +1)-1)+1)
     !   +1)-1)+1)
     !   +1)-1)+1)
     !   +1

         a(i,j)=a(i,j)+
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(
     !   c(i-1,j)*(c(i,j)+1)-1)+1)+1)+1)+1)
     !   +1)+1)+1)-1)+1)+1)
     !   +1)-1)+1)
     !   +1)-1)+1)
     !   +1)-1)+1)
     !   +1)-1)+1)
     !   +1)-1)+1)
     !   +1)-1)+1)
     !   +1)-1)+1)
     !   +1)-1)+1)
     !   +1)-1)+1)
     !   +1

      a(i,j)=a(i,j)/2.0
         a(i,j)=a(i,j)+
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     (c(i,j)*(c(i,j)+1)+1)+1)+1)+1)+1)
     !     +1)+1)+1)+1)+1)+1)
     !     +1)-1)+1)
     !     +1)+1)+1)
     !     +1)+1)+1)
     !     +1)+1)+1)
     !     +1)-1)+1)
     !     +1)+1)+1)
     !     +1)-1)+1)
     !     +1)+1)+1)
     !     +1)-1)+1)
     !   +1

         a(i,j)=a(i,j)/(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     c(i,j)*(
     !     (c(i,j)*(c(i,j)+1)+1)+1)+1)+1)+1)
     !     +1)+1)+1)+1)+1)+1)
     !     +1)+1)+1)
     !     +1)+1)+1)
     !     +1)+1)+1)
     !     +1)+1)+1)
     !     +1)+1)+1)
     !     +1)+1)+1)
     !     +1)+1)+1)
     !     +1)+1)+1)
     !     +1)+1)+1)
     !   +1)
       enddo
      enddo

      c=a/5.0

      enddo

      write(20) a

      stop
      end

Other notes on performance:

  • I/O and memory bandwidth are often the bottlenecks, not CPU performance.

  • The system-wide average performance achieved on Chilkoot by real user jobs has changed over the months, but is typically about 200 MFLOPS per CPU. The best individual users run consistently at 400-500 MFLOPS per CPU.

  • We run "hpmflop" on Chilkoot to do passive monitoring of performance and can provide historical data to users concerning their own jobs. Contact consult@arsc.edu. Better yet, we encourage you to monitor your own jobs. See the article on "HPM", in:

    /arsc/support/news/hpcnews/hpcnews207/index.xml

  • Many codes do not scale well as the number of processors is increased. For discussion, see the article on Amdahl's Law (also sketched briefly after this list), in:

    /arsc/support/news/hpcnews/hpcnews210/index.xml

  • Cray provides an excellent manual on performance tuning, titled,

    Optimizing Code on Cray PVP Systems, SG-2192

    This is available on ARSC's dynaweb server,

    http://www.arsc.edu:40/

  • Performance depends on a variety of factors... your mileage may vary. ARSC users are always welcome to contact consult@arsc.edu for help understanding or resolving performance issues.
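
For reference, the standard statement of Amdahl's Law (not quoted from
the article above) is that if a fraction p of the work can run in
parallel, the best possible speedup on N processors is
1 / ((1-p) + p/N). A minimal sketch, with a made-up parallel fraction
of 0.95:

      program amdahl
      implicit none
      real, parameter :: p = 0.95     ! hypothetical parallel fraction
      integer :: k, n
      real :: speedup

      do k = 0, 6
        n = 2**k                      ! N = 1, 2, 4, ..., 64
        speedup = 1.0 / ((1.0 - p) + p / real(n))
        print '(a,i3,a,f6.2)', ' N =', n, '   speedup =', speedup
      enddo
      end program amdahl

Even with 95% of the work parallel, the speedup on 64 processors is
limited to about 15.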

OpenMP MASTER vs SINGLE Follow Up

[[ Thanks to Alan Wallcraft of NRL for sending this in. It is additional discussion of the OpenMP constructs, MASTER and SINGLE. ]]

The MASTER and SINGLE constructs look similar but are significantly different. The MASTER construct is exactly equivalent to:


      if (omp_get_thread_num().eq.0) then
      ...
      endif

So only the master thread (thread 0) executes the contents of the construct and all other threads skip it. Note that there is no implied barrier, so whatever the master does to shared variables inside this construct is not necessarily "visible" to other threads until after a subsequent BARRIER (or other synchronization).
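As a quick editors' illustration of that last point (a minimal sketch, not part of Alan's note), the explicit BARRIER below is what guarantees that the other threads see the value stored by the master:

      program master_sketch
      use omp_lib
      implicit none
      integer :: nval

      nval = 0
!$omp parallel shared(nval)
!$omp master
      nval = 42                  ! only thread 0 executes this
!$omp end master
!$omp barrier
      ! without the barrier above, other threads might still see nval = 0
      print *, 'thread', omp_get_thread_num(), 'sees nval =', nval
!$omp end parallel
      end program master_sketch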

SINGLE is a worksharing construct, and all such constructs (and BARRIERs) must be encountered in the same order by all threads, or be encountered by no thread. There is no similar constraint on MASTER. Any one thread (chosen by the implementation) executes the contents of the SINGLE construct and all other threads skip it. By default END SINGLE implies BARRIER, so any changes made to shared variables inside this construct will be "visible" to all threads after END SINGLE. If you don't want a BARRIER, use END SINGLE NOWAIT. It is legal for a compiler to use MASTER to implement SINGLE, and the following would be identical in effect with such a compiler:


       !$omp single                   !$omp master
       ...                            ...
       !$omp end single               !$omp end master
                                      !$omp barrier


       !$omp single                   !$omp master
       ...                            ...
       !$omp end single nowait        !$omp end master

However, most compilers probably implement SINGLE by letting the first thread to arrive at the construct execute its contents. This will be faster than MASTER whenever the master thread arrives later than the first thread. If the contents are inexpensive to compute, though, the difference in performance between SINGLE and MASTER may be very small.

It is confusing that very similar-looking directives have different behaviour with respect to BARRIER. I think OpenMP should always allow WAIT where NOWAIT can now be used, since then at least the default BARRIER could be documented by careful programmers. This was turned down as an addition to the version 2.0 API, but version 2.0 will at least allow in-line comments. So once 2.0 is available I suggest always documenting the barrier with an in-line comment, e.g.:


  !$omp end single !wait

UAF Colloquium Series: Jonah Lee, Feb 15

The UAF Department of Mathematical Sciences and ARSC are jointly sponsoring a Mathematical Modeling, Computational Science, and Supercomputing Colloquium Series.

The schedule and abstracts for the '00-'01 academic year are available at:

http://www.dms.uaf.edu/dms/Colloquium.html

The next presentation:

Computational Plasticity
Dr. Jonah Lee
Department Head, Mechanical Engineering Department
University of Alaska Fairbanks

Date:     Thursday, February 15, 2001
Time:     1:00-2:00 PM
Location: Chapman 106

ABSTRACT

Plasticity is the subfield of the mechanics of materials that deals with permanent deformations, which are usually precursors to the progressive damage and failure of materials. A few examples, drawn from recent research, will first be given of how computational plasticity is used at different geometric scales to gain a basic understanding of the behavior of materials. Software, language, and platform issues will then be discussed. Finally, possible future directions will be considered.

THE SPEAKER

Professor Jonah Lee has been with UAF for more than 17 years. He is currently Professor of Mechanical Engineering and an affiliate faculty member of the Arctic Region Supercomputing Center. His research interests are in computational, theoretical, and experimental mechanics of materials. He has received funding from the National Science Foundation, DOD, Cray Inc., and other agencies for his projects.

ARSC Training, Next Two Weeks

For details on ARSC's spring 2001 offering of short courses, visit:

http://www.arsc.edu/user/Classes.html

Here are the courses available in the next two weeks:

ARSC Tour for New and Prospective Users

Wednesday, Feb 14, 2001, 3-4pm

Visualization with MAYA

Wednesday, Feb 21, 2001 2-4pm

Quick-Tip Q & A


A:[[ I'm tired of waiting ages and ages for my code to recompile when
  [[ I need to change its parameters.  Is there a way to run the code
  [[ again with new array sizes, constants, etc. that's any faster?


## Thanks go to Dr. Nic Brummel of the University of Colorado:

Well, in Fortran there is NAMELIST.  Since Fortran 90 allows dynamic
memory allocation, you can set array sizes as well as parameters.

       Program main

       Integer :: nx, ny, nz
       Real :: param1
       Real, dimension(:,:,:), allocatable :: work
       Namelist /problem_size_namelist/ nx, ny, nz, param1

       Open(9, file='input', status='old')
       Read(9, problem_size_namelist)
       Close(9)

       ! Array sizes now come from the input file, not from the source.
       Allocate(work(nx,ny,nz))
       ...<do your stuff using work, nx, ny, nz, param1>...
       End Program main


<Contents of file "input">
 &problem_size_namelist
   nx     = 128
   ny     = 128
   nz     = 350
   param1 = 6.
/
<end contents of file "input">



## Editor comments:

Other possibilities include command line arguments (see the sketch
below) and reading parameters from files of one's own design rather
than standard NAMELIST files.  Note that with dynamically allocated
arrays, the compiler might miss optimization opportunities that it
would spot with static arrays.
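
Here is a hedged sketch of the command-line approach.
GET_COMMAND_ARGUMENT is standard in newer Fortran; many older compilers
instead provide the nonstandard GETARG extension. The default size of
128 below is made up:

      program cmdline_size
      implicit none
      character(len=32) :: arg
      integer :: nx, ios
      real, allocatable :: a(:)

      ! Read the array size from the first command-line argument.
      call get_command_argument(1, arg)
      read(arg, *, iostat=ios) nx
      if (ios /= 0 .or. nx <= 0) nx = 128    ! fall back to a default

      allocate(a(nx))
      a = 0.0
      print *, 'allocated a with', nx, 'elements'
      end program cmdline_size

Run it as, e.g., ./a.out 256.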



Q: Here's a warm-up exercise Guy gives the students in his class on
   parallel programming.  We thought it might be fun for this
   newsletter:

    Give an example of parallelism in the real world, and discuss
    briefly with respect to concurrency, scalability, locality.


    (For instance, cooking dinner. Yes, multiple cooks can work
    concurrently, but since too many would spoil the broth, it's not
    very scalable. Locality doesn't matter as long as the results
    appear at the same place in the right order.)

[[ Answers, Questions, and Tips Graciously Accepted ]]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.