ARSC T3E Users' Newsletter 149, August 21, 1998

Arctic Science Conference of the AAAS -- October Meeting

The 49th Arctic Science Conference of the American Association for the Advancement of Science will take place on the University of Alaska Fairbanks campus October 25-28, 1998. ARSC staff will be conducting tours of our facility daily at 1 pm during the conference. Details about the conference are available at:

http://www.gi.alaska.edu/aaas/index.html

or by sending email to fnmrf@uaf.edu .

ARSC encourages users to participate in the conference and to submit abstracts for poster sessions.

The 1998 AAAS Arctic Division annual conference has been designed around the theme of international cooperation in arctic research. It will provide a forum for scientists from around the world to come together to discuss important issues concerning global climate change and its impacts in the western arctic. It is widely accepted that such global change will be observed first in the arctic and sub-arctic regions, with serious implications for the rest of the world. The newly established International Arctic Research Center (IARC) will provide state-of-the-art facilities and opportunities for scientists to study these regions.

The conference format consists of two plenary sessions each day featuring internationally-known speakers, including keynote speaker Rita Colwell, Director-designate of the National Science Foundation. Between the morning and afternoon sessions each day there will be a poster session (abstracts invited). The conference will also serve as a Wadati Conference on Global Change; speakers for this conference are noted in the programs. In addition, tours of the IARC building and ARSC will provide participants with the chance to view some of the research facilities available at the University of Alaska Fairbanks.

Parallelizing Codes for the J90

Several users have asked about running in parallel on the J90.

In short, there are a number of easy ways to improve performance. Compiler options will do much of the work, and inspecting performance, using simple measurements or tools, points the way to further optimization.

Compiling

The J90 system is a shared-memory vector system. When looking to optimize code, it is necessary to consider both how well the loops vectorize and how many processors can usefully share the work through the common memory.

Cray originally supported both macrotasking and microtasking; the latest compilers merge the two into the compiler's autotasking capability. (Macrotasking allowed the user to parallelize operations at the subroutine level, microtasking at the loop level.) Both relied on directives, and the user was responsible for the correct parallel execution of the code. Autotasking is, as the name suggests, automatic and, like microtasking, exploits parallelism at the loop level.
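
For readers who have not seen the directives, the sketch below shows the flavor of a loop-level tasking directive. The CMIC$ DO ALL form and its SHARED/PRIVATE clauses are written from memory and may differ between compiler releases, so treat this purely as an illustration and check the f90 documentation before relying on it.

      subroutine scale(a, b, n)
c     Illustration only: a loop-level tasking directive in the
c     CMIC$ style.  Clause syntax may vary by compiler release.
      integer n
      real a(n), b(n)
CMIC$ DO ALL SHARED(a, b, n) PRIVATE(i)
      do i = 1, n
        a(i) = 2.0*b(i) + 1.0
      enddo
      return
      end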

The -O3 compiler option includes autotasking (this can also be requested explicitly using -O with task2 or task3; -O3 sets the task2 level of optimization). As with all options, it is worth applying this only to the expensive areas of code in which the most time is spent; global application to an entire code is not advised. And, as with all optimizations, particularly those which perform major transformations, checking results against unoptimized code for correctness is strongly advised.
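
For example, one might apply -O3 only to the file containing the most expensive routines and compile the rest at a lower level (the file names here are hypothetical):

  chilkoot% f90 -O3 -c solver.f
  chilkoot% f90 -O2 -c support.f
  chilkoot% f90 -o prog solver.o support.o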

For diagnosing codes, a useful compiler option is -r . It produces a listing that reports various information about your code, including which loops were vectorized and which were parallelized.
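
A possible invocation, combining -r2 with -O3 so that tasking information is generated as well (check the f90 man page for the listing options on your compiler release):

  chilkoot% f90 -O3 -r2 prog_do.f

When used at the -r2 level on the following code, the report includes this excerpt: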


  PROG1A    prog_do.f    14:42 Wed Aug12,1998
                                                                   Page 1
  ------
      3                                  program prog1a
      4
      5
      6                                  parameter (nsize=1024*1024*4)
      7
      8
      9                                  real a1,b1,c1,a2,b2,c2
     10
     11
     12                                  common /d1/
                                            a1(nsize),b1(nsize),c1(nsize)
     13
     14                                  common /d2/
                                            a2(nsize),b2(nsize),c2(nsize)
     15
     16
     17
     18
     19                                  iticks=irtc_rate()
     20
     21                                  write(6,*) ' data size is ',nsize
     22
     23             P-- v --------       do n=1,nsize
     24             P   v                  a1(n)=n*1.0
     25             P   v                  b1(n)=n*2.5
     26             P   v                  c1(n)=(nsize-n)*1.0
     27             P   v                  a2(n)=a1(n)
     28             P   v                  b2(n)=b1(n)
     29             P   v                  c2(n)=c1(n)
     30             P-- v ------->       enddo
     31
     32
     33                                  i4b=irtc()
     34

 PROG1A    prog_do.f    14:42 Wed Aug12,1998
                                                                   Page 2
 ------
     35             P-- v --------       do n=1,nsize
     36             P   v
                a2(n)=b2(n)*(b2(n)*(b2(n)*(b2(n)*(b2(n)+1)+1)+1)+1)+1
     37             P-- v ------->       enddo
     38
     39
     40
     41                                  i4c=irtc()
     42
     43                                  write(6,*)  ' best flops is
',9*nsize*(iticks()/(i4c-i4b))
     44                                  write(7,*) a2(nsize)
     45
     46
     47                                  end

This shows that the loops at lines 23 and 35 are both fully vectorized and parallelized. The -r option causes the compiler to print an additional summary report regarding these same loops:

f90 Compiler - 4 messages:


   1) <cf90-6403,Tasking,Line=23> A loop starting at line 23 was tasked.
   2) <cf90-6204,Vector,Line=23> A loop starting at line 23 was vectorized.
   3) <cf90-6204,Vector,Line=35> A loop starting at line 35 was vectorized.
   4) <cf90-6403,Tasking,Line=35> A loop starting at line 35 was tasked.

Inspection

Since the ARSC J90 system is shared by many users, and some timers report only total CPU time, the first thing a user may observe after autotasking is an increase in the total CPU time. The same work, in terms of flops computed, is performed, but management overhead to coordinate several processors has been introduced.

In the example below, the timer irtc is used to get the wall-clock time and to compute the actual Mflop rate (using a manually counted number of loop operations). This is checked by comparing the totals with the counts from the hardware performance monitor, hpm. Care is needed in choosing a suitable time basis and in interpreting the results.
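
As a minimal sketch of the technique (this is not the article's example program, which appears later; the loop here is chosen only so the code is self-contained, and the rate arithmetic is done in floating point to avoid coarse integer truncation):

      program tsketch
c     Sketch: wall-clock timing with irtc()/irtc_rate() and a
c     manually counted operation count (2 flops per iteration here).
      parameter (nsize=1024*1024)
      real a(nsize), b(nsize)

      do n=1,nsize
        b(n)=n*0.5
      enddo

      irate=irtc_rate()
      i4b=irtc()
      do n=1,nsize
        a(n)=2.0*b(n)+1.0
      enddo
      i4c=irtc()

c     Do the division in floating point, not integer arithmetic.
      t=real(i4c-i4b)/real(irate)
      write(6,*) ' elapsed seconds ',t
      write(6,*) ' Mflop/s         ',2.0*real(nsize)/t/1.0e6
      write(7,*) a(nsize)
      end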

The aim is to use parallelism to reduce the overall wall-clock time.

More Advanced Tools: atexpert

A graphical tool, atexpert, can be used to predict the parallel performance of a code. Compile with both -O3 and -eX, run the resulting binary to generate performance measurements, and then run atexpert to see how your code is expected to perform.


  chilkoot% f90 -O3 -eX -o prog_do_at prog_do.f
  chilkoot% ./prog_do_at
    data size is  4194304
    best flops is  113246208
  chilkoot% atexpert

This will generate a predicted performance graph, extrapolated from the data gathered by running the instrumented code on a single processor. The following graph (figure 1) is such a prediction, and was produced by atexpert:

Figure 1: predicted parallel performance graph produced by atexpert (image available in the web edition of this newsletter).

As the graph shows, the simple example code below achieves a very high level of parallel performance because it is perfectly parallel. Real codes are likely to tail off much more quickly, and atexpert allows users to investigate the parallel performance of each subroutine so that poorly performing areas of code can be improved.

One advantage of multiprocessor shared memory systems is that a performance improvement can be obtained by parallelizing only the computationally intensive parts of the code.
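
As a rough illustration of why (standard Amdahl-style arithmetic, not a measurement from chilkoot): if 80 percent of a code's time is spent in loops that task perfectly and 20 percent remains serial, the best possible speedup on 4 processors is 1/(0.2 + 0.8/4) = 2.5, a worthwhile gain even though only part of the code was touched.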

While there is much debate on the need to ensure that a single code does not hog the entire memory while using only one processor, partial parallelization can be beneficial both to the user running in parallel and to other users, who gain access to resources sooner. As with all optimization, users should concentrate their effort on those routines which take the most time and perform relatively badly.

Applying compiler options is only a small step. Code modification might be needed, and atexpert offers observations on which routines are most likely to be improved. Sometimes the algorithm itself must be replaced.

An Example.

The following code is taken from one of my class examples on how to optimize code for the T3E. This is a simple, naturally parallel loop which both vectorizes and can be tasked across several processors.


      program prog1a

c     A simple, naturally parallel example: both loops vectorize
c     and can be tasked across several processors.

      parameter (nsize=1024*1024*4)

      real a1,b1,c1,a2,b2,c2

      common /d1/  a1(nsize),b1(nsize),c1(nsize)

      common /d2/ a2(nsize),b2(nsize),c2(nsize)

c     irtc_rate() returns real-time clock ticks per second,
c     irtc() the current value of the real-time clock.
      iticks=irtc_rate()

      write(6,*) ' data size is ',nsize

c     Initialize the data.
      do n=1,nsize
        a1(n)=n*1.0
        b1(n)=n*2.5
        c1(n)=(nsize-n)*1.0
        a2(n)=a1(n)
        b2(n)=b1(n)
        c2(n)=c1(n)
      enddo

      i4b=irtc()

c     Timed loop: counted as 9 floating point operations
c     per iteration.
      do n=1,nsize
       a2(n)=b2(n)*(b2(n)*(b2(n)*(b2(n)*(b2(n)+1)+1)+1)+1)+1
      enddo

      i4c=irtc()

c     Note the integer division, which makes the estimate coarse.
      write(6,*)  ' best flops is ',9*nsize*(iticks/(i4c-i4b))
      write(7,*) a2(nsize)

      end

If this is compiled with -O2 we get:


  chilkoot% f90 -o prog_do_o2 -O2 prog_do.f
  chilkoot% hpm ./prog_do_o2
    data size is  4194304
    best flops is  113246208
  Group 0:  CPU seconds   :       0.58      CP executing     :      116900662
  
  Million inst/sec (MIPS) :       6.86      Instructions     :        4011169
  Avg. clock periods/inst :      29.14
  % CP holding issue      :      95.45      CP holding issue :      111586438
  Inst.buffer fetches/sec :       0.00M     Inst.buf. fetches:           1796
  Floating adds/sec       :      57.41M     F.P. adds        :       33554657
  Floating multiplies/sec :      35.88M     F.P. multiplies  :       20971910
  Floating reciprocal/sec :       0.00M     F.P. reciprocals :              1
  Cache hits/sec          :       0.01M     Cache hits       :           5527
  CPU mem. references/sec :      57.49M     CPU references   :       33602005
  
  Floating ops/CPU second :      93.29M
  chilkoot%

Moving up to -O3 we get:


  chilkoot% f90 -O3 -o prog_do_o3  prog_do.f
  chilkoot% hpm ./prog_do_o3
    data size is  4194304
    best flops is  490733568
  Group 0:  CPU seconds   :       0.60      CP executing     :      119366230
  
  Million inst/sec (MIPS) :       7.85      Instructions     :        4687445
  Avg. clock periods/inst :      25.47
  % CP holding issue      :      94.74      CP holding issue :      113082842
  Inst.buffer fetches/sec :       0.00M     Inst.buf. fetches:           2207
  Floating adds/sec       :      56.22M     F.P. adds        :       33554658
  Floating multiplies/sec :      35.14M     F.P. multiplies  :       20971911
  Floating reciprocal/sec :       0.00M     F.P. reciprocals :              1
  Cache hits/sec          :       0.01M     Cache hits       :           3462
  CPU mem. references/sec :      56.34M     CPU references   :       33628114
  
  Floating ops/CPU second :      91.36M

The program's own (wall-clock based) measure of the flop rate improves because the default value of NCPUS on chilkoot is 4, so the autotasked loops now run across four processors. If we set NCPUS to 3, the performance is:


  chilkoot% setenv NCPUS 3
  chilkoot% hpm ./prog_do_o3
    data size is  4194304
    best flops is  377487360
  Group 0:  CPU seconds   :       0.59      CP executing     :      117534552
  
  Million inst/sec (MIPS) :       7.96      Instructions     :        4680314
  Avg. clock periods/inst :      25.11
  % CP holding issue      :      94.69      CP holding issue :      111291466
  Inst.buffer fetches/sec :       0.00M     Inst.buf. fetches:           2074
  Floating adds/sec       :      57.10M     F.P. adds        :       33554658
  Floating multiplies/sec :      35.69M     F.P. multiplies  :       20971905
  Floating reciprocal/sec :       0.00M     F.P. reciprocals :              1
  Cache hits/sec          :       0.01M     Cache hits       :           3462
  CPU mem. references/sec :      57.21M     CPU references   :       33621009
  
  Floating ops/CPU second :      92.78M
  chilkoot%

Note that the actual elapsed times are as follows:


  chilkoot% env NCPUS=4 time ./bprog_do_o3
             seconds          clocks
  elapsed   12.76160      1276159892
  user      46.45744      4645744272
  sys        1.18272       118272061

  chilkoot% env NCPUS=3 time ./bprog_do_o3
             seconds          clocks
  elapsed   16.70126      1670126109
  user      46.99182      4699182046
  sys        0.89066        89066264

  chilkoot% env NCPUS=2  time ./bprog_do_o3
             seconds          clocks
  elapsed   24.13021      2413021230
  user      46.36825      4636825171
  sys        0.73992        73992258

  chilkoot% env NCPUS=1 time ./bprog_do_o3
             seconds          clocks
  elapsed   46.82797      4682797159
  user      46.19941      4619941115
  sys        0.44435        44435070

The above shows good speedup in the wall-clock times. However, these results took advantage of a time when the system was relatively idle. Trying a larger value of NCPUS during the same, relatively idle, period does not yield such good speedups since, unlike the T3E and other MPP systems, processors are not dedicated to users but are shared with system and other user activity.


  chilkoot% env NCPUS=12 time ./bprog_do_o3
             seconds          clocks
  elapsed    5.58927       558926937
  user      45.74897      4574896788
  sys        1.70961       170961040
  
  chilkoot% env NCPUS=8 time ./bprog_do_o3
             seconds          clocks
  elapsed    6.58672       658671930
  user      46.48904      4648903562
  sys        0.74683        74682794

Note that when trying to use a large number of processors on an active system, contention with other users results in less than optimal speedups. On the ARSC J90, which has 12 processors, users are limited to a maximum of 4 processors at present. Users should determine which number of processors actually gives the best performance for their code and set NCPUS accordingly (with setenv NCPUS) in their scripts, as in the fragment below.
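
For example, a short csh fragment in a job script might look like this (the binary name is the one built above; the processor count is whatever testing shows to be best):

  setenv NCPUS 3
  ./prog_do_o3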

Documentation.

Introductory documentation on autotasking can be found online in the manual pages for the f90 compiler and for the atexpert tool. atexpert includes a demo which is worth looking at to see the different outputs possible. The Cray document SR-2182, A Guide to Parallel Vector Applications, is also good reading on the various tools and can be found online at the ARSC web site for access by ARSC users.

Center for Research on Parallel Computation -- Newsletter

> The Spring/Summer 1998 issue of Parallel Computing Research, the
> quarterly newsletter of the Center for Research on Parallel
> Computation, is now available at:
>
>   http://www.crpc.rice.edu/CRPC/newsletters/sum98/
>
> Previous issues and articles can be found at:
>
>   http://www.crpc.rice.edu/CRPC/newsletters/index.html
>
> If you have any difficulties accessing materials, please contact Kathy
> El-Messidi at elmessy@rice.edu. If you do not have a Web browser, write
> Kathy at the same address to request specific articles from the list
> below this message.
>
> To subscribe or unsubscribe, mail requests to pcr@cs.rice.edu .

Co-Array Fortran Paper Available

[ We received the following announcement from John Reid. ]

The paper defining Co-Array Fortran is available, and will be published in the next issue of Fortran Forum.

This is the abstract:

Co-Array Fortran, formerly known as F--, is a small extension of Fortran 95 for parallel processing. A Co-Array Fortran program is interpreted as if it were replicated a number of times and all copies were executed asynchronously. Each copy has its own set of data objects and is termed an image. The array syntax of Fortran 95 is extended with additional trailing subscripts in square brackets to give a clear and straightforward representation of any access to data that is spread across images.

References without square brackets are to local data, so code that can run independently is uncluttered. Only where there are square brackets, or where there is a procedure call and the procedure contains square brackets, is communication between images involved.

There are intrinsic procedures to synchronize images, return the number of images, and return the index of the current image.

We introduce the extension; give examples to illustrate how clear, powerful, and flexible it can be; and provide a technical definition.

Significant recent changes involve synchronization and I/O.
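
As a taste of the syntax, here is a minimal sketch of our own (it is not taken from the paper, so details may differ from the final definition) showing a co-array declaration, a cross-image reference, and the intrinsics mentioned in the abstract:

  program caf_sketch
    ! Illustration only (not from the paper): each image has its own
    ! copy of x; square brackets select another image's copy.
    real :: x(10)[*]
    integer :: me, np

    me = this_image()      ! index of this image
    np = num_images()      ! total number of images

    x(:) = real(me)        ! purely local work: no square brackets
    call sync_all()        ! wait until every image has finished

    if (me == 1) then
       ! read x(1) from the last image
       write(*,*) 'x(1) on image', np, ' is ', x(1)[np]
    end if
  end program caf_sketch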

The paper is available as the report RAL-TR-1998-060, by ftp from: matisa.cc.rl.ac.uk

in the file: pub/reports/nrRAL98060.ps.gz

Parallel Programmer Sought for Position at UAF

> POSTDOCTORAL FELLOW
> PETROLEUM DEVELOPMENT LABORATORY
> UNIVERSITY OF ALASKA FAIRBANKS
>
> The Petroleum Development Laboratory at the University of Alaska
> Fairbanks is seeking a postdoctoral fellow with an interest in
> massively parallel computations and large-scale petroleum reservoir
> simulations. The candidate will participate in NSF-funded parallel
> programming research projects developing parallel data and computation
> distribution algorithms and computer-aided parallelizers for numerical
> simulation codes, and applying parallel codes to multi-million
> grid-cell reservoirs on high performance computers (CRAY T3E, clusters
> of workstations).
>
> A Ph.D. degree in Applied Mathematics, Computational Physics, Computer
> Science, Fluid Mechanics, Engineering or a closely related field is
> required. The candidate must demonstrate the ability to write parallel
> computational codes in C and FORTRAN. Preference will be given to
> candidates with experience in PVM and MPI. The appointment will
> initially be for 1 year, beginning September 1, 1998, and may be
> renewable for up to 2 years depending on availability of funds.
>
> Send an updated curriculum vitae and the names and phone numbers of
> three references to Professor David O. Ogbe, 437 Duckering Building,
> P.O. Box 755880, University of Alaska Fairbanks, Alaska 99775-5880.
> Fax: (907) 474-5912  Email: ffdoo@uaf.edu

Quick-Tip Q & A


A: {{ How do I determine which version of f90 I am using? }}

  f90 -V
 
  "-V" works for other products as well:        

  pghpf -V
  cc -V
  CC -V 
  


Q: ARSC permits the "chaining" of NQS jobs, as long as the 
   new job goes to the end of the queues. This can increase system
   utilization.

   Put another way, at the end of your qsub script, you may include a
   qsub call which submits your next job--provided that jobs in all
   other queues, including lower priority queues, get a chance to run
   first.
 
   What is a safe method to implement such chaining which is fair to
   other users?

[ Answers, questions, and tips graciously accepted. ]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.