ARSC HPC Users' Newsletter 347, September 01, 2006

Fall 2006: ARSC User Training

Once again, ARSC users have a tremendous variety and depth of training opportunities. Here are the topics this fall:

  • Introduction to ARSC & HPC
  • Introduction to Unix
  • Data Management at ARSC
  • Writing Batch Scripts
  • Data Visualization
  • Introduction to Fortran
  • Debugging
  • Validation & Verification
  • Performance Programming
  • Parallel Shared Memory Programming
  • Parallel Distributed Memory Programming
  • Space Plasma Physics Applications

ARSC user training is offered in conjunction with "Core Skills For Computational Science," taught jointly by the UAF Physics Department and ARSC.

This *IS* ARSC's Fall User Training. You are encouraged to drop in on any lecture of interest. Here's the complete training schedule:

http://people.arsc.edu/~cskills/schedule.shtml

And here's the primary training web site:

http://people.arsc.edu/~cskills/

Contact Tom Logan (logan AT arsc.edu) with questions.

How to Beat Moore's Law: An Optimization Story in 6 Acts--Part I

[ Many thanks to Tom Logan of ARSC for this two-part thriller! ]

ACT I Analysis Shows I'm Doomed
ACT II Depression Sets In
ACT III Relying on Moore
ACT IV Take Your Own Advice
ACT V Directives Save The Day
ACT VI Conclusions of the Super-Linear Kind

ACT I: Analysis Shows I'm Doomed

I was recently faced with a problem that often comes up in the scientific computing realm: the tsunami model that I was working with was too slow. I needed to run a job for 24780 iterations (time steps). Not realizing this was about 10 times longer than any of the test runs previously made, I started the job up on one p655 node on Iceberg (the code is serial) and waited for the results.

What I got was very discouraging. In the eight hours allowed in the standard queue on Iceberg, the code only completed 4050 iterations. This worked out to about 8.4 iterations per minute. A quick calculation showed me that I was most certainly doomed, since the full run would take roughly 49 hours to complete and, while the "single" queues at ARSC would permit such a long run, it would be impractical for the desired test and production work.
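
The arithmetic behind that estimate is straightforward:

  4050 iterations / 480 minutes       ~  8.4 iterations per minute
  24780 iterations / 8.4 per minute   ~  2950 minutes  ~  49 hours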

ACT II: Depression Sets In

Since I wanted to get these runs done in a timely fashion, I ruled out any significant code changes. For instance, trying to modify the code to write a restart file would be too time consuming. Writing a parallel version of the code using MPI would be a serious time sink, not to mention that these types of codes (many many iterations on relatively small grids) are not the best candidates for message passing algorithms.

I thus turned to the compiler to help with my dilemma. At this point, my compiler flags looked like this:


  LDFLAGS = -O5 -qarch=pwr4 -qtune=pwr4 -qstrict -q64 
  FFLAGS = -O5 -qarch=pwr4 -qtune=pwr4 -qstrict -q64 -qmaxmem=-1
  FFLAGS_1 = -O5 -qarch=pwr4 -qtune=pwr4 -qstrict -q64 -qmaxmem=-1 -qsuffix=f=f90

The FFLAGS are used for the .f files, while FFLAGS_1 is used for the .f90 files. I already had the code tuned for the architecture and compiled at the highest level of optimization provided.
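
These look like makefile macros; a minimal sketch of how they might be wired to the two kinds of source files is shown below (the compiler names and suffix rules here are my own guesses, not taken from the actual makefile):

  .SUFFIXES: .o .f .f90

  .f.o:
          xlf -c $(FFLAGS) $<

  .f90.o:
          xlf90 -c $(FFLAGS_1) $<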

My next alternative was to try the IBM compiler's built-in auto-parallelization. Not having had much luck with this in the past, I was not optimistic. Sure enough, simply adding the -qsmp=auto switch to my compiler flags and setting the environment variable OMP_NUM_THREADS=8 in my LoadLeveler script bought me nothing. I was still getting roughly 8.3 iterations per minute.
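
For reference, the thread count is set in the LoadLeveler script itself. A minimal sketch looks something like the following, where the class name, wall clock limit, and executable name are placeholders rather than the actual script:

  #!/bin/ksh
  # @ class            = standard
  # @ wall_clock_limit = 8:00:00
  # @ output           = tsunami.$(jobid).out
  # @ error            = tsunami.$(jobid).err
  # @ queue

  # threads available to the -qsmp=auto executable
  export OMP_NUM_THREADS=8
  ./tsunami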

To facilitate testing, I reduced the run to only 1000 iterations, or approximately 2 hours of run time.

ACT III: Relying on Moore

Next I had what I thought was a brilliant idea! We've got these new power5 nodes on Iceflyer. Maybe that would do it: make Moore's law work for me by using a bigger/better/faster machine! So that's what I did. I moved the code to Iceflyer and compiled it with slightly modified flags:


  LDFLAGS = -O5 -qarch=pwr5 -qtune=pwr5 -qstrict -q64
  FFLAGS = -O5 -qarch=pwr5 -qtune=pwr5 -qstrict -q64 -qmaxmem=-1 
  FFLAGS_1 = -O5 -qarch=pwr5 -qtune=pwr5 -qstrict -q64 -qmaxmem=-1 -qsuffix=f=f90 

Well, I got about what Moore said I should expect: a bit less than a 2x speedup, bringing the time down to 66 minutes for 1,000 iterations. Taking advantage of the 16-hour limit of the p5 queue, I re-ran the entire simulation, only to have the job time out after completing 15000 iterations. Close, but no cigar.
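
Once more, the arithmetic:

  1000 iterations / 66 minutes          ~  15 iterations per minute
  15 iterations/minute * 960 minutes    ~  14,500 iterations (of the 24780 needed)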

What followed were many failed attempts at slight variations. I tried auto-parallelization using 4 or 8 threads. Iterations per minute improved from 15.15 (serial) to 15.38 (4 threads) to 15.63 (8 threads). Once again, virtually no gains were realized from auto-parallelization.

I also tried AIX 5.3. Since it has support for simultaneous multi-threading, I could use up to 16 threads on a single 8-processor node. Alas, the times were pretty much exactly the same as on the AIX 5.2 nodes.

ACT IV: Take Your Own Advice

Finally, I took the advice that I give in all of my classes: start by profiling your code and see what kinds of optimization are possible. I changed my compilation flags a bit, adding -pg -g to turn on profiling with symbol tables and adding -qreport -qsource -qlist to get full compilation reports for the code:


  LDFLAGS = -O5 -qarch=pwr5 -qtune=pwr5 -qstrict -q64 -qsmp=auto -qreport \
                  -qsource -qlist -pg -g -qfullpath
  FFLAGS = -O5 -qarch=pwr5 -qtune=pwr5 -qstrict -q64 -qmaxmem=-1 -qsmp=auto \
                 -qreport -qsource -qlist -pg -g -qfullpath
  FFLAGS_1 = -O5 -qarch=pwr5 -qtune=pwr5 -qstrict -q64 -qmaxmem=-1 \
                   -qsuffix=f=f90 -qsmp=auto -qreport -qsource -qlist -pg \
                   -g -qfullpath

When the run of this compilation was complete, I had my gmon.out profile file, which I processed with gprof using:


  % gprof > gprof_pre_opt.out

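Run this way, gprof falls back on its defaults, reading a.out for the symbol table and gmon.out for the profile data in the current directory; naming them explicitly does the same job (the executable name here is only a placeholder):

  % gprof ./tsunami gmon.out > gprof_pre_opt.out
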
Wading through the nearly 2000 lines of output, I found (somewhere near the bottom of the file) the following report:


  Time: 3891.28 seconds

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 90.9    3536.56  3536.56     1001  3533.03  3533.03  .momt_s [3]
  5.2    3738.03   201.47     1001   201.27   201.27  .mass_s [4]
  1.1    3782.05    44.02     1001    43.98    43.98  .change [5]
  0.7    3810.05    28.00     1001    27.97    27.97  .minmax [6]
  0.4    3825.69    15.64 144108020     0.00     0.00  ._log [11]
  0.3    3836.97    11.28 172929624     0.00     0.00  .scalb [12]
  0.3    3847.70    10.72                             .__mcount [13]
  0.2    3856.30     8.60 86464812     0.00     0.00  ._atan2 [10]
  0.1    3860.81     4.51                             ._power_logb [15]
  0.1    3865.25     4.44 28821604     0.00     0.00  .udcal [14]
  0.1    3868.66     3.41        1  3410.00 25943.51  .deform_smylie [8]
  0.1    3871.34     2.68     1001     2.68     2.68  .open [16]
  0.1    3873.37     2.03                             .memmove [17]
  0.1    3875.37     2.00                             ._cosh [18]
  .
  .
  .

So, 90.9% of the execution time was being spent in the routine momt_s, which calculates momentum in spherical coordinates. I next looked at the .lst file created during compilation. Since this file had over 98,000 lines in it, I searched for momt_s and found:


>>>>> SOURCE SECTION <<<<<
  686
  687
  688 !-----------------------------------------------------------------
  689       subroutine momt_s (l,layid)
  690 ! ....Solve momentum equation (linear) in spherical coord.
  691 !     layid = 1, outest layer
  692 !     otherwise, inner layer
  693 !-----------------------------------------------------------------
  694       use layer_params
  695       type (layer)         :: l
  696       integer       :: layid
  697 !      real z(ix,jy,2),m(ix,jy,2),n(ix,jy,2),h(ix,jy)
  698 !      real r2(ix,jy),r3(ix,jy),r4(ix,jy),r5(ix,jy)
  699       data eps/1.0e-6/, zero/0.0/, twlvth/0.08333333333333/
  700 !
  701       ixm1 = l%ix-1
  702       jym1 = l%jy-1
  703       is = 2
  704       js = 2
  705       if (layid .eq. 1) then
  706         is = 1
  707         js = 1
  708       end if
  709       do i=is,ixm1
  710         ip1 = i+1
  711         do j=js,l%jy
  712           if ((l%h(i,j).gt.zero) .and. (l%h(ip1,j).gt.zero)) then
  713             if (j .le. js) then
  714               jm1 = js
  715             else
  716               jm1 = j-1
  717             endif
  718             if (j .ge. l%jy) then
  719               jp1 = l%jy
  720             else
  721               jp1 = j+1
  722             endif
  723             tot_n = l%n(i,j,1)+l%n(ip1,j,1)+l%n(i,jm1,1)+ &
                         l%n(ip1,jm1,1)
  724             xm = l%m(i,j,1)-l%r2(i,j)*(l%z(ip1,j,2)-l%z(i,j,2))+ &
                       l%r3(i,j)*tot_n-l%r2(i,j)*twlvth*((l%z(ip1,jp1,2)- &
                       2*l%z(ip1,j,2)+l%z(ip1,jm1,2))-(l%z(i,jp1,2)-2* &
                       l%z(i,j,2)+l%z(i,jm1,2)))
  725             if (abs(xm) .lt. eps) xm = zero
  726             l%m(i,j,2) = xm
  727           else
  728             l%m(i,j,2) = 0.0
  729           end if
  730         end do
  731       end do
  732 !
  733       do j=js,jym1
  734         jp1 = j+1
  735         do i=is,l%ix
  736           if ((l%h(i,j).gt.zero) .and. (l%h(i,jp1).gt.zero)) then
  737             if (i .le. is) then
  738               im1 = is
  739             else
  740               im1 = i-1
  741             endif
  742             if (i .ge. l%ix) then
  743               ip1 = l%ix
  744             else
  745               ip1 = i+1
  746             endif
  747             tot_m = l%m(im1,j,1)+l%m(im1,jp1,1)+l%m(i,j,1)+ &
                         l%m(i,jp1,1)
  748             xn = l%n(i,j,1)-l%r4(i,j)*(l%z(i,jp1,2)-l%z(i,j,2))- &
                       l%r5(i,j)*tot_m-l%r5(i,j)*twlvth*((l%z(ip1,jp1,2)- &
                       2*l%z(i,jp1,2)+l%z(im1,jp1,2))-(l%z(ip1,j,2)-2* &
                       l%z(i,j,2)+l%z(im1,j,2)))
  749             if (abs(xn) .lt. eps) xn = zero
  750             l%n(i,j,2) = xn
  751           else
  752             l%n(i,j,2) = 0.0
  753           end if
  754         end do
  755       end do
  756 !
  757       return
  758       end
** momt_s   === End of Compilation 12 ===


Source      Source      Loop Id       Action / Information
File        Line
--------    --------    -------   ----------------------------------------------
 0           709          1       Loop cannot be automatically parallelized.  A
                                  dependency is carried by variable aliasing or
                                  function call.
 0           711          2       Loop cannot be automatically parallelized.  A
                                  dependency is carried by variable aliasing or
                                  function call.
 0           733          3       Loop cannot be automatically parallelized.  A
                                  dependency is carried by variable aliasing or
                                  function call.
 0           735          4       Loop cannot be automatically parallelized.  A
                                  dependency is carried by variable aliasing or
                                  function call.

So 90.9% of my run time is spent in a routine that the compiler will not automatically parallelize for me. What to do...

...don't miss the thrilling conclusion in the next newsletter:

ACT V Directives Save The Day
ACT VI Conclusions of the Super-Linear Kind

Java for Fortran Programmers: Part I

[[ Thanks to Lee Higbie of ARSC for this tutorial. ]]

This is the first in a series of articles, presented as a tutorial, for scientists and engineers. Some knowledge of C is useful, but I will not assume that you know C++ or any other object oriented language.

My planned tutorial outline is:
  • Java's Uses for the Scientific and Engineering Community
  • Object Oriented Programming (OOP)
  • How the OOP mindset differs from that usual for Fortran programmers
  • How the OOP syntax differs from that of Fortran and C
  • Interfacing Java and Fortran programs
  • Creating Java programs
  • Example

How far and deep I go will depend on feedback. If this topic interests you, let me or one of the editors know!

This initial part of the tutorial is expected to interest new scientific and engineering programmers or programming managers, those considering a new project and wondering if Java might be a good choice. After this initial background, the material will become more technical and should interest programmers who are starting to learn Java or have picked up a little in the past.

Java's Uses for the Scientific and Engineering Community

Java is easy to use, but it has a steep learning curve if you've never used an object oriented programming language. OOPs require a different mindset from the one used for imperative languages (like Fortran and C). Unlike C++, where it is easy to write a conventional (imperative) program by using only the C subset of the language, Java is more aggressively object oriented--even HelloWorld uses an object.

In our world, Java is especially suited for GUIs and support programs, and I doubt I'll see it used for a major, computation-intensive application. Though unsuited for heavy computational work, Java is a well-designed OO language with many good features. Some are:

  1. It includes an automatic documentation system. Stylized comments can be used to describe parts of the code, and the documentation is generated automatically from them and the code (a one-line example follows this list).
  2. There are several large libraries of GUI widgets that allow control programs to interact visually with users.
  3. It is highly portable. With minimal care applications can be written that will run on most platforms.
  4. It was designed from the beginning for applets, programs that run in web browsers. An applet allows the user to safely run a program from a workstation.
  5. It has a built-in structure for creating and vetting exceptional conditions. A method can create an exception and force its users to deal with the exception.
  6. It has built-in functionality and syntax to eliminate many of the problems that crop up in C++ programs (memory leaks, wandering pointers, weak typing, implicit type conversions, ...).
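
As promised in item 1, generating the documentation is a one-line affair once the stylized comments are in place. Assuming the .java sources sit in the current directory (the output directory name "docs" is only a placeholder):

  % javadoc -d docs *.java

This produces a browsable set of HTML pages describing each class and its methods.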

(I've measured a 4:1 slowdown when doing simple array computations in Java instead of Fortran. Carbon-based systems[*] react slowly, though, so Java works well for interacting with them.)

Object Oriented Programming (OOP)

So what is an object oriented programming language? The four defining characteristics of OOPs are:

  1. Encapsulation. A single block of code, called a class, defines a data structure and the procedures for operating on it, called methods. Classes often include methods and variables that are hidden from users, which makes it possible to change algorithms or code without users of the class knowing about it.
  2. Inheritance. A class can take on the variables and methods of a parent class. This is especially useful for libraries and is an important concept to understand. For example, PopupMenu extends Menu extends MenuItem extends MenuComponent extends Object. This means that the methods for adding items to a PopupMenu are not recoded but are taken exactly from Menu, the event handling methods of MenuItem are directly inherited by any PopupMenu, and so on.
  3. Polymorphism. Methods (functions) can be called with a variety of arguments; the number and types of the arguments are not constrained. In object-oriented languages it is common for
    1. a method to set some default parameters and then call the general version of the method,
    2. inherited methods (methods from a class being extended) to provide variants that accept different arguments, and
    3. a method to convert the argument types and call the general version of the method.
  4. Class structure. The basic unit of code is a class, which encapsulates a data structure and the methods for working with it. For example, the String class includes almost two score methods of its own, inherits another half dozen from Object, the ultimate parent of all classes, and has polymorphic variants of many of these methods. The emphasis of an OOP is on the classes and their data, not on the flow of logic or control.

There is one more bit of basic OOP terminology that is needed to discuss OOP programs. As mentioned, a class is the code that describes a data structure and includes the methods (functions) for operating on it. The actual data structure is called an object, but don't confuse this with the class Object (upper case oh), which is the ultimate parent of all Java classes. Just as you might have dozens of strings in a Fortran program, a Java application may have dozens of String objects, each of which is an instance of the String class.

So, how does Java measure up? It has all of these characteristics but also has basic, non-object data: logical, various types of integer, floating point, and character data are available, facilitating basic imperative programming. In Part II, I will provide an example to illustrate the basic parts of code.

This article has described some of the places where scientific and engineering programmers might apply Java in their work. I have introduced the top level of OOP terminology. I'll recap with a dictionary translating the Fortran terminology used here to Java.

  Fortran term            Java term               Explanation
  ---------------------   ---------------------   -------------------------------------------------
  function                method                  parameters passed by value, polymorphism rampant
  structure declaration   class                   class includes code, usually one to a file
  structured variable     object (small oh)       object also owns all its class's methods
  subroutine              method with void type   (no returned value)
  type conversion         cast                    syntax--use type in parens: x = (real) i;

This article has covered the first two tutorial topics. We'll pick it up again with:

  • How the OOP mindset differs from that usual for Fortran programmers

--

[*] Footnote: "Carbon-based systems": a euphemism for people. Those unfamiliar with this term are referred to Star Trek, where, I think, the Borg referred to the astronauts as a carbon-based infestation.

Quick-Tip Q & A


A:[[ I am writing a script which looks at the extension of a file.  
  [[ So far I'm not too committed to a particular scripting language.  
  [[ Is there an easy way to get the extension of a file without 
  [[ using sed!  

  # 
  # Lorin Hochstein
  # 
  In tcsh, the ":e" variable modifier will extract the extension of a
  file. Also useful: the ":r" modifier will extract the name without the
  extension.
  
  $ set x="filename.txt"
  $ echo $x:e
  txt
  $ echo $x:r
  filename
  
  # 
  # Harper Simmons
  # 
  using csh/tcsh (I know, I know, uncool)
  
  set a = roo.dat
  
  set ext = $a:e
  echo $ext
  produces "dat"
  
  # 
  # Ryan Czerwiec
  # 
  For csh/tcsh this will work (there will be a similar answer for
  sh/bash/ksh): 
  
  If your filename is stored in the variable "file,"
  then the extension "ext" can be obtained with:
      set ext = `echo $file | tr "." " "`
  This will create an array where the extension is the last element,
  or ext[$#ext]. This can also be useful if you need to reassemble the
  filename with a different extension, for example.
  
  This version uses less memory (it doesn't create an array), but it's
  a little slower:
      set ext = `echo $file | tr "." "\n" | tail -1`
  You can do it a little more simply if you happen to know that all of
  your filenames will have the same number of "." characters in them:
      set ext = `echo $file | cut -d'.' -f2`
  where the example of -f2 is for a file with one "." character.  Use a
  number one higher than the number of dots as long as that number is
  fixed (you can use a variable for it, too, as in -f$num).

  # 
  # One Editor:
  # 
  You can use "expr" regular expressions.  E.g.:

    $ expr this.is.a.test : ".*\.\(.*\)"
    test

  # 
  # Other Editor:
  # 
  I would use one of the bash pattern matching operators to do this.

    ${val##pattern} 

  This operator does the following:  If pattern matches the beginning
  of the variable $val it deletes the longest part that matches then
  returns the rest of the string.

  So the following pattern will return the extension as long as there
  is at least one dot in the filename.

    ${val##*.}

  If the filename might not have a dot in it, we can check for that
  using grep:

    for f in *; do 
      if [ ! -z "$(echo $f | grep "\." )" ]; then
         echo ${f##*.}; 
      fi
    done

  Alternately, we can eliminate the grep by ensuring there is a dot
  in the filename.  E.g.:

    for f in *.*; do 
      echo ${f##*.}; 
    done






Q: Here's a conditional statement grabbed from the (/bin/sh) 
   configure script for mysql.  There are many like this:

       if test X"$mysql_cv_compress" != Xyes; then 
           # ...do stuff...
       fi
  
   For my scripts, the following style has always worked:

       if [[ $mysql_cv_compress != yes ]]; then 
           # ...do stuff...
       fi
  
   So, two questions: 
     1) Why would the experts use "test" rather than the square bracket
        syntax?
     2) Why bother with that "X" ??? 

[[ Answers, Questions, and Tips Graciously Accepted ]]


Current Editors:
Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.