ARSC HPC Users' Newsletter 366, July 26, 2007

ARSC Symposium on Multicore and New Processing Technologies

A symposium exploring the capabilities and use of new high performance computing (HPC) technologies is taking place Aug. 13-14 on the campus of the University of Alaska Fairbanks in conference room 010 of the West Ridge Research Building. Space is limited for this free symposium and registration is required. Register here:

http://www.arsc.edu/support/training/MulticoreWebForm2007.html

The two-day symposium is hosted by the Arctic Region Supercomputing Center, a national leader in the ongoing evaluation of current and next-generation processing technologies to improve the speed, memory functions and efficiencies of supercomputers. ARSC is a charter member of the National Science Foundation's Center for High Performance Reconfigurable Computing (CHPRC). For more details, see:

http://www.arsc.edu/support/training/MulticoreSymposium2007.html

Taking Slices of Data From NetCDF Files: Part I

[ by Anton Kulchitsky and Lee Higbie]

If you want to work on a slice or column of data from a netCDF file, the netCDF interface allows you to extract data without reading the whole array. There can be confusion if you would like to read data into an array of smaller dimension than the original variable. The goal of this article is to show how to extract precisely the information you want from a netCDF file.

We will use a modification of the netCDF tutorial's error function in our examples, and, this week, use C:


  /* just for errors in netCDF (modernized version from netCDF tutorial) */
  static void handle_nc_error(int status) 
  {
    if (status != NC_NOERR) {
      fprintf( stderr, "NetCDF Error %d: %s\n", status, nc_strerror(status) );
      exit( status );
    }
  }
  1. Getting a 1D Column From 4D Data.

    Assume you already know all the dimensions T, Z, Y, X of your 4D array, A (you can get them using the nc_inq_dimid and nc_inq_dimlen functions). Consider the task of extracting a 1D vertical column A(t,z,y,x) for fixed index values, x=x0, y=y0, t=t0 into a variable float a[] with enough memory allocated for the column. The following code performs the task (see a detailed explanation below the code):

    
      /* getting variable id (in parr) */
      int parr;
      int status = nc_inq_varid (nc_file, A, &parr); /* A is the name 
                  of the variable into netCDF as mentioned above */
      if (status != NC_NOERR) handle_nc_error(status);
    
      /* getting values */
      const size_t start[4] =     { t0, 0, y0, x0 };
      const size_t count[4] =     { 1,  Z, 1,  1 };
      status = nc_get_vara_float(nc_wrffile, parr, start, count, a);
      if (status != NC_NOERR) handle_nc_error(status);
    

    First, the code retrieves the variable id using the function nc_inq_varid with a file id of nc_file and variable name A into the variable, parr. The variable itself can be retrieved with nc_get_vara_float function. It uses the arrays start and count of dimension 4 (remember, we are working with a 4D array) to describe the required slice.

    In this example, we fix the x, y, and t indices with the values x0, y0 and t0 and allow z to range from 0 to Z-1. (If you need only part of the z-column, you can modify that range as well.) Thus, array start indicates the corner of the array from which you read the variable, and array count indicates how many elements are to be read in each dimension.

    It is very easy to modify the code if you need to read just one value from the 4D array, using index (t0,z0,y0,x0). The only coding change would be for the arrays start and count :

    
      const size_t start[4] =     { t0, z0, y0, x0 };
      const size_t count[4] =     { 1,  1, 1,  1 };
    

    And in this case, of course, it would be sufficient to allocate memory for only one float, not an entire array.

  2. Getting a 2D slice from 4D data.

    The same idea works for extracting a 2D slice of data and storing it to matrix b :

    
      float b[Y][X];
    
      /* getting variable id */
      int parr;
      int status = nc_inq_varid (nc_wrffile, 
    A
    , &parr);
      if (status != NC_NOERR) handle_nc_error(status);
    
      /* getting values */
      const size_t start[4] =     { 0, z0, 0, 0 };
      const size_t count[4] =     { 1, 1,  Y, X };
      status = nc_get_vara_float(nc_wrffile, parr, start, count, &b[0][0]);
      if (status != NC_NOERR) handle_nc_error(status);
    

    Note that nc_get_vara_float accepts a pointer to b[0][0] , not b itself.

    You can use the same ideas to extract data from arrays of different dimensions.

--

To be continued. Next time, in Fortran...

Highlights from SCICOMP-13, the IBM Scientific Users Group

[ by: Tom Logan ]

This year's annual IBM scientific users group conference, SCICOMP-13, was hosted by IPP (Max Planck Institute for Plasma Physics) in Garching, Germany. Garching is a quaint little city situated just 15 short subway minutes north of Munich. It is the location of the Max Planck Institute that incorporates many disciplines, although it was originally primarily focused on plasma physics research.

The 4 1/2 day conference included a 1-day IBM tutorial, twenty two user submitted talks, and eight additional IBM talks. As always, the tutorial was most appropriate for new IBM users, but included tidbits for the experienced user as well. The additional IBM talks covered the future road maps for architectures, loadleveler, the ACTC toolkits, GPFS, BlueGene, cell processors, and, of course, the compilers.

Here are some of the better quotes from the user talks:

  • "The clock-speed race is over, the core race has begun."
  • "As cores start to double every 18 months, memory bandwidth will halve every 18 months."

    [A few power5 vs power5+ talks reinforced this concept. Since power5+ is dual core and power5 is not, it was interesting to note that many memory intensive applications actually performed better on the power5 architecture.]

  • "Flops rates are not a useful measure of science achieved."

    [To back this up, we saw that 1/2 of a set of 19 well-established scientific codes got <10% of the given system's peak theoretical performance on power5+, and that the average application does only ~50% floating point operations]

  • "Mixed-mode programming rarely performs better than MPI."

    [I'll be honest, this is actually quoting myself. I attended yet another talk demonstrating that pure MPI performed the best, which reaffirmed this conclusion. However, in a side conversation with another attendee, I was assured that at least one production code at NCAR works best in mixed-mode.]

Overall, the conference contained a large amount of wonderful information for the IBM scientific user. Group discussions during the breaks and at lunch added greatly to the usefulness of the the meeting. And, of course, the late night meetings at the German beer gardens were the icing on the cake!

A complete list of the user talks, including "MPI-IO and MPI-2 One-Sided Communication" by yours truly, can be found at:

http://www.spscicomp.org/ScicomP13/Presentations/User/

I've been assured that the IBM talks will be posted within the next week or so.

Removal of SSH default usage protocol 1

[ by: Oralee Nudson ]

Secure Shell (SSH) is a protocol used to exchange data over the network using a secure channel. There are two versions of the SSH protocol in use, versions 1 and 2. Protocol 1, originating from Finland in 1995, offered a more secure alternative to popular network protocols of the time such as rlogin, telnet, and rsh (wikipedia.com). Supporting the features of user authentication through the use of passwords and encryption, SSH protocol 1 quickly became a standard in network communication. Just one year later, SSH protocol 2 was made available and offered more security features not supported by protocol 1. Using the Diffie-Hellman key exchange and allowing a user to run more than one shell session over a single SSH connection are just two of the many significant improvements protocol 2 offered over protocol 1.

An excellent side-by-side comparison of SSH protocols 1 and 2 is presented on the following snailbook webpage:

http://www.snailbook.com/faq/ssh-1-vs-2.auto.html

In an ongoing effort to promote the use of more secure connections to and from ARSC systems, on August 1, 2007 SSH protocol 1 will be removed as a default usage protocol on all systems. Protocol 2 will become the default protocol. The discontinuation of protocol 1 as a default usage protocol does not mean that its use will be completely removed. Instead, should a user need to use SSH protocol 1 to connect to and from ARSC systems, they should add the following line to their ~/.ssh/config file:

Protocol 2,1

We are expecting this change in default usage protocols will be transparent to most users. However, it may be beneficial to double-check the SSH protocol settings of the client you use to connect to ARSC systems. For example, if you use Putty, confirm that the "Preferred SSH protocol version" radio button selected is "2" or "2 only" within the Putty Configuration SSH settings.

Users who would like less information are welcome to read "news ssh" on any ARSC system. As always, please feel free to direct any questions to the ARSC help desk.

Quick-Tip Q & A


A:[[ Can someone please explain this?  When I use over 30,000,000 random 
  [[  variables in my code, the sum and average get out of whack.
  [[  Below is a sample code which shows the problem.
  [[  
  [[  mg56% cat t2.f90
  [[  program t2
  [[  
  [[    integer :: p, N
  [[    real, allocatable, dimension (:) :: a
  [[  
  [[    do N=10000000,50000000,5000000
  [[      allocate (a(1:N))
  [[  
  [[      call random_seed ()
  [[      call random_number (a)
  [[  
  [[      print*, "Count:", N, "Sum:", sum (a), "Avg:", sum(a)/real(N), "Min:"  &
  [[        , minval(a), "Max:", maxval(a)
  [[  
  [[      deallocate (a)
  [[    enddo 
  [[  
  [[  end program
  [[  
  [[  mg56% pathf90 t2.f90
  [[  mg56% ./a.out
  [[   Count: 10000000 Sum: 5000621.5 Avg: 0.500062168 Min: 2.980232239E-7 Max: 0.999999702
  [[   Count: 15000000 Sum: 7499592.5 Avg: 0.49997282 Min: 0.E+0 Max: 0.999999881
  [[   Count: 20000000 Sum: 10002506. Avg: 0.500125289 Min: 0.E+0 Max: 0.99999994
  [[   Count: 25000000 Sum: 12498267. Avg: 0.49993068 Min: 5.960464478E-8 Max: 0.99999994
  [[   Count: 30000000 Sum: 15000845. Avg: 0.500028193 Min: 0.E+0 Max: 0.99999994
  [[   Count: 35000000 Sum: 16777216. Avg: 0.479349017 Min: 0.E+0 Max: 0.99999994
  [[   Count: 40000000 Sum: 16777216. Avg: 0.419430405 Min: 0.E+0 Max: 0.99999994
  [[   Count: 45000000 Sum: 16777216. Avg: 0.372827023 Min: 5.960464478E-8 Max: 0.99999994
  [[   Count: 50000000 Sum: 16777216. Avg: 0.335544318 Min: 0.E+0 Max: 0.99999994
  [[  mg56% 
  [[  
  [[  It happens on both the IBM with xlf90 and the Sun Opteron cluster
  [[  with pathf90.  (Sorry, but 30000000 random numbers just isn't
  [[  enough!)



  #
  # Orion Sky Lawlor
  #

  16777216 is the maximum value reached by a single-precision float
  when adding numbers not exceeding 1.0--further additions of such small
  values don't affect the sum due to roundoff.

  Here's a C/C++ example of the same thing:
          float gnats=1.0;
          volatile float windshield=1<<24;
          float orig=windshield;
          for (int i=0;i<1000;i++)
                  windshield += gnats;

  Because of roundoff, the gnats do not affect the windshield in any way.
  The "Roundoff" section at the bottom of this page has more:

          http://www.cs.uaf.edu/2006/fall/cs301/lecture/11_10_float.html

  To fix this, you'll have to use a more roundoff-friendly "sum"
  implementation.  For example, you could write an explicit loop that
  adds up the array values into a real*8 (double-precision) temporary:

          real*8 :: doublesum
          ...
                  doublesum=0.0
                  do i=1,N
                          doublesum = doublesum + a(i)
                  enddo
          ...

  This should work with arrays of length up to 2^53 (about 10^16).
  It is quite silly that the builtin compiler "sum" intrinsic doesn't
  already do this!


  #
  # Ed Kornkven
  #

     program t3
     !
     ! Demonstrate cause of erroneous sum() results.  The "real" type
     ! is real*4 by default which has only 24 mantissa bits.  Once a
     ! sum of 2^24 is reached, adding a small number (e.g., 0<x<1) has no
     ! representable effect.  Solution is to do sum() in real*8 
     ! (function sum2) or sum the array in chunks (function sum3).
     ! Note that when N > 2*2^24, sum3 still fails because the second
     ! chunk suffers the same problem as the original array.
     ! E. Kornkven, ARSC
     !

       integer :: p, N
       real*4, allocatable, dimension (:) :: a
       real*8 sum2
       real*4 sum3

       do N=10000000,50000000,5000000
         allocate (a(1:N))

         !call random_seed ()
         !call random_number (a)
         a = 1

         !print*, "Count:", N, "Sum:", sum (a), "Avg:", sum(a)/real(N), "Min:"  &
           !, minval(a), "Max:", maxval(a), "Sum2:", sum2(a,N), "Sum3:", sum3(a,N)
         print 10, N, sum (a), sum(a), sum2(a,N), sum3(a,N)
   10    format ("Count:", i10, " Sum:", f10.1, " (0x", z8, ")", &
                 " Sum2:", f10.1, " Sum3:", f10.1)

         deallocate (a)
       enddo 

     end program

     real*8 function sum2(a, N)
     integer :: N
     real*4, dimension(N) :: a
     real*8 :: s
     s = 0.0
     do i = 1, N
        s = s + a(i)
     enddo
     sum2 = s
     return
     end

     real*4 function sum3(a, N)
     integer :: N
     real*4, dimension(N) :: a
     real*8 :: s
     integer :: half
     half = N/2
     sum3 = sum(a(1:half)) + sum(a(half+1:N))
     return
     end


  #
  # Sean Ziegeler
  #

  I don't think its a problem with the random number generator.  The
  summations of the real values are overflowing, which also causes the
  incorrect average.  You need to sum with double precision to prevent
  that (unless you are going to have lots more random numbers to sum).
  A quick fix to this code is to replace both instances of "sum(a)" with
  "sum(dble(a))".

  I tried it out with xlf90 on an IBM and it works.

  Some compilers may be smart enough to convert a single element of the
  array at a time for summing; others may copy the entire array into
  a double precision array.  If you want to prevent that for memory
  efficiency, you'll have to use a loop.


  #
  # Olivier Golinelli
  #

  Note that 16777216 = 2 ** 24.  This behaviour is a famous gotcha.
  Internally, the function "Sum(a)" is translated to

     suma = 0
     Do i = 1, n
        suma = suma + A(i)
     Enddo

  As 0 < A(i) < 1, "suma" grows during the loop execution.  Consequently,
  "suma" reaches the value 2 ** 24 when "i" is of order 2 ** 25.  After
  that, "suma" is locked at this value, because the addition 
  "suma + A(i)" is done with real values coded in single precision.
  As A(i) < 1, the ratio suma/A(i) > 2**24 and the addition always gives
  2**24.

  To circumvent this problem, two solutions:

  1) use real(8) everywhere by default in the program.  Then if you have a
     problem with memory, try to reduce with real(4), or

  2) write your own function, "sum," using a tree reduction pattern,
     and not a linear pattern, for example:


  Program test
  ! author : olivier.golinelli@cea.fr
  Implicit None 
  Integer, Parameter :: na = 10**8, wp = 4
  Real(wp) :: A(na), sum1, sum2, sum3
  Integer i, n, k

  Call Random_number(A)

  Print *, Sum(A), 'Function Sum'

  sum1 = 0
  Do i = 1, na
    sum1 = sum1 + A(i)
  End Do
  Print *, sum1, 'Loop with compiler optimization'

  n = size(A)
  sum2 = 0
  Do i = 1, na
    sum2 = sum2 + A(i)
    If (i == -1) Print *, 'blabla'
  End Do
  Print *, sum2, 'Simple loop without optimization'

  n = size(A)
  Do
    k = (n+1) / 2
    Do i = 1, n/2
       A(i) = A(i) + A(i+k)
    Enddo
    n = k
    if (n == 1) Exit
  enddo
  sum3 = A(1)

  Print *, sum3, 'Tree reduction'
  End Program


  Interestingly, the behavior of the 'Sum' intrinsic and of the loop
  depends on the optimization done by the compiler.  I use an old Intel
  Pentium4 with the Intel compiler "ifort version 8.0."  With that, the
  'Sum' intrinsic gives the good result.  The simple loop,

      Do i = 1, na
         sum1 = sum1 + A(i)
      End Do

  gives also the good result.  Clearly, the compiler substitutes 
  an optimized function for both.

  But if I disable optimization (by putting the print statement inside
  the loop) I obtain the behavior described in your newsletter. I.e.,
  the result reaches a maximum at 2**24 for single precision real numbers,
  real(4).

  Here's the output for the above program: 

    4.9997072E+07 Function Sum
    4.9997072E+07 Loop with compiler optimization
    1.6777216E+07 Simple loop without optimization
    4.9997072E+07 Tree reduction

  These gotchas are more numerous than we might expect.


  #
  # Edward Anderson
  #

  If you only have 7 or 8 digits of accuracy, you can't add 10's of
  millions of numbers in the same range without losing some of them.
  Once the sum gets big enough, adding a relatively small number to it
  has no effect.  The short answer is you need to use double precision
  to accumulate the sum (the compiler option "-r8" to promote reals to
  8 bytes, or 64 bits, is fairly standard), unless you subtract the mean
  to keep the sum around zero.

  Interestingly, the Intel compiler on an SGI Altix produces the expected
  result without (explicitly) resorting to double precision.


  #
  # Brad Chamberlain
  #

  I suspect the problem is that you're still using Fortran. :-)




Q: Regarding commenting in C/C++ code: I've seen code that uses /*
   to start comments in some places and /** to start them in other
   places. Why use /** ?

[[ Answers, Questions, and Tips Graciously Accepted ]]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions: Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
Back to Top