ARSC T3E Users' Newsletter 127, October 10, 1997

Streams: On or Off?

On the T3E, streams refers to read-ahead of local memory locations. When enabled, this read-ahead attempts to fetch data from memory into cache before it is actually needed. Depending on a code's memory access pattern, it can improve performance significantly.

Whether streams are enabled in a given run of your program is determined by settings in the system configuration files, the user environment, and the program itself. A program's stream settings override the environment's, which in turn override the system default.

The stream level can be set at each of these three places as follows:

  System default:   <Not under user control>

  In The User Environment:  
    Set the environment variables SCACHE_D_STREAMS and/or SCACHE_I_STREAMS
    to desired stream level.  For instance:

    export SCACHE_D_STREAMS=0  [ksh user turns data streams off]
    setenv SCACHE_I_STREAMS 1  [csh user turns instruction stream on]

  In Programs:
    Call one of the routines SET_D_STREAM() or SET_I_STREAM() (or the C
    equivalents) from within your code.  For instance:

       CALL SET_D_STREAM(0)            ! disable data streaming

To determine which settings are currently in use, call the functions GET_D_STREAM() or GET_I_STREAM() (or the C equivalents).

Here's a simple example:

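      program prog

      integer get_d_stream, d_stream_value

! query the current level before each report
      d_stream_value = get_d_stream()
      write(6,*) ' streams were initially ',d_stream_value

      call set_d_stream(1)
      d_stream_value = get_d_stream()
      write(6,*) ' streams have been set to ',d_stream_value

      call set_d_stream(0)
      d_stream_value = get_d_stream()
      write(6,*) ' streams have been set to ',d_stream_value

      end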


Streams can be set to any of four levels. You should experiment with data streams on and off, at the minimum, and possibly with all four different levels and with both the instruction and data stream capabilities. The following description of the different levels is taken from man GET_D_STREAM on yukon:

   0 or _MPC_NO_STREAMS               Deactivates read-ahead in the
                                      stream buffers.  Previously
                                      detected small-strided reference
                                      streams continue to be active, but
                                      no new small-strided access streams
                                      are detected.

   1 or _MPC_DETECT_STREAMS           Activates read-ahead in the stream
                                      buffers.  Stream detection occurs
                                      upon reference to the second of two
                                      successive 8-word secondary cache

   2 or _MPC_INITIAL_PREFETCH         Sets stream detection to the level
                                      set by _MPC_DETECT_STREAMS, but in
                                      addition any secondary cache and
                                      stream buffer misses result in a
                                      prefetch of the next successive
                                      secondary cache line.

   3 or _MPC_AGGRESSIVE_PREFETCH      Sets stream detection to the level
                                      set by _MPC_INITIAL_PREFETCH, but
                                      additional aggressive read-ahead is

Additional ARSC Users Note: Although our system default is now to have streams on at level 1, if your code calls any SHMEM routine, the executable will be set to override this at run-time and run with streams off. If you want your program to use streams, set them explicitly to the level you want. At run-time, set the SCACHE_D_STREAMS environment variable to 1 in your environment, or code your program to turn streams on internally by calling SET_D_STREAM().

Cache Bypass directive

Having installed programming environment 3.0 and upgraded yukon to be streams-safe, we decided to run this follow-up to an article that appeared in issue #120. We test several different means of copying one array to another on a single processor on yukon.

This revised program can test the PE3.0 cache_bypass directive to force copies through the E-register rather than cache. Under PE2.0, shmem_put and shmem_get were clearly the fastest available methods. However, running with streams enabled, a conventional loop with cache_bypass is equally fast.

In the following table, the timings are in units of number of array elements (64-bit words) transferred per second, and should not be construed as memory bandwidth. The tests were made on ARSC's 450MHz T3E, yukon. In order to make the runs on application (rather than shell) PEs, they were started on 2 PEs (mpprun -n2), and the results from one of the two PEs were discarded.

Here are the timing results followed by the program:

                               STREAMS ON   STREAMS OFF
                              ==========   ===========
Copying       10 words:

cache bypass loop construct      1. MW/s     1. MW/s
cache bypass f90 array op        3. MW/s     3. MW/s
shmem_get                        2. MW/s     2. MW/s
shmem_put                        3. MW/s     3. MW/s
f90 array op                     3. MW/s     2. MW/s
loop construct                   2. MW/s     2. MW/s

Copying      100 words:

cache bypass loop construct      8. MW/s     8. MW/s
cache bypass f90 array op       13. MW/s     8. MW/s
shmem_get                       15. MW/s    15. MW/s
shmem_put                        6. MW/s    19. MW/s
f90 array op                    13. MW/s     8. MW/s
loop construct                  12. MW/s     8. MW/s

Copying     1000 words:

cache bypass loop construct     31. MW/s    31. MW/s
cache bypass f90 array op       19. MW/s     9. MW/s
shmem_get                       31. MW/s    26. MW/s
shmem_put                       32. MW/s    32. MW/s
f90 array op                    19. MW/s    10. MW/s
loop construct                  19. MW/s    10. MW/s

Copying    10000 words:

cache bypass loop construct     36. MW/s    36. MW/s
cache bypass f90 array op       23. MW/s    10. MW/s
shmem_get                       36. MW/s    36. MW/s
shmem_put                       36. MW/s    36. MW/s
f90 array op                    23. MW/s    11. MW/s
loop construct                  23. MW/s    11. MW/s

Copying   100000 words:

cache bypass loop construct     36. MW/s    36. MW/s
cache bypass f90 array op       27. MW/s    11. MW/s
shmem_get                       35. MW/s    37. MW/s
shmem_put                       36. MW/s    37. MW/s
f90 array op                    28. MW/s    11. MW/s
loop construct                  28. MW/s    11. MW/s

Copying  1000000 words:

cache bypass loop construct     36. MW/s    36. MW/s
cache bypass f90 array op       28. MW/s    11. MW/s
shmem_get                       36. MW/s    36. MW/s
shmem_put                       36. MW/s    36. MW/s
f90 array op                    28. MW/s    11. MW/s
loop construct                  28. MW/s    11. MW/s

c T3E program to measure memory bandwidth using different techniques
c for copying arrays.  For best (?) performance, compile with:
c  -Ounroll2
c and run on T3E application PEs rather than shell PEs by requesting
c 2 PEs:
c  mpprun -n2 
       program prog
       implicit none
       integer, parameter::SZ=1000000
       real c(SZ), a(SZ) 
       integer i,t1,t2,t3,t4,t5,t6,t7,t8,t9,t10,t11,t12
       integer irtc, ierr, index, iclktck, my_pe, shmem_my_pe
       integer copycnt, copysz

       ! Get machine clock ticks per second
       call pxfconst ('CLK_TCK',index,ierr)
       call pxfsysconf (index, iclktck, ierr)

       ! Get my pe
       my_pe = shmem_my_pe()

       do copycnt=1,6
         ! Copy 10, 100, ..., 1000000 words (sizes taken from the
         ! results table)
         copysz = 10**copycnt
         c = 1

! copy arrays in loop with cache_bypass directive
         a = 0
         t7 = irtc ()
!dir$ cache_bypass c,a
         do  i=1,copysz
          a(i) = c(i) 
         end do
         t8 = irtc ()
         call chkcopy (a, c, copysz, SZ) 

! copy arrays f90 syntax with cache_bypass directive
         a = 0
         t9 = irtc ()
!dir$ cache_bypass c,a
         a(1:copysz) = c(1:copysz)
         t10 = irtc ()
         call chkcopy (a, c, copysz, SZ) 

! copy arrays using shmem_get.
         a = 0
         t11 = irtc ()
         call shmem_get (a, c, copysz, my_pe)
         t12 = irtc ()
         call chkcopy (a, c, copysz, SZ) 

! copy arrays using shmem_put.
         a = 0
         t1 = irtc ()
         call shmem_put (a, c, copysz, my_pe)
         t2 = irtc ()
         call chkcopy (a, c, copysz, SZ) 

! copy arrays using f90 array operation
         a = 0
         t3 = irtc ()
         a(1:copysz) = c(1:copysz)
         t4 = irtc ()
         call chkcopy (a, c, copysz, SZ) 

! copy arrays using loop
         a = 0
         t5 = irtc ()
         do  i=1,copysz
          a(i) = c(i)
         end do
         t6 = irtc ()
         call chkcopy (a, c, copysz, SZ) 

      write (6,'("Copying ", i8, " words:")') copysz
      write (6,*) 

      write (6,1000) "cache bypass loop construct",
     &          (copysz / ((t8-t7)/real(iclktck))) / 1000000

      write (6,1000) "cache bypass f90 array op",
     &          (copysz / ((t10-t9)/real(iclktck))) / 1000000

      write (6,1000) "shmem_get",
     &          (copysz / ((t12-t11)/real(iclktck))) / 1000000

      write (6,1000) "shmem_put",
     &          (copysz / ((t2-t1)/real(iclktck))) / 1000000

      write (6,1000) "f90 array op",
     &          (copysz / ((t4-t3)/real(iclktck))) / 1000000

      write (6,1000) "loop construct",
     &          (copysz / ((t6-t5)/real(iclktck))) / 1000000

1000  format (A,t30,f6.0," MW/s")
      write (6,*) 
      write (6,*) 

       end do
       end


! Called mainly to ensure that compiler doesn't remove the entire
! copy operation as an optimization.       

       subroutine chkcopy (a, c, copysz, SZ)
       integer copysz, SZ
       real c(SZ), a(SZ) 

       integer i
! Verify copy completed.
         do  i=copysz,1,-1
           if (c(i) .NE. a(i)) stop "copy failed"
         end do
       end


CF90 Optimization Options

Users often ask which options are best to use with the Fortran compiler on code they are porting to the T3E.

This is a difficult question, since the effectiveness of each option varies greatly with both the nature of the algorithm and the programmer's expression of the algorithm to the compiler. However, a reasonable procedure is to start with no options and then add the following options incrementally, in this order, noting how performance changes and checking results at each stage. (Some optimisations will change results slightly, since reordering operations changes how rounding errors accumulate. These changes should be small.)

It is also useful to assess the performance of the different parts of the code before starting this exercise, since some options may speed up one part and slow down another. However, applying the options below should result in an improvement with relatively modest effort from the programmer.

          this is a basic optimisation set.

         this increases the internal tables for the compiler and
         allows further modification of user code for performance

       -O3,aggress -apad 
         this pads arrays on which there might be cache conflicts.

       -O3,aggress,unroll2 -apad  
         this starts to unroll loops to exploit multiple operations
         per load into cache from main memory.

       -O3,aggress,unroll2,pipeline2 -apad  
         here pipelining is exploited to try and get one result per
         clock cycle.

       -O3,aggress,unroll2,pipeline2,split2 -apad  
         here loops are split to restrict the number of streams open
         at any one time to that in the hardware.

       -O3,aggress,unroll2,pipeline2,split2 -apad -lmfastv  
         this option uses a faster but not fully IEEE compliant math
         library for intrinsics.

Hopefully, the above seven sets of options will allow users to make a first pass at optimising code with the help of the compiler. After applying these options, the programmer must consider explicit code modification to exploit the features of the underlying architecture. More on these changes in future newsletters.

For more information, these options are described in greater depth, with some examples, in the Cray document "The Benchmarker's Guide to Single-processor Optimisation for Cray T3E Systems." It is available in postscript via anonymous ftp to: . It is in the directory: pub/mpp/docs, and is named: .

ARSC User Forum Next Week

On October 13th and 14th, ARSC is holding a forum for users local to Fairbanks. Monday morning presentations by ARSC staff will give updates on ARSC resources. Following that session, there will be a series of short talks by users of those resources. The schedule for the forum is listed below.

All local users are encouraged to attend. The forum will be held in Butrovich 109, the Regents Conference Room.

                             ARSC User Forum

  MONDAY                                                 Oct 13, 1997
9:00 AM    ARSC Presentations

             o    Barbara             Introductions
             o    Frank Williams      Welcome
             o    Guy Robinson        T3E Upgrades and Transition Plans
             o    Tom Baring          Programming Environment 3.0 and
                                      Software Update
10:15 AM   Break

             o    Sergei Maurits      ARSC Visualization Update
             o    Virginia Bedford    ARSC Mass Storage Plans

11:30 AM   Lunch Break
1:00 PM    SAR Processing

             o    Rick Guritz         Technology Prototyping in SAR
                                      Processing at ASF
             o    Thomas Logan        PAISP - The Parallel ASF
                                      Interferometric SAR Processor

             o    Chris Hartman       Use of OpenGL in CS381
             o    Bill Brody          Music is to time as visual art is to
                                      space & Hulahula: an evolving
             o    Sergei Maurits      The Polar Ionosphere Model and its
                                      Real-Time Applications
2:30 PM    Break
2:45 PM    Mechanical Engineering

             o    Jonah Lee           Computational Mechanical Engineering
                                      & Collaborative Computing Using CORBA

             o    Tinggang Zhang      A Numerical Approach to Fatigue


             o    Knut Stamnes        Title?
             o    Gang Chen (Mining)  Time-Dependent Behavior of Soft Rock
                                      Strata in Underground Mines &
                                      Computer Simulation of Blasting

             o    Giray Okten         Applications of Hybrid-Monte Carlo

 TUESDAY                                                 Oct 14, 1997
9:00 AM      o    Barbara             Intro to Day 2

9:05 AM     Atmospheric Science

             o    Bob Andres          Improved visualization of carbon
                                      dioxide emissions from fossil fuel

             o    Jeff Tilly          Regional Modeling in the Western

             o    Alexander Mahura    Atmospheric Transport Pathways for
                                      Pollutants - Trajectory Model Studies
10:15 AM   Break
10:30 AM   Oceanography

             o    David Eslinger      3D Coupled Biological & Physical
11:00 AM   ARSC Tour
11:30 AM   Lunch Break
1:00 PM    Geophysics

             o    Elena Troshina      A Time-Dependent Numerical Model of
                                      the Antarctic Icesheet

             o    Sukumar             Ventilation for Arctic Mines

            Space Science

             o    Antonius Otto       Plasma Processes in the Earth's

             o    Peter Delamere      A Hybrid Code for an F-region
                                      Chemical Release
             o    Daniel Swift        Space Plasma Simulations Using Hybrid
                                      Codes in Generalized Curvilinear
2:30 PM    Break
2:45 PM    Miscellaneous

             o    Chuen-Sen Lin (ME)  Mechanical Design & Motion Analysis

3:00 PM    T3E Presentation

             o    Guy Robinson        Using the CRAY T3E

3:30 PM    ARSC Wrapup

             o    Barbara             Summary of "Care-Abouts"

Quick-Tip Q & A

A: {{ Is this a good idea?  Self-documenting?  Is it even valid!?  }}
  The C-snippet offered last week, which used a question mark-colon
  construct as an lvalue, was extracted from the code to W3's browser,
  "arena."  Here's more of the code:
    #ifdef __STRICT_ANSI__
       if (buffer_cell->prev)
          buffer_cell->prev->next = buffer_cell->next;
       else
          buffer->cell = buffer_cell->next;
    #else
        ((buffer_cell->prev) ? buffer_cell->prev->next : buffer->cell) =
                                                       buffer_cell->next;
    #endif /* __STRICT_ANSI__ */
  Which provides part of our answer. Depending on your compiler, it
  is indeed valid, but does not conform to the ANSI C standard.  
  Why would it be a good idea?  
  If programmer productivity is measured in lines of code per day, then
  it might be a "bad" idea.  If measured in program effort per line of
  code, then it's "good."  More likely, it would be "good" if it leads
  to a more efficient executable.

  To help in testing, here's a trivial program which uses the construct:

      #include <stdio.h> 
      main() {
       int i=0,j=1;

       printf ("i:%d j:%d\n", i, j);

       *(i < j ? &i : &j) = 100;

       printf ("i:%d j:%d\n", i, j);

       if (i < j)
         i = 200; 
       else
         j = 200;

       printf ("i:%d j:%d\n", i, j); 
      }

  What follows is part of the output of the command "cc -S" run on the
  above program (the "-S" option instructs the compiler to translate
  the C-code to assembly). This output was produced on an SGI

  [ ----------------------------------------  2 labels, 8 commands ]
       #   9    *(i < j ? &i : &j) = 100;
       lw      $15, 44($sp) 
       lw      $24, 40($sp) 
       bge     $15, $24, $32
       addu    $16, $sp, 44 
       b       $33 
       addu    $16, $sp, 40 
       li      $25, 100 
       sw      $25, 0($16)
       .loc    2 11

  [ ----------------------------------------  2 labels, 8 commands ]
       #  13    if (i < j)
       lw      $8, 44($sp) 
       lw      $9, 40($sp) 
       bge     $8, $9, $34
       .loc    2 14
       #  14      i = 200;
       li      $10, 200 
       sw      $10, 44($sp) 
       b       $35 
       .loc    2 16
       #  15    else 
       #  16      j = 200;
       li      $11, 200 
       sw      $11, 40($sp) 
       .loc    2 18
       #  17

  The question mark-colon construct didn't simplify the assembly
  output, but it's likely that the indirect addressing obtained
  ("addu") is faster than the direct addressing ("sw").

Q: In Unix, how can you list the contents of the current directory,
   and the contents of every subdirectory within the current directory,
   but not descend any deeper into the depths of the tree of

[ Answers, questions, and tips graciously accepted. ]

Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.