ARSC HPC Users' Newsletter 340, May 12, 2006

Book Raffle


[ Thanks to Lee Higbie for an update on the book raffle ]

The response to the Raffle (Issue 339) has been so great that some entrants will not receive a copy of Wicked Cool Java. With the current number of entrants, the probability of winning is still high. Submit your raffle entry to higbie (at) arsc.edu by the deadline on May 31st.

Also, given the number of answers to the problems (see end of review in Issue 339), solutions to either of them appear certain to win a book. Prove me wrong. Submit a solution. If there are no solutions, I'll give both books to raffle entrants.

For more information on the contest see: /arsc/support/news/hpcnews/hpcnews339/index.xml#article3

Optimizing XD1 Codes with SSE Hardware Instructions

The AMD Opteron processor provides the computational power for ARSC's Cray XD1, nelchina. Like Intel-compatible processors back to the Pentium II, the Opteron supports single instruction, multiple data (SIMD) operations (a.k.a. vector instructions). These days the SIMD support is known as Streaming SIMD Extensions (SSE), which Intel introduced with the Pentium III and later improved with the Pentium 4. Newer Opteron processors support SSE, SSE2, and SSE3. SSE2 offers significant improvements over the older SIMD support (i.e. MMX, 3DNow! and SSE) by adding eight 128-bit-wide multi-purpose registers. There are some advantages to using SSE instructions over the traditional x87 floating point unit:

  • For single precision codes, SIMD instructions double the theoretical peak number of multiply and add instructions that can occur per clock cycle. The x87 floating point unit allows 1 multiply and 1 add per clock cycle.
  • For double precision codes there is no theoretical performance gain over the x87 floating point unit, though the SIMD instructions will increase the code density (i.e. reduce the number of assembly instructions).
  • Since SSE2 instructions use a flat register file instead of a register stack like the x87 floating point unit, some optimizations are simpler to perform.

I was interested in how much of a difference SSE instructions would make to the performance of a code, so I borrowed Guy Robinson's GFLOP contest winning code from issue 213 (see link below). Back in 2001, this code attained 1037.77 MFLOPS on ARSC's old Cray SV1 vector machine. The original goal of the GFLOP contest was to get as close as possible to the peak processor performance of the SV1's vector processors. Because the GFLOP code was designed to run optimally on the SV1, it was unlikely to get as close to the peak theoretical performance on the Opteron as it did on the SV1.

The documentation for the Portland Group compilers, which are available on the XD1, recommends the following flags to get started quickly with optimizations:


pgf90 -fastsse -Mipa=fast gflop.f

Since GFLOP doesn't have any subroutines, I dropped the "-Mipa=fast" flag from the compiler flags. Here are some performance results for various optimization flags.

Optimization Flags   Real Time (s)   MFLOPS    % of Peak Theoretical
-O0                  119.95           774.90   16.14 %
-O1                  111.23           829.48   17.28 %
-O2                   89.18           971.71   20.24 %
-O3                   89.20           971.08   20.23 %
-fastsse              85.91          1006.56   20.97 %
-fast                 84.90          1021.36   21.28 %

NOTES:

  1. MFLOPS numbers are from the PAPI library (see reference "C").
  2. Theoretical Peak is 4.8 GFLOPS for 2.4 GHz AMD Opteron Processors.
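
The "% of Peak Theoretical" column is just the measured MFLOPS divided by the 4800 MFLOPS peak from note 2 (which matches 2.4 GHz times the one multiply plus one add per clock cycle mentioned earlier). A minimal sketch of that arithmetic, assuming awk is available; the function name pct_of_peak is made up for illustration and is not part of PAPI or the Portland Group tools:

```shell
# pct_of_peak is a hypothetical helper, not part of PAPI or the PGI tools.
# Usage: pct_of_peak MEASURED_MFLOPS [PEAK_MFLOPS]
pct_of_peak() {
    # Peak defaults to 4800 MFLOPS (4.8 GFLOPS, note 2 above).
    awk -v m="$1" -v p="${2:-4800}" 'BEGIN { printf "%.2f\n", 100 * m / p }'
}

pct_of_peak 774.90     # the -O0 row     -> 16.14
pct_of_peak 1021.36    # the -fast row   -> 21.28
```

Passing a second argument lets the same helper be reused for a processor with a different peak.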

So, as it turns out, for the GFLOP code the "-fast" optimizations slightly outperform "-fastsse". When I saw this, it made me curious whether the "-fast" optimizations were using any SSE2 instructions. Adding the "-Mkeepasm" compiler option tells pgf90 to keep the intermediate assembly that it creates. When I recompiled with the new compiler flags, I found that the "-fast" optimizations do indeed use SSE instructions.


pgf90 -fast -Mkeepasm gflop.f -o gflop.fast

The names of the SSE registers all begin with "xmm", so a grep of the assembly code will show all of the SSE instructions.


nelchina 1% grep xmm gflop.s 
        movss   .C1_287(%rip),%xmm1
        movaps  %xmm1,%xmm0
        movss   %xmm0,-484(%rcx)
        cvtsi2ss        %r14d,%xmm2
...
...

At this point I was curious which instructions were being used. Since SSE registers can hold either integer or floating point values, the presence of an "xmm" register reference doesn't necessarily mean that the instruction is a floating point instruction.


nelchina 2% grep xmm gflop.s | while read i j ; do echo $i; done | sort -u
addss
cvtsi2ss
divss
movaps
movss
mulss
subss
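
A variation on the pipeline above also counts how often each instruction appears, which gives a rough idea of which operations dominate. This is only a sketch: sample.s is a small stand-in for gflop.s (the four lines shown earlier plus two made-up lines), and count_sse_ops is a hypothetical helper name.

```shell
# Stand-in for gflop.s: the four lines shown above plus two made-up
# lines so that at least one mnemonic repeats.
cat > sample.s <<'EOF'
        movss   .C1_287(%rip),%xmm1
        movaps  %xmm1,%xmm0
        movss   %xmm0,-484(%rcx)
        cvtsi2ss        %r14d,%xmm2
        mulss   %xmm1,%xmm0
        addss   %xmm2,%xmm0
EOF

# count_sse_ops is a hypothetical helper name.
# Prints "COUNT MNEMONIC" for each instruction that references an xmm
# register, most frequent first.
count_sse_ops() {
    grep xmm "$1" | awk '{ print $1 }' | sort | uniq -c | sort -rn |
        awk '{ print $1, $2 }'
}

count_sse_ops sample.s
```

On the real file, "count_sse_ops gflop.s" would do the same job as the "sort -u" version while keeping the counts.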

A web search shows that these are definitely floating point SSE instructions (see reference "D"). Here's what Intel's documentation says about "ADDSS":


    ADDSS               Add Single Scalar
    
    Opcode      Cycles  Instruction
    F3 0F 58    1 (3)   ADDSS xmm reg,xmm reg/mem32
    
    ADDSS op1, op2
    
    op1 contains 4 single precision 32-bit floating point values
    op2 contains 1 single precision 32-bit floating point value
    
        op1[0] = op1[0] + op2
        op1[1] = op1[1]
        op1[2] = op1[2]
        op1[3] = op1[3]
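
The suffix of an SSE mnemonic indicates the data layout: "ss" is scalar single precision (only element 0 changes, as in the ADDSS description above), "ps" is packed single precision (all four elements), and "sd" and "pd" are the double precision counterparts. The sketch below labels a list of mnemonics by suffix. It is a rough heuristic rather than a real decoder, and classify_sse is a hypothetical helper name.

```shell
# classify_sse is a hypothetical helper: it reads mnemonics, one per line,
# and labels each by its suffix. This is a rough heuristic, not a decoder.
classify_sse() {
    while read -r op; do
        case "$op" in
            *ss) echo "$op scalar-single" ;;
            *sd) echo "$op scalar-double" ;;
            *ps) echo "$op packed-single" ;;
            *pd) echo "$op packed-double" ;;
            *)   echo "$op other" ;;
        esac
    done
}

# The seven mnemonics found in gflop.s
printf '%s\n' addss cvtsi2ss divss movaps movss mulss subss | classify_sse
```

For the gflop.s list, only movaps comes out as packed, and it is a register move, which suggests that the arithmetic in this build is done one element at a time.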
    

As it turns out, the Portland Group compilers will use SSE instructions even at "-O0" for Opteron processors using 64 bit addressing, so the difference in performance between optimization levels is not based strictly on the use of SSE instructions.

See "pgf90 -fast -help" and "pgf90 -fastsse -help" to see which options "-fast" and "-fastsse" imply.

In a future article, we will discuss the PAPI library, which we used to get the performance numbers for this article, so stay tuned.

References

  • AMD Software Optimization Guide for AMD64 Processors; Publication #25112; Revision 3.06; Issue Date: September 2005 http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF

IBM: OBJECT_MODE Environment Variable

Iceberg and iceflyer both support 32 and 64 bit addressing. The default addressing scheme is 32 bit; however, the environment variable OBJECT_MODE allows one to override the default.

The OBJECT_MODE environment variable is understood by a number of IBM commands including:

  • ar (substitutes for -X32, -X64 or -X32_64)
  • as (substitutes for -a32 and -a64)
  • ld (substitutes for -b32 and -b64)
  • dump (substitutes for -X32, -X64 or -X32_64)
  • nm (substitutes for -X32, -X64 or -X32_64)
  • xlc, xlf and other compilers (substitutes for -q32, -q64)

The OBJECT_MODE environment variable can be particularly useful when you are trying to build a 64-bit application and libraries. Rather than altering the Makefile, you can simply set the OBJECT_MODE variable and build.

e.g.


export OBJECT_MODE=64
./configure
make

You may need to alter the library and include path for the 64 bit version.

The values of OBJECT_MODE which are valid for all commands are 32 and 64. A third value, 32_64, is only valid for the ar, dump, and nm commands.
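
If only one build step should be 64-bit, the variable can also be set for a single command instead of being exported for the whole session. This is standard Bourne-shell behavior rather than anything specific to AIX. The sketch below uses "sh -c" as a stand-in for make or a compiler so that it runs anywhere.

```shell
# "sh -c ..." stands in for a tool that reads OBJECT_MODE (make, xlc, ar, ...).
unset OBJECT_MODE

# Set for this one command only:
OBJECT_MODE=64 sh -c 'echo "build sees OBJECT_MODE=${OBJECT_MODE:-unset}"'

# The current shell, and anything started from it later, is unaffected:
sh -c 'echo "later commands see OBJECT_MODE=${OBJECT_MODE:-unset}"'
```

In a real build the first form would look like "OBJECT_MODE=64 make", leaving the default 32 bit mode in place for everything else.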

Quick-Tip Q & A

    
    
    A:[[ How can I run a remote command on a system and use the remote 
      [[ values for $HOME, $WRKDIR, etc. instead of the local values?
      [[
      [[ For example, I'd like this command to work as I obviously intend 
      [[ it to work (even if WRKDIR is defined differently on the local 
      [[ and remote systems):
      [[
      [[   scp -r remotesys:$WRKDIR/mymegamodel/answers .
    
      #
      # Thanks to Jed Brown:
      #
      You just have to escape shell expansion.  Thus, this
    
        % scp -r 'remotesys:$WRKDIR/mymegamodel/answers' .
    
      or even this,
    
        % scp -r remotesys:\$WRKDIR/mymegamodel/answers .
    
      works just fine.
    
    
      #
      # From Scott Kajihara:
      #
      This is a common question. The command should read
    
         scp -r remotesys:'$WRKDIR/mymegamodel/answers' .
    
      Note: it is important that this be single quotes so that the
      environment variable is not expanded. Quoting also prevents wildcard
      expansions on the local machine.
    
    
      #
      # And thanks for an in-depth explanation, from: ./Greg-Newby --verbose
      #    :-)
      #
      Success will depend on where such variables are defined, and have some
      shell sensitivity.  Quoting from "man tcsh"
    
          Non-login shells read only /etc/csh.cshrc and ~/.tcshrc or
          ~/.cshrc on startup.
    
      Other shells (ksh, bash...) are similar.
    
      When you do an scp, it really runs an ssh shell, but the shell is not
      a login shell.  To test whether a variable is defined for non-login
      shells, try the "echo" command:
    
          ssh remotesys echo '$HOME'
          ssh remotesys echo '$WRKDIR'
    
      Use single quotes, not double quotes.  Double quotes will be evaluated
      by your local shell:
    
          WRONG:  ssh remotesys echo "$HOME"
          yields: $HOME on your local system
    
          RIGHT:  ssh remotesys echo '$HOME'
          yields: $HOME on the remote system
    
    
      In this example, it's fine to place the quotes in different places, as
      long as the variable itself is quoted.  As for any variable use in a
      shell, you can use curly braces to separate the variable name in case
      it is ambiguous.
    
      Examples:
    
          scp -r remotesys:'$WRKDIR'/mymegamodel/answers .
          scp -r remotesys:'$WRKDIR/mymegamodel/answers' .
          scp -r remotesys:'${WRKDIR}'/mymegamodel/answers .
    
      One way I often use remote variable expansion is for lazy path
      globbing.  Let's say I want to get:
          remotesys:'somelongdirectoryname/someuniquefilename' 
      and don't mind making the remote system work harder to match
      filenames... or, perhaps I want all files from a particular remote
      directory.  The * wildcard works great (you could also use the ?
      wildcard if you want to match single characters):
    
          scp -r remotesys:some'*'/someunique'*' .
      or, scp -r remotesys:allmatching.'*' .
      or, scp -r remotesys:mysubdir/'*' .
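
The quoting rules in the answers above can be tried without a remote system. In this minimal sketch, remote_shell is a hypothetical helper that runs a command string in a child shell with its own WRKDIR, roughly as the shell started by ssh or scp on remotesys would. The two directory names are invented for the demonstration.

```shell
# remote_shell is a hypothetical local stand-in for "ssh remotesys ...":
# it evaluates its argument in a child shell that has a different WRKDIR.
remote_shell() {
    WRKDIR=/remote/wrkdir sh -c "$1"
}

WRKDIR=/local/wrkdir           # the value on the "local" system

remote_shell "echo $WRKDIR"    # double quotes: expanded locally  -> /local/wrkdir
remote_shell 'echo $WRKDIR'    # single quotes: expanded remotely -> /remote/wrkdir
remote_shell "echo \$WRKDIR"   # backslash: expanded remotely     -> /remote/wrkdir
```

The same three spellings applied to the scp command give the same split: only the single-quoted and backslash-escaped forms pass the literal text $WRKDIR to the remote side.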
    
    Q: What development environments do you use to write
       C/C++/FORTRAN/<other> code, and how do you manage your (possibly
       many) source code files?  What editors and other tools do you 
       use?
    
    

    [[ Answers, Questions, and Tips Graciously Accepted ]]


    Current Editors:
    Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
    Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
    Arctic Region Supercomputing Center
    University of Alaska Fairbanks
    PO Box 756020
    Fairbanks AK 99775-6020
    Archives:
      Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.