ARSC HPC Users' Newsletter 320, July 15, 2005

ARSC Symposium and Workshop on Reconfigurable Computing

ARSC is hosting a symposium and hands-on workshop on high-performance reconfigurable computing, using conventional and FPGA processors. It will be held in Fairbanks, Alaska, August 22-24, 2005. For details, a list of speakers, registration and local arrangements, see:

http://www.arsc.edu/news/archive/FPGA.html

Inlining in Cray C/C++


[ Thanks to Yiyi Yao, Ph.D. candidate at George Washington University,
for sharing his X1 inlining experiences and inspiring this article ]

The Cray Programming Environment 5.4 includes improved inlining functionality for C and C++. Previous versions of the Cray C compiler supported inlining only within a single file; the latest C compiler adds inlining of functions defined in external files or directories. (Note: the Cray Fortran compiler (ftn) has always provided this functionality.) This enhancement has the potential to greatly improve performance for C and C++ codes by allowing more of the code to vectorize and multistream.

The compiler flag "-h ipaN" designates how aggressively the compiler should attempt to inline code. With N=0, no inlining is done, while N=5 designates aggressive inlining, where the compiler attempts to inline all function calls. Note: aggressive inlining can substantially increase the size of the resulting executable, but may also greatly improve application performance. See the "ipa" section in "man cc" for complete details.

The flag "-h ipafrom=source[:source]" indicates that the compiler should attempt to inline functions from the specified files or directories.

Here are a few example compile statements:

Standard single file inlining:

cc -h ipa3 main.c -o main

Multiple file inlining:

cc -h ipa3 -h ipafrom=vec.c:main.c main.c -o main

When using the "ipafrom" compiler flag, be sure to include the file being compiled if it contains functions that you would like to have inlined.

User Experience:

We were developing/porting an image registration application from sequential C to UPC on the Cray X1. The purpose of the application is to determine whether two SAR images are identical. The images are allowed to be slightly misaligned. The algorithm searches the space from 0 degrees to 10 degrees, applying a 2D correlation to determine whether the images match.

The standard rotation functions (which scan from 0 degrees to n degrees) and correlation functions are implemented as subroutines. The main procedure is a 3-level nested loop ('for' loops in this case), and it calls those subroutines when needed.

Since the Cray compiler cannot multistream and vectorize loops containing function calls, inlining is essential to ensure good application performance.

With optimizations set to O2, the Cray PE 5.4 does a great job of detecting such function calls and will automatically inline these calls within a single file. With this level of optimization there is no need to specify inlining pragmas (e.g., "#pragma _CRI inline [function name]").

The drawback of inlining is that it can generate huge executables. To avoid this, one should inline only the critical functions; in our application these are the 'Rotate' and 'correlate' subroutines. The easiest way to do that is to use directive-based inlining (i.e., "-h ipa1").

Here's an excerpt showing the pragmas in our code:


...

#pragma _CRI inline Translate
#pragma _CRI inline Rotate
#pragma _CRI inline correlate
#pragma _CRI inline summation

...

int main(int argc, char **argv)
...

Where the following compile statement was used:


cc -h upc -h ipa1 regis.c -o regis

Note that "-h ipa1" indicates that user-defined pragmas will be honored.

Had one of the subroutines been in another file, inlining could have been enabled using the "ipafrom" compiler flag, e.g.:

cc -h upc -h ipa1 -h ipafrom=regis.c:rotate.c regis.c -o regis

See HPC Newsletter 274, "Inlining in Cray FTN," for additional inlining techniques: /arsc/support/news/hpcnews/hpcnews274/index.xml .

Creating Sequences of Batch Jobs in PBS: Part II of II

As described in the first part of this series (issue #319, /arsc/support/news/hpcnews/hpcnews319/index.xml#article1 ), the X1 batch scheduler allows you to submit a string of jobs which depend upon each other in various ways. In this article, we dive into a specific example, which may help illuminate the dependency feature and its limitations.

The example is a hypothetical program which runs on 4 processors and is expected to converge in some unknown time under 48 hours. Thus, in klondike's 16 hour "small" queue, it could take 1, 2, or 3 runs to converge. It writes frequent restart files, so when a run is terminated (e.g., it smashes into the walltime limit), the next run can pick up without significant loss of effort. After the program converges, a post-processing job is to be run.

The question is: can we submit the three primary PBS jobs plus the one post-processing job all at once (and go home for the weekend)? We can try to construct the desired sequence using the "qsub" attributes mentioned in the previous issue:


  "afterok" -- which starts a job if its predecessor ends without error.
  "afternotok" -- starts job if predecessor ends on error.
  "afterany" -- starts job when predecessor ends -- for any reason.

Here's more information, from testing:

  1. if some PBS job ("AA") runs out of walltime, then PBS kills it. PBS considers this exiting on error.
  2. if a job ("BB") is waiting on "AA" with "depend=afternotok", it will start when "AA" exits on error.
  3. if a job ("BB") is waiting on "AA" with "depend=afterok", it will be **deleted** from the queues when "AA" exits on error. No trace of it will remain on the system (e.g., no ".o" nor ".e" file will be produced). PBS will alert the user by sending an email message.
  4. if a job ("CC") is waiting on "BB" with any dependency, and "BB" is deleted from the queues (as in 3) then job "CC" will continue to wait forever and ever... until you "qdel" it.

Here's a sequence of submitted test jobs:


  klondike$ qsub tt_1.pbs
  36005.klondike
  klondike$  qsub -W depend=afternotok:36005 tt_2.pbs
  36006.klondike
  klondike$  qsub -W depend=afternotok:36006 tt_3.pbs
  36007.klondike
  klondike$ qsub -W depend=afterany:36007 tt_post_process.pbs
  36008.klondike

The test was devised so that:

  1. tt_1.pbs would run out of time (end on error)
  2. tt_2.pbs would complete successfully (end without error)
  3. tt_3.pbs would complete successfully (end without error)
  4. tt_post_process.pbs would complete successfully

And the observed result was:

  • tt_1.pbs ran and ended on error.
  • tt_2.pbs started when tt_1 was killed. tt_2 ended normally.
  • tt_3.pbs never ran, being deleted from the queues when tt_2 ended. PBS emailed me with this information.
  • tt_post_process.pbs remained in the queues with status "H."

Everything worked, except the post-processing job. (If anyone knows how to make this work in the given scenario, let me know and I'll pass it on through the newsletter.)

--

For the curious, here's the email I received from PBS, regarding the deleted job, tt_3:


Date: Tue, 21 Jun 2005 13:29:55 -0800 (AKDT)
From: root-klondike <root@klondike.arsc.edu>
Subject: PBS JOB 36007.klondike

PBS Job Id: 36007.klondike
Job Name:   tt_3.pbs
Aborted by PBS Server
Job deleted as result of dependency on job 36006.klondike

Here's how the dependent jobs appear when they're in the queues, waiting for a dependency to be satisfied:


  klondike$ qstat -a | grep userid
  36008.klondike  userid   small   tt_post_pr    --   --  --    --  00:05 H   --

In the qstat output, status "H" indicates "Hold." Note that jobs on klondike can be held by sysadmins (to move high priority work forward, prepare for system maintenance, etc...) and these jobs are given the same status.

The indication that the "Hold" is due to a dependency that you have created yourself is in the full qstat listing for the job ("qstat -f"), under "Hold_Types". For example:


  klondike$ qstat -f 36008 | grep Hold
      Hold_Types = s

From "man qhold," here are the possible hold types:


  u - USER
  o - OTHER
  s - SYSTEM
  n - None

"u" tells you that some user, like a sysadmin, has held your job. "s" indicates that it is a PBS hold, due to qsub -W depend=.... If you execute "qstat -f" on your dependent jobs while the sysadmins are holding everything in preparation for downtime, you'll see both hold types listed:


 Hold_Types = us

After downtime, when the sysadmins release the user hold, your job will return to its earlier state:


 Hold_Types = s

In addition, "qstat -f" will list the job's dependencies.

Quick-Tip Q & A


A:[[ Read any good computation/parallel programming/science books
  [[ recently?  If so, send title and short review.

  #
  # Thanks to Kate Hedstrom for our sole recommendation: 
  #
  "The Northern Lights" by Lucy Jago tells the story of Birkeland, the
  scientist who discovered how the aurora works. It is well-written and
  fascinating. Lots of good reviews at Amazon.com.


Q: Is there a way to tell whether or not a gzip file is complete?  I'd 
   rather not extract the contents, but would like to verify that the
   compression operation was successful.

[[ Answers, Questions, and Tips Graciously Accepted ]]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.