ARSC HPC Users' Newsletter 312, March 25, 2005

Making Jobs (non-) Checkpointable


[Thanks to Lee Higbie, ARSC Vector Specialist]

Whenever there is a system shutdown, whether from a hung processor or for scheduled maintenance, all running jobs must be stopped.

  1. For the IBM systems, this currently requires killing any running jobs, which normally means their output is lost. For this reason, the usual procedure on Iceberg and Iceflyer is to stop the queues well in advance of a scheduled downtime so that no jobs are still running when it begins; thus, no jobs have to be terminated prematurely. You have to program your own restart capability if you want it. Currently, all classes (i.e., queues) on Iceberg are stopped prior to downtime except for the "killable" class, which allows jobs with built-in checkpointing to run up to the beginning of the downtime. The killable class can be specified via the LoadLeveler class keyword, e.g.:
    
    # @ class = killable
    
  2. For the Cray systems, most jobs can be checkpointed by the system checkpoint/restart utilities. This means that the system is able to save the state of running jobs, stop them, and then resume them after the downtime. A job's wallclock time is increased by the duration of the downtime, but otherwise the interruption is generally transparent to the user. Cray's checkpoint/restart functionality allows us to take down the X1 with almost no wasted CPU time by keeping it fully loaded with useful work until the start of the actual downtime.

While most jobs are checkpointable, some are not. If a job is non-checkpointable, it will be terminated and rerun from the beginning after the downtime, unless the user launched it with qsub's "-r n" option (see "man qsub").
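The two mechanisms above can be sketched as follows. The job name, script names, and directives other than "class" and "-r" are placeholders, not ARSC-supplied settings:

```
# LoadLeveler (Iceberg/Iceflyer): opt a job with built-in checkpointing
# into the killable class so it may run right up to the downtime.
# @ job_name   = my_restartable_job
# @ class      = killable
# @ executable = run_model.sh
# @ queue

# PBS (X1): mark a non-checkpointable job non-rerunnable, so it is NOT
# restarted from the beginning after a downtime (see "man qsub"):
#   qsub -r n myjob.pbs
```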

This note describes ways to make your X1 job checkpointable, which benefits you in two ways:

  1. It will save you allocation.
  2. It will give you faster turnaround because your job will not have to restart from the beginning following downtimes.

A well-documented no-no for production X1 programs is using Cray's performance monitoring tool, pat_hwpc, which makes the job non-checkpointable. Even though pat_hwpc's usage overhead is negligible, it should be used only for analysis, never in long time-limit or production jobs.

A related impediment to checkpointing occurs when the executable is built using pat_build and an environment variable such as PAT_RT_HWPC or PAT_RT_RECORD_PE is set.
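A defensive habit (an editor's sketch, not an ARSC requirement) is to clear any CrayPat runtime variables in the batch script before launching a production run:

```shell
#!/bin/sh
# Sketch: unset CrayPat runtime variables so a profiling setting left in
# the login environment doesn't make the production job non-checkpointable.
# PAT_RT_HWPC and PAT_RT_RECORD_PE are from the text; the pattern below
# sweeps up any other PAT_RT_* variable as well.
for v in $(env | sed -n 's/^\(PAT_RT_[A-Z_]*\)=.*/\1/p'); do
  unset "$v"
done
leftover=$(env | grep -c '^PAT_RT_')
echo "PAT_RT_ variables remaining: $leftover"
```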

Other job attributes that prevent checkpointing are listed below. These should all be avoided in long-running production programs:

  • Network socket connections
  • X terminals and X11 client sessions
  • Files opened with a setuid credential that cannot be reestablished

For a complete list of features that are not checkpoint-safe, see "man cpr".
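One cheap pre-submission sanity check (an editor's sketch, not an ARSC-supplied tool) is to look for the X11 item on that list, since an inherited DISPLAY setting is easy to overlook:

```shell
#!/bin/sh
# Sketch: warn about and drop an inherited X11 DISPLAY, since X11 client
# sessions are on the list of attributes that block checkpointing.
if [ -n "$DISPLAY" ]; then
  echo "warning: DISPLAY=$DISPLAY is set; X11 clients block checkpointing" >&2
  unset DISPLAY
fi
echo "DISPLAY is now: '${DISPLAY:-unset}'"
```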

ARSC Spring Training

ARSC spring training starts at the end of March and will include three classes on the Cray X1 presented by James Schwarzmeier of Cray. All courses will be held in the ARSC classroom, West Ridge Research Building, room 009.


Tuesday March 29, 2005 
  Introduction to ARSC Resources: Kate Hedstrom
  1 p.m. - 2 p.m.

Wednesday March 30, 2005
  Co-Array Fortran Programming: James Schwarzmeier, Cray, Inc.
  9 a.m. - Noon

Thursday March 31, 2005
  X1 Optimization: James Schwarzmeier, Cray, Inc.
  9 a.m. - Noon

Friday April 1, 2005
  X1 Macro and Micro Architecture: James Schwarzmeier, Cray, Inc.
  9 a.m. - Noon

Tuesday April 5, 2005
  Introduction to UNIX: Lee Higbie
  1 p.m. - 2 p.m.

Tuesday April 26, 2005
  Introduction to IDL: Sergei Maurits
  1 p.m. - 2 p.m.

James Schwarzmeier will be available to users of the X1 for one-on-one consultation on the afternoons of March 30 through April 1. Contact Tom Logan (tom.logan@arsc.edu / 907-450-8624) for more information on spring training opportunities or to schedule a meeting with James.

Displaying X1 Module Package Versions


[Thanks to Wendy Palm of ARSC for the story idea]

The "pe-version" utility is included in the COS-3.6 module on klondike. This tool displays information for the currently loaded module environment, including package versions for C and Fortran compilers and more. This information can be particularly useful when documenting problems or when reporting problems to ARSC staff for further investigation.


E.g.:
klondike 1% pe-version    
 
*** pe-version (20040303) run at:
Mon Mar 21 11:17:27 AST 2005
 
*** Current system and OS information:
UNICOS/mp klondike 2.5.26 02270947 crayx1
OS_LEVEL is set to: 2.5.26
 
*** Currently loaded Programming Environment components:
# cftn package version is: 5.3.0.1
# CC package version is: 5.3.0.1
# craylibs package version is: 5.3.0.1
# libsci package version is: 5.3.0.0
# craytools package version is: 5.3.0.1
# mpt package version is: 2.4.0.2
# cal package version is: 1.2.0.3
# totalview package version is: 6.3.1.1
 
*** Compiler/Assembler/Loader/CrayPat Versions:
Cray Fortran: Version 5.3.0.1 (u4031f93049i20076p53141a53007e53011x9317)
Cray Fortran: Mon Mar 21, 2005  11:17:28
Cray Standard C: Version 5.3.0.1  Mon Mar 21, 2005  11:17:29
Cray NV Assembler Version 1.2.0.3   (s1203a52020e53010z1056) (Built Aug 18 2004)
Cray X1 ld version 5.3 (l53030) -- Mon Mar 21 11:17:30 2005.
CrayPat:  Version 23.190  11/23/04 09:54:22
 
*** Current Trigger Versions:
   trigsnd: @(#)20/trigsnd.c    20.23   11/04/2003 10:20:32
 trigexecd: @(#)20/trigexecd.c  20.18   11/04/2003 10:20:32
 
Currently Loaded Modulefiles:
  1) modules     3) totalview   5) craytools   7) mpt         9) libsci     11) PrgEnv     13) motif
  2) X11         4) cal         6) cftn        8) CC         10) craylibs   12) pbs        14) open

If you don't have the "open" module loaded, you can run "pe-version" by specifying the full path to the executable.

E.g.:

klondike 2% /opt/open/open/bin/pe-version

While "pe-version" doesn't show versions for all loaded packages it does include the modules used by most users.

SC05 Technical Papers Submission Deadline April 18

SC05 ( http://sc05.supercomputing.org/ ) will accept submissions for technical papers beginning Monday, March 28, 2005. The submission deadline is Monday, April 18, 2005.

Technical Papers Submission Website:     http://www.sc-submissions.org/

Submissions Open:        Monday, March 28, 2005
Submissions Deadline:    Monday, April 18, 2005

Sample submission forms and instructions available at:     http://www.sc-submissions.org/

Bi-Annual Newsletter of Arctic Research Available

ARCUS, the Arctic Research Consortium of the United States, has just delivered its Winter 2004-2005 edition of "Witness The Arctic." This is a hefty but readable 32-page review of current Arctic research.

You can download the PDF or subscribe at:     http://www.arcus.org/Witness_the_Arctic/Winter_04_05/Contents.html

The Arctic Climate Impact Assessment, released last November, is the subject of the lead article.

Quick-Tip Q & A


A:[[ Okay, I've settled on one favorite shell (ksh) which works on all six
  [[ (yes, 6!) different flavors of Unix I use every week.  Now my problem
  [[ is the shell initialization files, .profile and .kshrc.  I'll add a
  [[ new alias on one system, a library path on another, a nifty prompt
  [[ here, a function there... and now my dot files are different, and I
  [[ can't remember which alias I've got here, which there, and it's
  [[ making me crazy.  Has anyone else ever had this problem?  And solved
  [[ it!?

#
# Many thanks to Wendy Palm:
#
I've run into similar problems before, and the questioner is right: this can be
annoying, especially since you can't just copy the same .profile over
to every login you have.  Operating systems are different, and
therefore your environment needs to be different too (you don't want
to be trying to load Cray modules on your SGI, do you?).

Therefore, what you need to do is to create your own environment file
with all the stuff you want the same on all the machines (like primary
and secondary prompt, aliases, general functions etc.), and source it
in the various login dotfiles, leaving all the system-specific things
in the login dotfiles.

For instance, I'd call this something like ".personal": 

.personal
#------------
PS1=`hostname`"> " 
# whatever functions and aliases you set up
#------------

Then, in the .profile in your home directory on each system, put:
  . $HOME/.personal

Then, if you want to change/add a function, you'd make the change once
and just copy the file to all the systems you have accounts on.
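A small refinement (editor's sketch): guard the source line so a login on a machine where .personal hasn't been copied yet doesn't print an error:

```shell
#!/bin/sh
# Sketch: source the shared settings file only if it is actually there.
personal="${HOME}/.personal"
if [ -r "$personal" ]; then
  . "$personal"
else
  echo "note: $personal not found; using system defaults only" >&2
fi
```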

Not all systems pay attention to the .kshrc file (Crays in particular),
so I tend to not use it at all.

Hope this helps.


#
# Here's a variation from the editor.
#

I maintain a file called "ALL.profile" on my desktop system. It contains
all my general purpose aliases, functions, etc. and then, in a series of
korn shell "if" blocks, all the settings for specific systems.

If I need to change something for the IBM, for example, I edit
ALL.profile and then run a script which pushes the new ALL.profile out
to all the other systems as ~/.profile.  The advantage is that I have
one, and only one, profile file to worry about.

Here's a heavily edited sample:

#--------------------------------------------------------------------
# ALL.profile
#--------------------------------------------------------------------
export OSNAME=$(uname)

alias           \
  mo='more '    \
  mor='more '   \
  mroe='more '

#--------------------------------------------------------------------
# IRIX
#--------------------------------------------------------------------
if [[ IRIX = ${OSNAME} ]]; then
  # Set the interrupt character to Ctrl-C and do clean backspacing.
  if [ -t 0 ]
  then
    stty intr '^C' echoe 
  fi
fi 

#--------------------------------------------------------------------
# Mac OSX
#--------------------------------------------------------------------
if [[ Darwin = ${OSNAME} ]]; then
  alias vim='open -a ~/Applications/vim/Vim.app ' 
fi
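The push script mentioned above might look something like this (the hostnames are placeholders, and scp is an assumption; any remote-copy command works):

```shell
#!/bin/sh
# Hypothetical push script: copy ALL.profile to each remote account's
# ~/.profile.  The hostnames below are placeholders, not real ARSC hosts.
for host in hostA hostB hostC; do
  scp ALL.profile "${host}:.profile" ||
    echo "warning: copy to $host failed" >&2
done
```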


#
#    ... BONUS ANSWER ... 
# 
# Here's another solution to the web page retrieval question.
# Thanks to Dale Clark: 

Lacking curl, lynx, or wget, one can always use telnet to fetch a
page, in a minimalist manner. Example:

  loon> telnet www.arsc.edu 80
  Trying 199.165.84.118...
  Connected to www.arsc.edu (199.165.84.118).
  Escape character is '^]'.
  GET /misc/staff.html HTTP/1.0

... # Document text omitted.

Note that the GET line must be followed by a blank line, i.e., press Return twice after typing the GET command, or the server will keep waiting for more headers.
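To see exactly what those terminating bytes are (an illustration added by the editor, not part of the original tip), pipe the request through od:

```shell
#!/bin/sh
# Sketch: the raw bytes of the request typed in the telnet session.
# The request ends with two CR/LF pairs, i.e. a blank line.
printf 'GET /misc/staff.html HTTP/1.0\r\n\r\n' | od -c
```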


Q: I often find myself comparing versions of source files trying to
   figure out what changed between the versions.  Good old-fashioned
   diff works just fine, but there's got to be a more modern solution.
   Do you know of any text editors or other tools that have file
   comparison functionality built in?

[[ Answers, Questions, and Tips Graciously Accepted ]]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.