ARSC T3E Users' Newsletter 139, March 19, 1998

ARSC Announces Storage Upgrade and New J932

This announcement was e-mailed to all active ARSC users last week:

ARSC is pleased to announce the delivery of a new CRAY J932 vector supercomputer. This machine will be used as both a compute engine and a storage server. Our J932 is named Chilkoot after the trail that brought gold miners to the Yukon a century ago. Chilkoot arrives this week and will be available to users in late April. Denali is expected to be decommissioned in late May. ARSC staff will be available to assist users in moving between the two systems.

The J932/12-8192 is initially configured with 12 CPUs, 8 GB of memory and more than 240 GB of disk; it has ample room for future expansion. Over the next few months, ARSC will be evaluating user needs for additional processors, faster processors and/or additional memory. Your input in this assessment is highly valued. Other upgrades that are currently planned include an upgraded STK Silo which will be attached to the J932; additional disk space to optimize I/O performance on Yukon; and the addition of an ImmersaDesk (a 4 foot by 6 foot, rear-projection interactive stereo video display) to the Visualization Laboratory in the Butrovich building on the UAF Campus.

[ ... ]

For more information:

http://www.arsc.edu/pubs/bulletins/Transition.html

Major Upgrade to Yukon's Storage Environment

More details will be forthcoming, but the announced storage upgrade will provide many benefits and opportunities to yukon users. Here's a quick overview:

  • /tmp file system on new high-speed disk for application support.
  • All yukon disks under DMF control.
  • Duplicate DMF copies of all files.
  • Nightly backups of user file systems.
  • Two new file systems will be cross-mounted between yukon, chilkoot, and the ARSC network of SGI visualization hosts. This will create new opportunities for yukon users. For example, yukon output can be made immediately accessible to CRL (via chilkoot), and to Vis5d, IDL, Alias|Wavefront, and video production equipment (via the SGIs).

Yukon Checkpoint Schedule


>From "news chkpnt_sched" on yukon:

  ARSC has been regularly checkpointing jobs in order to allow jobs in
  the large queues to run and to take system downtime.  We have also
  been contacting all users whose jobs have been checkpointed.

  As of Mar. 16, 1998, we will cease contacting users whose jobs have
  been checkpointed.  As the system has proven very reliable, this is
  no longer deemed necessary.

  We will continue to checkpoint jobs on the schedule,

        Tuesdays  at 16:00
        Thursdays at 16:00
        Fridays   at 16:00

  generally releasing jobs by 17:00 if downtime is not scheduled for
  later on the given evening.  We will also continue to checkpoint jobs
  at other, unscheduled, times, depending on the system load.

An Embarrassing Problem?

Many users think parallel processing is complex.

Well, getting all those processors to cooperate isn't simple, and it does require more effort than serial programming. However, as with any complex activity, it is often a good idea to break the complexity down into simpler parts.

The simplest form of parallel processing, and one used by several projects at ARSC, is often termed "embarrassingly parallel." Typically, the user has a lot of very similar items of work to do which don't require any interaction between them.

The code below demonstrates a trivial example of this, where each processor reads and writes its own files without any interaction with the others. Processor 0 reads file dread_a and writes dwrite_a, processor 1 reads dread_b and writes dwrite_b, etc. All processors read a common parameters file. Much of the parallelization effort goes into getting the data into the input files and collecting the results afterwards, all of which can be done "by hand" or by serial pre- and post-processing scripts.


!****************************************************************************
      program main

      implicit none

      include 'mpif.h'

!! MPI information.
      integer ierr
      integer myid, numprocs
!! master controlling processor number.
      integer mproc

!! files to use.
      character(80) paramfile,readfile,writefile
      integer isize
      character*1 noff

!! argument list reading data
      integer numargs,i
      integer arglength,ierror
      character*80 argument
      external ipxfargc
      integer ipxfargc

      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )

!! set which processor is master processor
      mproc=0

      if ( myid .eq. mproc ) then

!! read command line argument
        numargs = ipxfargc()
        if(numargs.gt.1) then
          write(6,*) ' ERROR: only one argument expected/allowed '
        endif
        do i=0,numargs
          call pxfgetarg(i, argument, arglength, ierror)
          if (ierror.ne.0) then
             write(6,*) ' ERROR: Argument ',i,' could not be read ',ierror
          endif
        enddo
        write(6,*) ' working on ',numprocs,' items starting at ',
     $     argument

      endif


!! sync after master work and broadcast the packet identifier
!! (the first character of the argument) to the slaves.
      call MPI_BARRIER(MPI_COMM_WORLD,ierr)
      call MPI_BCAST(argument,1,MPI_CHARACTER,mproc,MPI_COMM_WORLD,ierr)

!! determine which files each processor will work on.
      noff=achar(iachar(argument(1:1))+myid)
      print *, "Process ", myid, " of ", numprocs, " doing ",
     $     noff

      paramfile="dparams"
      readfile="dread"
      writefile="dwrite"


      paramfile=trim(paramfile)
!! add identification to end of each filename
      isize=len_trim(readfile)
      readfile=readfile(1:isize)//'_'
      readfile=trim(readfile(1:isize+1)//noff)

      isize=len_trim(writefile)
      writefile=writefile(1:isize)//'_'
      writefile=trim(writefile(1:isize+1)//noff)


      write(6,*) ' processor ',myid,' reads from ',
     $      paramfile(1:len_trim(paramfile)),' and ',
     $      readfile(1:len_trim(readfile)),
     $      ', writes to ',
     $      writefile(1:len_trim(writefile))


!! call code to work on these files.
      call code(paramfile,readfile,writefile)


!! finalise MPI services.
      call MPI_FINALIZE(ierr)
      stop
      end

!!@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      subroutine code(paramfile,readfile,writefile)


!! mock serial code.
!! NOTE filenames are arguments for code but NO information about the
!! processor number or number of processor is passed down.

      implicit none

!!filenames
      character(80) paramfile,readfile,writefile

!local variables
! channels for open files
      integer ipchan,irchan,iwchan
! data to read
      integer ipdata, irdata
! data to write
      integer iwdata


!! assign Fortran unit numbers for the files.
      ipchan=10
      irchan=11
      iwchan=12

!! read parameters, same on all processors.
      open(ipchan,file=paramfile,status='old',err=999)
      read(ipchan,*) ipdata
      close(ipchan)
      print *, ' code  reads parameter ',ipdata

!! read different input according to filename
      write(6,*) ' readfile is ',readfile
      open(irchan,file=readfile,status='old',err=999)
      read(irchan,*) irdata
      close(irchan)


!! simple work.
      iwdata=irdata+ipdata

!! write output to a specific file.
      write(6,*) ' writefile is ',writefile
      open(iwchan,file=writefile,status='new',err=999)
      write(iwchan,*) iwdata
      close(iwchan)


      return


!! very simple error trap/should be much improved in a real code.
 999  write(6,*) ' error in code!!! '
      return
      end

!!@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

The procedure, "code," is simply what was the main program of the serial version, with arguments added to carry the work packet identification. I/O statements might need to be modified to read from files with the work packet identifier included in their names. Here, a letter has been appended to identify each packet, but users could form this identification in many other ways. In this example only one letter is used, so the maximum number of tasks processed at any one time is 26.
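
If more packets are needed than letters allow, a numeric identifier works equally well. Here is a minimal sketch, assuming the command-line argument now holds an integer offset rather than a letter; the names "npack" and "ioff" are hypothetical:

!! sketch: three-digit packet identifiers, allowing up to 1000
!! packets.  "ioff" is a hypothetical integer offset taken from
!! the command-line argument in place of the letter used above.
      character*3 npack
      integer ioff

      read(argument,*) ioff
      write(npack,'(i3.3)') ioff+myid

      isize=len_trim(readfile)
      readfile=readfile(1:isize)//'_'//npack

With this scheme, "mpprun -n 4 ./prog 17" would have processor 0 work on dread_017, processor 1 on dread_018, and so on. (The MPI_BCAST in the main program would then need to broadcast the whole argument string, e.g. with a count of 80, rather than just its first character.)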

Extra demands are placed on the code if this is to work successfully:

Error Handling.
Since MPI_FINALIZE is actually a synchronization event, if any processor quits with an error, the remaining processors will hang when they reach MPI_FINALIZE. In this example, errors in the filesystem are caught and the "code" procedure exits cleanly, returning to the main program. Other errors should be handled in the same manner, returning to the main program wherever possible (a sketch follows this list). Errors such as floating-point divide-by-zero cause all processors to abort, meaning all work is incomplete, not just the task assigned to the processor reporting the error.
Load Balance.
To perform well, each run of "code," with its different data files, should require approximately the same effort. A future article will consider the changes needed to handle tasks of different sizes.
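
As a sketch of the error-handling point above: "code" could take an extra integer argument which it sets nonzero on failure, and the main program could combine these statuses so every processor learns whether any task failed. The names "istat" and "nfail" below are hypothetical:

!! in the main program, in place of the plain call to code:
      integer istat,nfail

      istat=0
      call code(paramfile,readfile,writefile,istat)

!! take the maximum status across all processors; a nonzero
!! result means at least one task failed somewhere.
      call MPI_ALLREDUCE(istat,nfail,1,MPI_INTEGER,MPI_MAX,
     $     MPI_COMM_WORLD,ierr)
      if ( myid .eq. mproc .and. nfail .ne. 0 ) then
        write(6,*) ' WARNING: one or more tasks reported errors '
      endif

!! in subroutine code, the error trap sets the status before
!! returning:
 999  istat=1
      write(6,*) ' error in code!!! '
      return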

One of the greatest advantages of this approach is that it can use any number of processors (assuming the code is written correctly). In the example below, a single command line argument gives the offset at which the parallel tasks are to start. (See Newsletter #91 for more information on dealing with command line arguments.)

For example the user could run 4 tasks as one program as below:


  yukon% mpprun -n 4 ./prog a
    working on  4  items starting at a
   Process  0  of  4  doing a
    processor  0  reads from dparams and dread_a, writes to dwrite_a
   Process  2  of  4  doing c
   Process  1  of  4  doing b
   Process  3  of  4  doing d
    code  reads parameter  23
    processor  2  reads from dparams and dread_c, writes to dwrite_c
    processor  1  reads from dparams and dread_b, writes to dwrite_b
    processor  3  reads from dparams and dread_d, writes to dwrite_d
    readfile is dread_a
    code  reads parameter  23
    code  reads parameter  23
    code  reads parameter  23
    writefile is dwrite_a
    readfile is dread_c
    readfile is dread_b
    readfile is dread_d
    writefile is dwrite_d
    writefile is dwrite_b
    writefile is dwrite_c
   STOP (PE 0)   executed at line 91 in Fortran routine 'MAIN'
   STOP (PE 2)   executed at line 91 in Fortran routine 'MAIN'
   STOP (PE 1)   executed at line 91 in Fortran routine 'MAIN'
   STOP (PE 3)   executed at line 91 in Fortran routine 'MAIN'

In this case, processor 0 works on _a files, processor 1 works on _b files, etc.

Alternatively, the user could run two sets of two tasks:


  yukon% mpprun -n 2 ./prog a
    working on  2  items starting at a
   Process  0  of  2  doing a
    processor  0  reads from dparams and dread_a, writes to dwrite_a
   Process  1  of  2  doing b
    code  reads parameter  23
    processor  1  reads from dparams and dread_b, writes to dwrite_b
    readfile is dread_a
    writefile is dwrite_a
    code  reads parameter  23
    readfile is dread_b
    writefile is dwrite_b
   STOP (PE 1)   executed at line 91 in Fortran routine 'MAIN'
   STOP (PE 0)   executed at line 91 in Fortran routine 'MAIN'


  yukon% mpprun -n 2 ./prog c
    working on  2  items starting at c
   Process  0  of  2  doing c
    processor  0  reads from dparams and dread_c, writes to dwrite_c
   Process  1  of  2  doing d
    code  reads parameter  23
    processor  1  reads from dparams and dread_d, writes to dwrite_d
    readfile is dread_c
    writefile is dwrite_c
    code  reads parameter  23
    readfile is dread_d
    writefile is dwrite_d
   STOP (PE 1)   executed at line 91 in Fortran routine 'MAIN'
   STOP (PE 0)   executed at line 91 in Fortran routine 'MAIN'

Given this level of flexibility, the user can optimize his or her runs based on the current job mix on the T3E.

The user may determine how long it will take to get the necessary resources for a single short run using many PEs, versus a longer run using fewer PEs. On a heavily loaded system, the wait for a large block of PEs can increase the wall-clock time.

The above code provides a very basic example of using parallel processing to work on many separate tasks. There are many possible modifications which could improve performance. For example:

  • The file dparams could be read on one processor and then broadcast. This would reduce filesystem contention as many processors try to read a single file (see "man MPI_BCAST" and the sketch after this list).
  • Access to the multiple files could be improved. The master processor mproc might read a list of the work to be done by each processor rather than relying on the fixed scheme above. Also, results could be written to a single random-access file rather than separate files.
  • Error handling, as already discussed, could be improved by adding an argument to "code" which returns an error code. The controlling processor could use this to warn the user that particular tasks had problems.
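
As a sketch of the first of these suggestions: only the master processor reads dparams, and the value is then broadcast. This assumes the parameter read is moved up into the main program (or that myid and mproc are passed into "code"), since "code" deliberately knows nothing about processor numbers:

!! only the master processor touches the parameter file...
      if ( myid .eq. mproc ) then
        open(ipchan,file=paramfile,status='old',err=999)
        read(ipchan,*) ipdata
        close(ipchan)
      endif

!! ...then one MPI_BCAST replaces numprocs separate reads of
!! the same file.
      call MPI_BCAST(ipdata,1,MPI_INTEGER,mproc,MPI_COMM_WORLD,ierr)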

There are several limitations which are also important in deciding if this approach is productive.

  • Each task must fit within the memory of a single processor on the system.
  • Each task must require a very similar amount of computation. If any one task takes much longer, the other processors will be held idle.

These and other considerations must be weighed against programmer time and expertise, and the exact requirements of the task.

Quick-Tip Q & A


A: {{ You've ftp'd a file to the Cray from your PC.  There is a load of
      ^M characters at the ends of the lines.  How can you get rid 
      of these? }}


  Here's the "vi" solution:
  
      Open the file in "vi".  Execute the command:
      
      :%s/^M//g
      
      Here's an explanation:

      :            (puts you in ex command mode)
      %            (selects all lines in the file)
      s/^M//       (substitutes ^M with the null string--on all selected lines)
      g            (makes all possible substitutions on each selected line.
                    Without the "g", only the first ^M on each line would
                    be removed.)


      To produce the ^M character, enter a CTRL-v followed by a CTRL-m.
      (This technique is also required for the second reader solution, below, if
      you edit the script using vi.)



  We also received two reader responses to this question, both of which
  had to be paraphrased a bit (sorry...) :

      1--
         Remove all original software from the PC and install Linux? :)


      2--
         #!/bin/sh
         ########################################################## 
         # dos2unix - removes all those stupid "^M's" from a DOS 
         # text file copied to Unix.      
         # 
         # Usage: dos2unix <dosfile>
         ##########################################################
         sed 's/^M//g' $1




Q: How can I determine if my T3E job was checkpointed, held, and
   released in the middle of its run?

[ Answers, questions, and tips graciously accepted. ]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.