ARSC T3E Users' Newsletter 141, April 17, 1998

Storage Upgrade for Yukon Users

As announced in newsletter #139, ARSC is upgrading its storage environment. The upgrade will occur in phases, with the first taking place in a couple of weeks, concurrent with the cut-over from denali to chilkoot. Watch news items for the exact date.

Following the cut-over:

  • CRL will be accessed from chilkoot rather than denali.
  • Two new file systems will be cross-mounted between yukon, chilkoot, and the ARSC network of SGI visualization hosts. They are:

    /viztmp:  14 GB;
              under DMF control;
              10 GB quota (per user) for all files (migrated and on-line);
              no backups;
              files over 10 days old purged;
              exported to yukon and ARSC SGIs.

    /allsys:  37 GB;
              under DMF control;
              10 GB quota (per user) for on-line (unmigrated) files;
              no quota for migrated files;
              nightly backups;
              exported to yukon and ARSC SGIs.

For more information visit our transition pages, at:

http://www.arsc.edu/pubs/bulletins/Transition.html

MPI Collective Communications

MPI provides a set of functions that perform collective communications within groups of processors. These include:

  • Barrier synchronization.
  • Global communication functions such as broadcast and gather/scatter of data across processor sets.
  • Global reduction operations, which combine data from all processors with an operation such as a sum or maximum.

It is sometimes asked why all these routines are needed when everything can be done with send and recv. Perhaps the simplest answer is that these frequently used data-movement operations can be tailored to a particular system's communication topology, memory, or other hardware features, and, like other widely used libraries, they are tested to ensure correct behavior. The user gets both correct code and higher performance.

For example, consider a simple broadcast operation. In the code below the broadcast is hand-coded with a serial loop of sends matched by a recv on each receiving processor, and this is compared with a call to MPI_BCAST, which performs the same data transfer.


      program prog1

      implicit none

      include 'mpif.h'

! information for local and global conditions.
      integer mype
      integer totpes
      integer mproc

! general mpi error flag
      integer ierr
      integer istatus(MPI_STATUS_SIZE)

! dataspaces
      integer, allocatable, dimension(:) :: ia,ib
      integer isize

      integer i,is,ip
      integer ndata,itag,mrproc

!timing
      integer isrtc,ifrtc,isbtc,ifbtc

! setup mpi
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, mype, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, totpes, ierr)
      write(6,*) ' mype is ',mype,' of total ',totpes

      call MPI_BARRIER(MPI_COMM_WORLD,ierr)

      mproc=0

      isize=1
      do i=1,20

! setup data for testing
      allocate (ia(isize))
      allocate (ib(isize))

      do is=1,isize
        ia(is)=i
        ib(is)=i
      enddo

! broadcast using send/recv

      call MPI_BARRIER(MPI_COMM_WORLD,ierr)
      isrtc=irtc()


      if(mype.eq.mproc) then
        ndata=isize
        itag=i
        do ip=0,totpes-1
          if(ip.ne.mproc) then
            call MPI_SEND(ia,ndata,MPI_INTEGER,ip,itag,
     $                         MPI_COMM_WORLD,ierr)
          endif
        enddo

      else

        ndata=isize
        itag=i
        mrproc=mproc
        call MPI_RECV(ia,ndata,MPI_INTEGER,mrproc,itag,
     $                MPI_COMM_WORLD,istatus,ierr)

      endif

      call MPI_BARRIER(MPI_COMM_WORLD,ierr)
      ifrtc=irtc()



! broadcast using MPI_BCAST
      call MPI_BARRIER(MPI_COMM_WORLD,ierr)
      isbtc=irtc()

      call MPI_BCAST(ib,isize,MPI_INTEGER,mproc,
     $      MPI_COMM_WORLD,ierr)

      call MPI_BARRIER(MPI_COMM_WORLD,ierr)
      ifbtc=irtc()



! report timing

      if(mype.eq.mproc) then
        write(6,*) ' send/recv ',isize,' took ',
     $       ifrtc-isrtc,ifbtc-isbtc,ia(isize),ib(isize)
        write(6,*)
     $       totpes,isize,ifrtc-isrtc,ifbtc-isbtc
      endif

      write(10+mype,*) ia(isize),ib(isize)

!increase size
      isize=isize*2
      deallocate(ia)
      deallocate(ib)
      enddo


      call MPI_FINALIZE(ierr)

      end

Note that, for both versions, the processor that hosts the data is identified as mproc, and the data is distributed to arrays of the same name on all the processors. The routine MPI_BCAST must be called on all processors: if one of the processors in the communicator MPI_COMM_WORLD does not call MPI_BCAST, the code will deadlock.
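
For instance, a fragment like the following (a sketch only, reusing
mype, mproc, ib and isize from the program above) hangs, because the
non-root processors never enter the collective call:

! INCORRECT: only the root enters the collective, so the other
! processors in MPI_COMM_WORLD wait forever.
      if(mype.eq.mproc) then
        call MPI_BCAST(ib,isize,MPI_INTEGER,mproc,
     $       MPI_COMM_WORLD,ierr)
      endif

! CORRECT: every processor in the communicator makes the call.
      call MPI_BCAST(ib,isize,MPI_INTEGER,mproc,
     $     MPI_COMM_WORLD,ierr)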

A single call to MPI_BCAST does what took tens of lines of user code with MPI_SEND/MPI_RECV.

Note also that the hand-coded broadcast has the master send the data to each slave in turn. The master's linear progression through the other PEs is clearly visible in the VAMPIR graph below.

[Figure: VAMPIR timeline of the hand-coded send/recv broadcast]

(For details on VAMPIR see newsletter 129: /arsc/support/news/t3enews/t3enews129/index.xml .)

A simple optimization here is a store-and-forward scheme in which each slave, once it holds the data, forwards it to other slaves, reducing the reliance on a single processor (see the sketch below). However, the user would then have to reconsider the algorithm with each change of architecture, or even for different data sizes. (Even a simple store-and-forward algorithm runs to nearly 100 lines of code.)
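
For reference, here is a stripped-down sketch of such a scheme: a
binomial-tree broadcast rooted at processor 0. It reuses ia, isize,
mype, totpes, itag and istatus from the program above; imask, iparent
and ichild would be added to the declarations. A robust version, with
an arbitrary root and error handling, is considerably longer.

! binomial-tree broadcast of ia(1:isize) from processor 0:
! each processor first receives the data from its parent,
! then forwards it to its children.
      imask=1

! receive phase (processor 0 never receives; for everyone else
! the lowest set bit of the rank identifies the parent)
      do while (imask.lt.totpes)
        if(iand(mype,imask).ne.0) then
          iparent=mype-imask
          call MPI_RECV(ia,isize,MPI_INTEGER,iparent,itag,
     $                  MPI_COMM_WORLD,istatus,ierr)
          exit
        endif
        imask=ishft(imask,1)
      enddo

! forward phase: pass the data on down the tree
      imask=ishft(imask,-1)
      do while (imask.gt.0)
        ichild=mype+imask
        if(ichild.lt.totpes) then
          call MPI_SEND(ia,isize,MPI_INTEGER,ichild,itag,
     $                  MPI_COMM_WORLD,ierr)
        endif
        imask=ishft(imask,-1)
      enddo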

By using MPI_BCAST we hope that the supplier of the MPI library has done this work for us! In fact, Cray's MPI_BCAST uses an unbalanced tree algorithm to broadcast the message when the group is MPI_COMM_WORLD.

Below are some data transfer times, in real-time clock ticks as returned by irtc, comparing the simple hand-coded send/recv with MPI_BCAST, where:


  N:      Number of integers broadcast.
  s/r:    Indicates send/recv
  bcast:  Indicates MPI_BCAST


       
                  4 PEs                   16 PEs                   32 PEs
  N          s/r        bcast       s/r         bcast        s/r         bcast
  -------    ---------  ---------   ----------  ----------   ----------  ----------
  1          19157      14170       100203      24761        217539      31113
  32         211910     91972       843725      166380       1721043     210290
  1024       614992     282801      2446954     539433       4793081     677011
  524288     78409833   55263055    375497958   110121727    772674099   137610295

A graph of the full data set, which shows an interesting stair-step pattern (discussed below), appears here:

[Figure: MPI_BCAST vs. send/recv broadcast times]

For both the serial send/recv and MPI_BCAST, the time taken increases with both the number of processors and the number of data items broadcast. For the simple serial send the growth is linear in the number of processors; for MPI_BCAST the stepped structure comes from the tree algorithm. The data is sent first to a few processors, which then send it on to other processors, and this repeats until all processors have received the data. The time taken is determined by the number of levels in the tree, and the steps in the performance curve occur when the addition of one processor causes the next level to be occupied. (With a binary tree, for instance, the number of levels grows as log2 of the number of processors: 2 for 4 PEs, 4 for 16, and 5 for 32.) As more processors are involved, the width of the tree at the base level increases.

In conclusion, MPI_BCAST is faster than a simple sequential send/recv, and its performance advantage grows with the number of processors involved.

!! Skipping Next Issue of Newsletter !!

Next issue: May 15, 1998. We're very busy bringing chilkoot on-line!

Quick-Tip Q & A


A: {{ When I first learned to program, I used "bubble-sort".  Now I use
      "quick-sort".  Is there any faster-yet sorting algorithm? }}

   It depends on your data.  Sorting is a big subject, but assuming
   your data is stored in arrays, here are some ideas.

   Given general data and no advance knowledge of how nearly sorted it
   is, stick with quicksort.  It sorts in O(N*log(N)) time on average
   (where N is the number of elements to sort).

   If you're *sure* that the data is only out of place by a few
   elements, insertion sort will beat quicksort.  This might happen if
   you added a few elements, re-sorted, added a few more, etc.  But be
   careful: in the general case, it's an O(N^2) sort.
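
   For nearly-sorted data, a minimal in-place insertion sort (the
   routine and variable names here are only illustrative) is just a
   few lines of Fortran:

      subroutine insertion_sort(idata,nelem)
! sort idata(1:nelem) in place; close to O(N) when the array
! is already nearly sorted, O(N**2) in the worst case.
      implicit none
      integer nelem
      integer idata(nelem)
      integer i,j,ikey

      do i=2,nelem
        ikey=idata(i)
        j=i-1
! shift larger elements right until ikey's slot is found
        do while (j.ge.1)
          if(idata(j).le.ikey) exit
          idata(j+1)=idata(j)
          j=j-1
        enddo
        idata(j+1)=ikey
      enddo
      end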

   If your data is of an ordinal type (integer or character, for
   instance) and has a sufficiently limited range (integers between 0
   and 1,000,000, for instance), try the bucket sort.  It works in O(N)
   time.
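
   One simple variant, a counting sort, might be sketched roughly as
   follows (again, illustrative names only; the values are assumed to
   lie in 0..ivmax and the count array must fit in memory):

      subroutine count_sort(idata,nelem,ivmax)
! sort idata(1:nelem), whose values are known to lie in
! 0..ivmax, in O(nelem+ivmax) time with O(ivmax) extra space.
      implicit none
      integer nelem,ivmax
      integer idata(nelem)
      integer, allocatable, dimension(:) :: icount
      integer i,iv,ipos

      allocate (icount(0:ivmax))
      icount=0

! tally how many times each value occurs
      do i=1,nelem
        icount(idata(i))=icount(idata(i))+1
      enddo

! write the values back out in increasing order
      ipos=1
      do iv=0,ivmax
        do i=1,icount(iv)
          idata(ipos)=iv
          ipos=ipos+1
        enddo
      enddo

      deallocate(icount)
      end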

                         

                         
                                             
Q: Can "vi" insert a "
" 
character into the same column over a specific
   range of rows?        

                         

   Even if some of the ro
ws are empty?  
                         

   Or short?             

                         

   Say you wanted to crea
te a crude table (see the last article) or
   delimit a specific col
umn number (column #33, as shown here, for
   instance).            

                         

                         

[ Answers, questions, and tips graciously accepted. ]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.