ARSC T3D Users' Newsletter 76, March 1, 1996

The Leading Dimension in Matrix Operations on the T3D

The LINPACK benchmark for the 100 by 100 case is distributed with the matrices to be factored declared as:


  double precision aa(200,200),a(201,200),b(200),x(200)
The benchmark program solves the same problem 8 times: 4 times with matrix a and then 4 times with matrix aa. You can bet that the vendor reports the best of the 8 timings. Vector machines like the CRI machines probably report the times for matrix a, the one with leading dimension 201, and I would guess that cache-based machines like the DEC Alpha report the times for matrix aa, the one with leading dimension 200.
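
In outline, each set of four timings regenerates the matrix, factors it, and solves. Here is a minimal sketch of that pattern (not the benchmark source itself; dgefa, dgesl, and second are the usual LINPACK routines and timer, everything else is just illustrative):

      double precision aa(200,200), a(201,200), b(200), x(200)
      integer ipvt(200), info, i
      real t0, tbest, second
      external second
      tbest = 1.0e30
      do i = 1, 4
c        ... regenerate a and b here; dgefa overwrites its matrix ...
         t0 = second()
         call dgefa(a, 201, 100, ipvt, info)
         call dgesl(a, 201, 100, ipvt, b, 0)
         tbest = min(tbest, second() - t0)
      enddo
c     ... then four more timings with aa and leading dimension 200 ...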

I thought column-oriented algorithms like LINPACK and matrix multiplication would be independent of such considerations, but Ed Anderson of CRI said I should retry the matrix-vector multiplication timings of the last newsletter:


  > Mike,
  >    A T3D doesn't have banks. Don't declare your arrays with
  > odd leading dimensions; it makes accesses to block rows
  > inefficient. You should get 50+ Mflops from SGEMV if you make
  > LDA a multiple of 4 (the cache line size). This is Example 2
  > in my 1994 paper, "Data layout and its effect on single-
  > processor performance on the CRAY T3D", referenced in your
  > newsletter #18.
(I can e-mail this paper to anyone who requests it; the ARSC ftp server no longer contains such material. If you want any T3D files that have been mentioned in these newsletters as being on the ftp server, please contact me and I will e-mail them to you.)
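
To make Ed's advice concrete, here is a minimal sketch of the padding idea (illustrative only, not taken from any of the timed codes): the leading dimension is rounded up to the next multiple of 4 words, so an array with 1001 rows gets a leading dimension of 1004 instead of 1001.

      program padlda
c     Illustrative sketch: pad the leading dimension up to a multiple
c     of 4 words (the T3D cache line size), per the advice above.
      integer n, lda
      parameter (n = 1001)
      parameter (lda = ((n + 3)/4)*4)
      real a(lda,n), x(n), y(n)
      integer i, j
      do j = 1, n
         x(j) = 1.0
         do i = 1, n
            a(i,j) = 1.0
         enddo
      enddo
c     y = A*x with the libsci SGEMV; a is accessed with lda = 1004.
      call sgemv('n', n, n, 1.0, a, lda, x, 1, 0.0, y, 1)
      write(*,*) 'y(1) =', y(1)
      end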

I reran the timing program and the effect of the leading dimension is there:

Table 1

MFLOPS for matrix-vector multiplication methods on the ARSC T3D (leading dimension 1001, inefficient for cache-based processors)

  row  size sgemm  sgemv  mxma  call   call fsaxpy fsaxpy1 fsaxpy2 fsaxpy3 fsdot
                                saxpy  sdot

   1    0    N/A    0.0    N/A    0.0    0.0    0.0    0.0    0.0    0.0    0.0
   2    1    0.1    0.1    N/A    0.1    0.1    0.5    0.5    0.4    0.3    0.4
   3    2    0.4    0.6    N/A    1.1    0.9    2.2    2.3    1.8    2.0    2.2
   4    3    1.0    1.7    N/A    2.4    1.9    4.3    4.6    3.6    4.1    5.3
   5    4    1.4    2.0    N/A    3.6    3.1    6.7    6.7    5.0    6.8    6.7
   6    5    1.8    3.0    N/A    4.1    4.2    8.6    8.6    6.5    8.2    8.4
   7    6    2.9    4.5    N/A    5.1    5.3   10.0    9.9    7.7   10.4   11.0
   8    7    4.1    6.6    N/A    7.0    6.5   11.0   11.5    8.7   11.9   12.9
   9    8    2.8    4.9    N/A    8.2    6.2   11.4   12.1    9.7   17.7   14.2
  10    9    3.6    5.7    N/A    7.7    7.1   12.6   13.4   10.3   17.6   15.4
  11   10    4.3    6.7    N/A    9.0    8.0   13.0   14.0   11.1   18.1   16.8
  12   16    8.5   14.1    N/A    9.6   12.2   14.5   15.4   12.4   24.5   18.4
  13   20    9.2   16.3    N/A   11.3   11.9   15.3   16.4   13.2   26.8   19.9
  14   30   12.7   23.3    N/A   13.8    8.0   11.9   12.8   10.9   18.5   15.1
  15   32   15.1   29.3    N/A   15.1   13.1   12.1   12.5   10.7   19.1   14.6
  16   40   15.3   30.4    N/A   16.5   13.1   11.4   11.8   10.0   17.6   13.4
  17   50   16.1   31.7    N/A   17.1   12.2   11.1   11.4    9.6   16.5   11.7
  18   60   17.2   34.8    N/A   18.5   12.7   11.3   10.3    9.8   17.2   11.9
  19   63   16.6   32.8    N/A   18.0   12.4   11.3   11.6    9.8   16.8   11.9
  20   64   17.8   36.5    N/A   14.2    5.2   10.2   11.7    9.8   17.4   11.9
  21   65   16.8   34.5    N/A   14.3    5.2   11.3   11.6    9.8   17.2   11.9
  22   70   16.8   25.9    N/A   14.4    5.2   11.4   11.7    9.8   17.3   11.9
  23   80   18.6   37.2    N/A   15.8    5.2   11.5   11.8    9.9   17.8   12.0
  24   90   17.8   34.7    N/A   16.4    5.2   11.3   11.6    9.6   16.8   11.0
  25  100   17.9   35.7    N/A   15.9    5.5   11.3   11.7    9.7   16.1    9.5
  26  128   18.6   38.5    N/A   19.0    5.6   11.4   11.8    9.6   17.4    7.4
  27  200   18.3   38.4    N/A   21.6    5.8   11.3   11.5    9.6   17.3    6.2
  28  256   18.6   39.3    N/A   22.6    5.9   11.2   11.5    9.4   17.4    6.1
  29  300   18.7   38.7    N/A   23.4    5.9   11.0   11.4    9.5   17.2    6.1
  30  400   18.9   39.5    N/A   24.2    6.0   10.8   11.2    9.3   17.0    5.9
  31  500   18.8   39.0    N/A   24.9    6.1   10.6   10.9    9.1   16.9    5.8
  32  512   18.8   39.5    N/A   25.1    6.1   10.6   10.9    9.1   16.9    5.8
  33  600   18.8   39.2    N/A   25.4    6.1   10.5   10.8    9.0   16.7    5.7
  34  700   18.9   39.4    N/A   25.7    6.1   10.2   10.6    8.8   16.6    5.6
  35  800   18.9   39.5    N/A   25.9    6.1   10.1   10.4    8.7   16.4    5.4
  36  900   18.8   39.3    N/A   26.1    6.1    9.9   10.2    8.5   16.3    5.4
  37 1000   18.9   39.3    N/A   26.3    6.1    9.7   10.0    8.4   16.3    5.4

Table 2

MFLOPS for matrix-vector multiplication methods on the ARSC T3D (leading dimension 1000, efficient for cache-based processors)

  row  size sgemm  sgemv  mxma  call   call fsaxpy fsaxpy1 fsaxpy2 fsaxpy3 fsdot
                                saxpy  sdot

   1    0    N/A    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
   2    1    0.1    0.1    N/A    0.1    0.1    0.5    0.5    0.4    0.3    0.5
   3    2    0.4    0.7    N/A    1.0    0.9    2.3    2.3    1.8    1.9    2.3
   4    3    1.1    1.8    N/A    2.1    1.9    4.6    4.2    3.6    4.1    5.5
   5    4    1.4    2.2    N/A    3.5    3.0    6.7    6.7    5.0    6.8    6.9
   6    5    1.8    3.1    N/A    4.1    4.2    8.2    8.2    6.5    8.4    8.8
   7    6    3.0    4.7    N/A    5.0    5.3    9.6   10.1    7.9   10.1   11.5
   8    7    4.1    6.7    N/A    6.6    6.4   10.9   11.5    8.7   11.7   13.0
   9    8    2.9    5.3    N/A    7.8    6.1   11.6   12.0    9.5   16.6   14.8
  10    9    3.7    5.8    N/A    7.5    7.2   12.5   13.1   10.3   17.5   15.4
  11   10    4.4    6.8    N/A    8.5    7.9   13.0   13.8   11.0   18.0   17.0
  12   16    9.0   15.6    N/A   10.2   11.2   14.6   15.2   12.3   24.4   18.5
  13   20    9.7   18.0    N/A   12.2   11.9   15.4   16.2   13.1   26.6   19.6
  14   30   13.7   24.5    N/A   14.0   13.8   13.2   13.8   11.6   20.9   16.6
  15   32   16.9   33.7    N/A   16.0   14.5   13.4   14.1   11.8   22.5   16.9
  16   40   11.7   34.4    N/A   17.2   14.4   12.3   12.7   10.9   20.1   15.2
  17   50   17.9   36.0    N/A   17.5   12.9   11.4   11.7   10.2   17.4   12.4
  18   60   19.6   41.6    N/A   19.4   13.1   11.6   12.0   10.4   18.2   12.7
  19   63   18.8   38.6    N/A   15.5   13.0   11.6   12.0   10.4   17.6   12.6
  20   64   19.9   43.8    N/A   15.0    5.2   11.7   12.0   10.4   18.3   12.7
  21   65   15.7   39.9    N/A   14.6    5.2   11.4   11.7   10.2   17.7   12.5
  22   70   18.4   39.3    N/A   14.8    5.3   11.6   10.8   10.4   17.7   12.6
  23   80   20.8   45.6    N/A   16.7    5.4   11.8   12.1    9.8   18.4   12.8
  24   90   19.5   41.0    N/A   16.9    5.5   11.6   11.9    9.8   17.8   11.5
  25  100   19.6   43.6    N/A   17.9    5.6   11.2   12.1   10.4   18.4   11.6
  26  128   20.6   46.3    N/A   19.0    5.7   11.5   11.6   10.2   17.9   11.0
  27  200   20.2   46.4    N/A   22.6    5.8   11.5   11.7   10.1   17.9    6.7
  28  256   20.4   47.7    N/A   23.4    5.9   11.2   11.6   10.0   17.5    5.9
  29  300   20.5   47.7    N/A   24.0    6.0   11.2   11.5    9.9   17.6    5.9
  30  400   20.6   47.5    N/A   25.2    6.0   10.9   11.2    9.6   17.3    5.9
  31  500   20.4   47.4    N/A   25.8    6.1   10.7   11.0    9.5   17.2    5.9
  32  512   20.5   47.8    N/A   26.0    6.1   10.6   10.9    9.4   16.9    5.9
  33  600   20.6   47.6    N/A   26.4    6.1   10.4   10.7    9.2   16.8    5.9
  34  700   20.5   47.8    N/A   26.5    6.1   10.2   10.5    9.0   16.6    5.9
  35  800   20.5   47.9    N/A   26.9    6.1   10.0   10.2    8.8   16.4    5.9
  36  900   20.4   47.8    N/A   27.0    6.1    9.7   10.0    8.6   16.3    5.9
  37 1000   20.5   47.9    N/A   27.2    6.1    9.5    9.8    8.5   16.3    5.9
For my timings, I probably didn't get to the 50 MFLOPS Ed mentions because we haven't yet moved up to the 2.0 Programming Environment (PE).

Another Cache-Related Example

I recently worked on an optimization effort where the following loop nest dominated the execution time:

  do j = 1, bkgd_j(band)
     do i = 1, bkgd_i(band)
        if(s(band,i,j) .gt. 0.0)then
           do l = -s2,s2
              if ((j+l .ge. 1) .and. (j+l .le. bkgd_j(band))) then
                 do k = -s2, s2
                    if ((i+k .ge. 1) .and. (i+k .le. bkgd_i(band))) then
                       img(i+k,j+l) = img(i+k,j+l) + s(band,i,j) * r(band,k,l)
                    endif
                  enddo
              endif
           enddo    
        endif         
     enddo
  enddo
For this particular loop nest, the innermost IF test can be folded into the bounds of the k loop:

  do j = 1, bkgd_j(band)
     do i = 1, bkgd_i(band)
        if(s(band,i,j) .gt. 0.0)then
           do l = -s2,s2
              if ((j+l .ge. 1) .and. (j+l .le. bkgd_j(band))) then
                 k0 = max0( -s2, 1-i )
                 k1 = min0( s2, bkgd_i(band)-i )
                 do k = k0, k1
                       img(i+k,j+l) = img(i+k,j+l) + s(band,i,j) * r(band,k,l)
                  enddo
              endif
           enddo   
        endif       
     enddo
  enddo
This modification produced a 2X speedup on some problems.

Because most of the time is concentrated in this loop nest, it is worthwhile to look at its innermost loop more carefully:


  k0 = max0( -s2, 1-i )
  k1 = min0( s2, bkgd_i(band)-i )
  do k = k0, k1
    img(i+k,j+l)=img(i+k,j+l)+s(band,i,j)*r(band,k,l)
  enddo
The variable s(band,i,j) is loop invariant, so the compiler loads its value into a register once and it doesn't change for the life of the loop. The array img(i+k,j+l) is accessed in column order, with stride one, and so makes good use of cache. The problem is the array r(band,k,l), which is not accessed in a cache-optimal way. The value of band is also invariant over the loop nest, and the leading dimension of r is 5 for this program, so as k increases within the loop only every fifth element of the array is used and four-fifths of the cache bandwidth spent on r is wasted.

For this program it was possible to copy the slice of the array r for a fixed value of band to another two-dimensional array, temp_r, outside of this loop nest. With this done, both arrays in the inner loop are accessed contiguously. (The loop even becomes an instance of a call to the libsci routine saxpy.) In this particular case, even for relatively small loop lengths, these two changes produced a 3X speedup. The inner loop now looks like:


  k0 = max0( -s2, 1-i )
  k1 = min0( s2, bkgd_i(band)-i )
  do k = k0, k1
    img(i+k,j+l)=img(i+k,j+l)+s(band,i,j)*temp_r(k,l)
  enddo
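The copy that builds temp_r is done once per band, outside the loop nest, and costs little. Here is a minimal sketch, assuming r and temp_r are dimensioned r(5,-s2:s2,-s2:s2) and temp_r(-s2:s2,-s2:s2) (the actual declarations are not shown above):

c     Copy the band-th slice of r so the inner loop reads it with
c     stride one instead of stride 5.
      do l = -s2, s2
         do k = -s2, s2
            temp_r(k,l) = r(band,k,l)
         enddo
      enddo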
This modification is a nice example of trading space for speed.
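
And since temp_r(k0:k1,l) and img(i+k0:i+k1,j+l) are both contiguous, the inner loop really is a saxpy. One way to write that call (a sketch, not necessarily what the production code ended up using):

c     img(i+k0:i+k1, j+l) gets s(band,i,j) * temp_r(k0:k1, l),
c     both with stride one; saxpy itself handles k1 .lt. k0.
      call saxpy(k1-k0+1, s(band,i,j), temp_r(k0,l), 1,
     &           img(i+k0,j+l), 1)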

License Renewal for pghpf (Portland Group High Performance Fortran)

The license for ARSC's evaluation copy of the Portland Group's High Performance Fortran compiler has expired. The Portland Group has graciously given us another extension, and the new license will be installed soon.

MPI Keeps on Growing


  > Parallel Programming with MPI
  > March 5 & 6, 1996 at OSC
  >
  > The Ohio Supercomputer Center (OSC) is offering a two-day
  > course on using the Message Passing Interface (MPI) standard
  > to write parallel programs on several of the OSC MPP systems.
  > For more information on MPI, see
  > http://www.osc.edu/Lam.html#MPI on the WWW.
  >
  > MPI topics to be covered include a variety of processor-to-
  > processor communication routines, collective operations
  > performed by groups of processors, defining and using high-
  > level processor connection topologies, and user-specified
  > derived data types for message creation.
  >
  > The MPI workshop will be a combination of lectures and
  > hands-on lab session in which the participants will write
  > and execute sample MPI programs.
  >
  > Interested parties should contact Aline Davis at
  > aline@osc.edu or (614) 292-9248. Due to the hands-on nature
  > of the workshop, REGISTRATION IS LIMITED TO 20 STUDENTS.
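
For anyone who wants a first taste before the workshop, here is a minimal MPI program using the Fortran bindings (a sketch only; it exercises just one of the collective operations mentioned in the list of topics above):

      program mpisum
c     Minimal MPI sketch: every PE contributes its rank and the
c     collective MPI_ALLREDUCE hands the sum back to all of them.
      include 'mpif.h'
      integer ierr, rank, nprocs, rsum
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      call MPI_ALLREDUCE(rank, rsum, 1, MPI_INTEGER, MPI_SUM,
     &                   MPI_COMM_WORLD, ierr)
      write(*,*) 'PE ', rank, ' of ', nprocs, ' sum of ranks = ', rsum
      call MPI_FINALIZE(ierr)
      end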

T3D Homepages

For a new user it is hard to get started on the T3D. Probably the best introduction is one of the CRI training classes, but it's not always possible to attend. The Web has a lot of content but is not as well organized as, say, class material. I think a good way to get started is to visit a few web pages and search for the word t3d; there are several that I like. Or you could read through our T3D Newsletters. There is content out there if you just look!
Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.