Asymptotic Benchmark Results at ARSC

I have a few results that I wanted to share before I leave ARSC. As I've pointed out in the ARSC training courses, there is a big difference between the floating point performance of RISC processors and the performance of the memory systems that supply operands to these processors. The T3D processor is one such RISC processor, and if the operands are not in cache, the memory cannot keep up with the processor's floating point units. The Y-MP line of computers, on the other hand, has a substantial memory system supporting its fast floating point units.

At the end of this article, I show the results of the STREAM benchmark and a SPEED benchmark in the same table. The results are a combination of the timings I've done on machines available at ARSC and timings that are publicly available.

Tom Parker's SPEED Benchmark

From Tom Parker of Consulting Services at the National Center for Atmospheric Research in Boulder, Colorado, I received a small Fortran program that gets a very high MFLOPS rate for both vector processors and RISC processors. The program is short and is given below:
```
ccccccccccccccccccccccccccccccccccccccccccccccccc
c Get MFLOP rate of CRAY.
c
c 20SEP95 Tom Parker, SCD Consulting Office.
ccccccccccccccccccccccccccccccccccccccccccccccccc
      parameter ( n = 1 000 000 )
      double precision x(n),z(n),a(0:10)
      double precision second, t0, t
      do i = 0, 10
        a( i ) = i + .1       ! initialize polynomial coefficients
      enddo
      do i = 1, n
        x( i ) = i + .1       ! initialize values to evaluate polynomial
      enddo
      t0 = second()
      do i = 1,n
        z(i)=(((((((((a(0) *   ! evaluate polynomial using
     &      x(i)+a( 1)) *       ! Horner's method
     &      x(i)+a( 2)) *
     &      x(i)+a( 3)) *
     &      x(i)+a( 4)) *
     &      x(i)+a( 5)) *
     &      x(i)+a( 6)) *
     &      x(i)+a( 7)) *
     &      x(i)+a( 8)) *
     &      x(i)+a( 9)) *
     &      x(i)+a(10)
      enddo
      t=second()-t0
      write ( 6, 600  ) ( 20.0 * n ) / ( t * 1000000.0 ) ! mflop/s
 600  format( f10.3 )
      call dummy( z )         ! Fool the compiler into doing the work !
      end
      subroutine dummy( z )
      end
```
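The factor of 20.0 in the write statement follows from the shape of the loop body: Horner's method on a degree-10 polynomial takes ten multiplies and ten adds per element. The short Python sketch below (not part of Tom Parker's program; the function names `horner` and `mflops` are made up for illustration, and the 0.1 second timing is invented) counts the operations and repeats the MFLOPS arithmetic.

```python
def horner(coeffs, x):
    """Evaluate a polynomial by Horner's method.

    coeffs[0] is the leading coefficient, as a(0) is in the Fortran loop.
    Returns (value, flops), counting one multiply and one add per step.
    """
    value = coeffs[0]
    flops = 0
    for c in coeffs[1:]:
        value = value * x + c
        flops += 2
    return value, flops


def mflops(flops_per_iter, n, seconds):
    """Same arithmetic as the program's write statement."""
    return (flops_per_iter * n) / (seconds * 1_000_000.0)


a = [i + 0.1 for i in range(11)]      # a(0:10), initialized as in the program
value, flops = horner(a, 1.1)         # x(1) = 1.1
print(flops)                          # flops per loop iteration: 20
print(mflops(flops, 1_000_000, 0.1))  # 0.1 s is an invented timing, not a measurement
```

Because the operation count is fixed by the source, the reported rate depends only on the elapsed time returned by second().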
The purpose of the program is to attain near peak performance in Fortran and still do something useful. I used to think that the vendor's implementation of the 1000x1000 LINPACK case was the closest you could get to peak performance, but those sources are probably not publicly available, certainly not simple, and may not be in Fortran.

Parker's program should run well on both vector processors and RISC processors. Of course, it vectorizes on the CRAY Y-MP computers and there is little memory activity, just a vector load at the beginning and a vector store at the end. There are many overlapping or chained vector multiplies and adds in the body of the loop. The compiler can schedule the scalar loads of the invariants a(0:10) without conflict into the 8 scalar registers. On all Cray computers, the -dp switch is used to ensure that double precision is implemented with 64 bits, not 128 bits. Also on all Cray computers, no optimization flags are used, since optimization is on by default.

On the RISC processors, the one stream of inputs is accessed in a cache-friendly manner and the invariants a(0:10) can reside in the 32 registers for the life of the loop. As on all RISCs, double precision is 64 bits by default. I'm willing to experiment to find the best optimization switch -O?, but nobody except the vendor has time to test all possible compiler switches to find the optimal combination. On both the SGIs and the Crays, the compiler will optimize out the entire loop unless the call to the dummy routine is used to trick the compiler into thinking the computed results will be used. Similarly, the coefficients and the evaluation points must be values that the compiler will not special-case.

Both Tom Parker and I are looking to get more timings with this source. We are particularly interested in C90 and T90 single processor results. RISC processors, on this program, don't get as close to their peak performance as do the vector processors. If there's something I've missed for RISC processors I'd like to hear about it.
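To put numbers on how close each machine gets to peak, the sketch below divides the measured rate by the theoretical rate for the measured/theoretical pairs listed in the table at the end of this article. It is only an illustration in Python; the dictionary name and the `percent_of_peak` helper are made up here, and the SGI entries are left out because their theoretical peaks are listed as unknown (??).

```python
# (measured, theoretical) Mflops pairs copied from the table in this article.
parker_results = {
    "T3D (1 PE)": (22, 150),
    "T3D (1 PE, -Drdahead=on)": (24, 150),
    "Y-MP M98": (282, 333),
    "Cray Y-MP": (305, 333),
    "Cray J932": (183, 200),
    "Cray EL-98": (62, 66),
}


def percent_of_peak(measured, theoretical):
    """Measured rate as a percentage of the theoretical peak."""
    return 100.0 * measured / theoretical


for machine, (measured, peak) in parker_results.items():
    print(f"{machine:26s} {percent_of_peak(measured, peak):5.1f}% of peak")
```

On these figures the vector machines land between roughly 85% and 94% of peak, while a single T3D PE reaches about 15%.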

The STREAM Benchmark

Another publicly available benchmark is the STREAM benchmark by John McCalpin of the University of Delaware, Graduate College of Marine Studies, Ocean Modeling Research Group. There is a great description of the benchmark and results on many machines at:
```
http://perelandra.cms.udel.edu/hpc/stream/
```
A brief description of the benchmark is:
```
> The STREAM benchmark is a simple synthetic benchmark program that measures
> sustainable memory bandwidth (in MB/s) and the corresponding computation rate
> for simple vector kernels.
```
The DO loops timed are:
```
>  --------------------------------------------------------
>  name      kernel                bytes/iter    FLOPS/iter
>  --------------------------------------------------------
>  COPY:     a(i) = b(i)               16            0
>  SCALE:    a(i) = q*b(i)             16            1
>  SUM:      a(i) = b(i) + c(i)        24            1
>  TRIAD:    a(i) = b(i) + q*c(i)      24            2
>  --------------------------------------------------------
```
From the source available at the above web site, the length of these DO loops (for loops in the C source) is 2,000,000 iterations. In this respect the benchmark is similar to Tom Parker's program: only asymptotic behavior is measured. But unlike the polynomial evaluation program, these loops have large, varying memory requirements. The number of iterations has been chosen so that the loop operands cannot all reside in cache, and therefore the loops reflect memory system performance for large problems. John McCalpin does a much better job of explaining and justifying the benchmark in the papers available on the above web page. For large programs that are not cache contained, these results are important.
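The bytes/iter and FLOPS/iter columns are enough to work out how much memory each loop touches and how a loop time turns into the reported rates. The Python sketch below is a rough illustration, not code from the STREAM distribution. It assumes 8-byte operands (which the 16 bytes/iter for COPY implies), takes 1 MB as 10^6 bytes, and reads the number of arrays per kernel off the kernel expressions; the function names and the 0.5 second timing are invented.

```python
BYTES_PER_WORD = 8  # assumption: 64-bit operands

# name: (bytes moved per iteration, flops per iteration, arrays touched)
KERNELS = {
    "COPY":  (16, 0, 2),
    "SCALE": (16, 1, 2),
    "SUM":   (24, 1, 3),
    "TRIAD": (24, 2, 3),
}


def working_set_mb(kernel, n):
    """Memory touched by one pass of the loop, in MB (10**6 bytes assumed)."""
    arrays = KERNELS[kernel][2]
    return arrays * n * BYTES_PER_WORD / 1.0e6


def rates(kernel, n, seconds):
    """Return (MB/s, MFLOPS) for a loop of n iterations taking `seconds`."""
    bytes_per_iter, flops_per_iter, _ = KERNELS[kernel]
    mb_per_s = bytes_per_iter * n / seconds / 1.0e6
    mflops = flops_per_iter * n / seconds / 1.0e6
    return mb_per_s, mflops


def mflops_from_bandwidth(kernel, mb_per_s):
    """Computation rate implied by a reported bandwidth for this kernel."""
    bytes_per_iter, flops_per_iter, _ = KERNELS[kernel]
    return mb_per_s * flops_per_iter / bytes_per_iter


n = 2_000_000
print(working_set_mb("COPY", n), working_set_mb("TRIAD", n))
print(rates("TRIAD", n, 0.5))                # 0.5 s is an invented timing
print(mflops_from_bandwidth("TRIAD", 108.4)) # ARSC T3D (1 PE) TRIAD figure
```

Under these assumptions the loops touch 32 MB (COPY, SCALE) or 48 MB (SUM, TRIAD) per pass, and a TRIAD bandwidth of 108.4 MB/s corresponds to about 9 MFLOPS.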

The table below is a mixture of results:

```
#  = timing from Tom Parker
All ARSC timings were measured by Mike Ess.
All other results are from John McCalpin's STREAM web page.

                            <---------Bandwidth (MB/s)---------->     Mflops
Machine            NCPUs       Copy     Scale       Sum     Triad   T. Parker's
                                                                    measured/
                                                                    theoretical
---------------    -----   --------  --------  --------  --------    --------
Machines at ARSC
Onyx L R4400(150Mhz)onyx2      59.3      57.1      57.1      61.5       38/??
Onyx Reality Engine video1     59.3      57.1      57.1      61.5       38/??
Indy R4600PC(100Mhz)amstel     61.5      57.1      53.3      60.0       27/??
Indy R4600PC(133Mhz)kvasir     45.7      42.1      42.9      50.0       26/??
Indy R4400(100Mhz)guinness     48.5      44.4      44.4      42.1  (C, no f77)
T3D(1PE)                      173.2     132.6     122.6     108.4      22/150
T3D(1PE,-Drdahead=on)         231.4     181.3     122.6     108.4      24/150
Y-MP M98                     1775.0    1801.9    1929.7    1905.2     282/333

Cray machines
Cray_T90               1    11341.5   10717.5   14783.6   13920.0
Cray_C90              16   105497.4  104656.4  101736.1  103812.8
Cray_C90               8    55071.9   55391.8   60843.3   63229.6
Cray_C90               4    27610.3   27789.6   34633.3   35044.1
Cray_C90               2    13866.0   13905.5   18233.2   18246.3
Cray_C90               1     6965.4    6965.4    9378.7    9500.7
Cray_Y/MP              8    19291.6   19294.2   26588.9   26802.2
Cray_Y/MP              4     9685.8    9678.9   13781.4   13851.2
Cray_Y/MP              1     2426.4    2426.2    3454.4    3396.9     305/333#
Cray_J932             32    19007.0   18944.1   19993.9   18870.4
Cray_J932             16    16298.2   15851.5   15657.6   14995.9
Cray_J932              8     9995.2    9726.8    9087.4    8941.3
Cray_J932              4     5255.3    5094.9    4688.3    4657.6
Cray_J932              2     2842.2    2766.3    2493.7    2527.6
Cray_J932              1     1433.6    1408.6    1260.8    1270.0     183/200#
Cray_EL-98             8     2362.8    2310.5    2373.7    2363.8
Cray_EL-98             4     1564.9    1569.8    1933.8    1955.5
Cray_EL-98             2      826.7     833.8    1049.0    1078.2
Cray_EL-98             1      437.2     436.7     536.2     476.8      62/66#
Cray_T3D_(assembly)  512   169677.7  166578.1  114976.8  112126.4
Cray_T3D_(assembly)  256    98303.7   84229.5   57622.6   56078.7
Cray_T3D_(assembly)  128    49132.9   42113.9   28811.1   28032.5
Cray_T3D_(assembly)   64    24577.9   21061.6   14405.8   14020.2
Cray_T3D_(assembly)   32    12288.6   10530.7    7204.2    7010.7
Cray_T3D_(assembly)    1      384.5     329.4     225.4     220.1
Cray_T3D_(Fortran)   512   161479.7  168193.5   91775.6   95304.2
Cray_T3D_(Fortran)   256    98316.2   84241.5   47824.1   45248.1
Cray_T3D_(Fortran)   128    49156.4   42128.2   23912.0   22625.2
Cray_T3D_(Fortran)    64    24580.7   21064.7   11955.7   11312.9
Cray_T3D_(Fortran)    32    12290.3   10533.4    5978.3    5656.4
Cray_T3D_(Fortran)     1      384.2     329.2     187.0     176.8
Cray_CS6400           32      824.1     819.6     885.0     882.6
Cray_CS6400           24      761.9     753.7     775.5     774.5
Cray_CS6400           16      611.5     601.0     596.0     594.6
Cray_CS6400            8      347.9     343.4     341.4     342.6
Cray_CS6400            4      188.9     184.3     188.4     188.8
Cray_CS6400            1       51.1      49.9      50.0      50.2

Other machines of interest
DEC_3000/300           1       33.4      33.5      39.6      38.9
IBM_RS6000-990         1      663.4     533.4     714.5     713.8
Intel_Pentium/133      1       84.4      77.1      85.7      85.9
```
Notes from the table:
1. The SGI workstations at ARSC vary in floating point performance, but the STREAM benchmark seems to point to a common underlying memory system.
2. I believe that my results on the T3D differ from those submitted by CRI because CRI modified the storage of the arrays to minimize cache and page conflicts. Vendors that supply results are allowed to change the source and experiment with compiler flags. The STREAM web page has the email correspondence of vendors submitting their results, but not all of them submit their modified source.
3. Both the STREAM benchmark and the SPEED benchmark show that the ARSC M98, with its DRAM memory, takes a performance hit relative to the usual Y-MP.
4. The contrast between the Cray_T3D and Cray_CS6400 is the classic problem of shared memory versus distributed memory. There are many more machines listed on the STREAM web page. (The DEC_3000/300 is the DEC workstation closest to a single T3D PE.)
5. The T3D timings are exceptional, but the STREAM benchmark almost measures performance as:
```
MPP performance = (performance of one PE) * (number of PEs)
```
which is almost never the case with a real application.
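One way to see how closely a machine follows that formula is to divide the measured multi-CPU bandwidth by the single-CPU bandwidth times the number of CPUs. The Python sketch below does this for a few Copy results taken from the table above; the `scaling_efficiency` name and the choice of rows are only for illustration.

```python
# machine: (Copy MB/s on 1 CPU, NCPUs, Copy MB/s on NCPUs), from the table above
copy_results = {
    "Cray_T3D_(Fortran)": (384.2, 256, 98316.2),
    "Cray_C90": (6965.4, 16, 105497.4),
    "Cray_J932": (1433.6, 32, 19007.0),
    "Cray_CS6400": (51.1, 32, 824.1),
}


def scaling_efficiency(one_cpu, ncpus, measured):
    """Measured bandwidth divided by (one-CPU bandwidth * number of CPUs)."""
    return measured / (one_cpu * ncpus)


for machine, (one_cpu, ncpus, measured) in copy_results.items():
    eff = scaling_efficiency(one_cpu, ncpus, measured)
    print(f"{machine:20s} {ncpus:4d} CPUs  efficiency {eff:5.2f}")
```

On these figures the T3D at 256 PEs comes out very close to 1.0, the 16-CPU C90 near 0.95, and the 32-CPU J932 and CS6400 between 0.4 and 0.5, which is the shared memory versus distributed memory contrast from note 4.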
