## ARSC HPC Users' Newsletter 235, December 14, 2001

### QUIZ: Vectorization

Here's a simple program that implements trapezoidal integration. Can you get the loop to vectorize on the SV1ex, and improve its performance?

```
CHILKOOT$ cat trap.serial.f
!******************************************************
! trap.serial.f -- calculate definite integral using trapezoidal rule.
!
! The function f(x) is hardwired.
! Input: From file 'trap.input':  a, b, n
! Output: estimate of integral from a to b of f(x)
!    using n trapezoids.
!******************************************************

PROGRAM serial
  implicit none
  real :: integral      ! accumulates sum of trapezoids
  real :: a             ! lower value of interval
  real :: b             ! upper value of interval
  integer :: n          ! number of trapezoids
  real :: h             ! width of trapezoid
  real :: side(0:1)     ! sides of trapezoid
  integer :: i          ! which trapezoid computing

  real :: f             ! real valued function integrating

  open (unit=44, file='trap.input', status='old')
  read (44, '(2f8.5,i12)') a,  b,  n
  close (44)

  h = (b-a)/n

  integral = 0

  ! Left side of first trapezoid
  side(1) = f(a+0*h)
  do i = 0 , n-1

     ! Right side of current trapezoid. Left side of next.
     side(MOD(i,2)) = f(a+(i+1)*h)

     integral = integral + h*(side(0) + side(1)) / 2.0
  enddo

  print *,'With n =', n,' trapezoids, our estimate'
  print *,'of the integral from ', a, ' to ',b, ' = ' , integral
end

!******************************************************
real function f(x)
  real x

  f = 4.0 / (x**2  + 1)

  return
end
!******************************************************
```

When compiled for loopmark listing, as follows:

```
f90 -O3 -rm -o trap.serial trap.serial.f
```

we can see in the listing file that the loop is marked with "1's" and not "V's", which tells us it's not vectorized. From the listing file:

```
32.  1--<       do i = 0 , n-1
33.  1
34.  1              ! Right side of current trapezoid. Left side of next.
35.  1              side(MOD(i,2)) = f(a+(i+1)*h)
36.  1
37.  1              integral = integral + h*(side(0) + side(1)) / 2.0
38.  1-->       enddo
```
The run takes a long time, and it only gets 24 MFLOPS. Here's output including hpm statistics:
```
CHILKOOT$ hpm ./trap.serial
With n = 50000000  trapezoids, our estimate
of the integral from  0.E+0  to  1.  =  3.141592457548199

Group 0:  CPU seconds   :   20.90579      CP executing     :  10452894570

Million inst/sec (MIPS) :   193.76      Instructions     :   4050623322
Avg. clock periods/inst :     2.58
% CP holding issue      :    45.93      CP holding issue :   4801389871
Inst.buffer fetches/sec :     0.00M     Inst.buf. fetches:         9512
Floating adds/sec       :     9.57M     F.P. adds        :    200000514
Floating multiplies/sec :    11.96M     F.P. multiplies  :    250000430
Floating reciprocal/sec :     2.39M     F.P. reciprocals :     50000002
Cache hits/sec          :     9.59M     Cache hits       :    200397353
CPU mem. references/sec :    28.72M     CPU references   :    600421500

Floating ops/CPU second :    23.92M
```

The QUIZ: How can we speed this up?

### Tuning a C++ MPI Code with VAMPIR: Part II

[ Part II of III. Thanks to Jim Long of ARSC for this series of articles. ]

In part I, we described a port of the UAF Institute of Arctic Biology's Terrestrial Ecosystem Model (TEM) to the Cray T3E and a linux cluster, and examined performance using VAMPIR. In this article, we explore an optimization to the communication algorithm and discuss performance on ARSC's IBM SP3.

As shown in part I, VAMPIR images suggested that TEM might be tuned by:

1. overlapping computation on the master and slaves, and
2. having the slaves begin computing as soon as they receive new data.

The relevant abstracted code section from the original implementation is:

```
if (mype == 0) {
    currentPE = 1;
    while (currentPE < totpes) {
        READ CLIMATE DATA FOR CURRENT SLAVE (if available)
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Send CLIMATE DATA TO CURRENT SLAVE (many MPI_Send calls)
        currentPE++;
    }
}
else {
    for (currentPE = 1; currentPE < totpes; currentPE++) {
        MPI_Barrier(MPI_COMM_WORLD);
        if (mype == currentPE) MPI_Recv DATA FROM MASTER (many MPI_Recv calls)
    }
    COMPUTE WITH MY DATA
}
```

The MPI_Barrier call serves to synchronize the two loops so as not to overload MPI buffering. Unchecked buffering could be a real problem here because the code above sits inside a loop that may read files covering hundreds of simulated years.

The barriers also mimic the situation that would exist in a synchronous coupling with a climate model, i.e., when there is no new climate data for the master to read until the slaves have computed and sent their data to the climate model. In a synchronous coupling, the master must wait until a new climate is computed.

In a sensitivity analysis for an uncoupled TEM, however, the climate might well be prescribed (as it is now), and the master can read the next year's data and have it ready for the slaves when they need it. This addresses issue 1, above.

The fact that no slave can begin computation until all slaves receive their data was recognized in the original implementation, but was left unchanged since it mimics the worst case scenario that would exist in a global run with many slaves trying to read/write their data at the same time. Worst case simulation is not necessary, however, when a sensitivity analysis is desired for only Arctic latitudes. This addresses issue 2.

Thus, it was safe to tune the code by simply removing the barrier calls. This eliminates the "for" loop in the "else" clause. The first in the series of MPI_Sends was replaced with an MPI_Ssend. MPI_Ssend is a synchronous send that guarantees that the send will not return until the destination begins to receive the message. This effectively implements a barrier between the master and one slave only, when that slave begins to receive, instead of having to stop at an explicit barrier when each slave is receiving. A slave may now begin computation as soon as it receives its data. The tuned code looks like:

```
if (mype == 0) {
    currentPE = 1;
    while (currentPE < totpes) {
        READ CLIMATE DATA FOR CURRENT SLAVE (if available)
        MPI_Ssend for the first of many MPI_Send calls
        MPI_Send CLIMATE DATA TO CURRENT SLAVE (many MPI_Send calls)
        currentPE++;
    }
}
else {
    MPI_Recv DATA FROM MASTER (many MPI_Recv calls)
    COMPUTE WITH MY DATA
}
```
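The effect of replacing a global barrier with a per-slave handshake can be simulated off the Cray with ordinary threads. Here's a hedged Python sketch (names such as `channels` and `NSLAVES` are ours, not from TEM): the master hands data to one slave at a time through that slave's own bounded channel, and each slave starts computing as soon as its own receive completes, with no global barrier.

```python
import threading
import queue

# One bounded channel per slave, loosely mimicking the per-pair
# synchronization of MPI_Ssend: the master deals with one slave's
# channel at a time instead of stopping at a global barrier.
NSLAVES = 4
channels = [queue.Queue(maxsize=1) for _ in range(NSLAVES)]
results = [None] * NSLAVES

def master():
    # Send each slave its own data; no other slave is held up.
    for pe in range(NSLAVES):
        data = list(range(pe, pe + 3))   # stand-in for climate data
        channels[pe].put(data)           # blocks only if this slave lags

def slave(pe):
    data = channels[pe].get()            # receive...
    results[pe] = sum(data)             # ...and compute immediately

threads = [threading.Thread(target=master)]
threads += [threading.Thread(target=slave, args=(pe,)) for pe in range(NSLAVES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

Note that `queue.Queue(maxsize=1)` only approximates MPI_Ssend's rendezvous semantics (the put completes once the item is buffered, not strictly when the get begins), but the key property is the same: synchronization is between the master and one slave, not all processes at once.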

Results:

The general lesson here is to avoid global barriers if at all possible.

Figure 1

Figure 1 gives two VAMPIR images, comparing old vs new communication patterns for the T3E during equal timeslices of the TEM transient portion. The T3E showed a roughly 10% reduction in time for the transient portion of the run, which shows up as a reduction in the amount of time spent in (red) MPI calls for the slaves in the VAMPIR output. (In all of these VAMPIR images, green, which shows time spent doing computation, is good, while red, which shows necessary, but unproductive, time in the communication library, is bad.)

Figure 2

Figure 2 shows the communication pattern on the ARSC linux cluster, using ethernet, where an impressive 40% reduction in time during the transient portion is realized.

Of all platforms tested, MPI latency and bandwidth are worst on the cluster using ethernet, so it's no surprise that the benefit from tuning the communication algorithm is most dramatic here.

Figure 3

Figure 3 shows additional results from the ARSC linux cluster, but this time, using the myrinet network. This comparison shows about a 15% reduction in time during the transient portion.

Figure 4

Figure 4 is the promised look at results on ARSC's IBM SP3 (Icehawk) for an equal timeslice of the transient portion.

The original code ran in a blazing 9:55 (9 minutes, 55 seconds) total, while the tuned code ran in 8:31. The two transient portions ran in 3:50 and 2:25 respectively, a roughly 35% improvement in the tuned version for transient performance.

Since the compute time per time step is so low on the SP3, MPI accounted for a large percentage of the action, and hence a reduction in MPI time yields a large percentage improvement. The IBM SP3 is essentially cluster technology: four CPUs per shared-memory node, with nodes interconnected by a high-speed switch. Each CPU has an 8 MB L2 cache, so this code gets the combined benefit of large caches and high-performance CPUs.

In the next (and final) installment in this series, we address the question raised in part I. The problem is naturally parallel, so why doesn't it scale better? Is the tuned code more scalable?

### CUG Call for papers

CUG SUMMIT 2002, Manchester, United Kingdom, 20th to 24th May, 2002: Call for Papers

The CUG SUMMIT 2002 on high-performance computation and visualization will be held from May 20 through 24, 2002, in Manchester, United Kingdom. Our host will be the University of Manchester.

For further details about the CUG SUMMIT 2002 and electronic abstract submission, please visit the CUG home page at URL:

http://www.cug.org/

The deadline for electronic abstract submissions is January 25, 2002.

### ANSWER: Vectorization

Here's one answer...

First, ask the compiler why the loop didn't vectorize. Add the "-Onegmsgs" option (for "negative messages") to learn why desirable optimizations, like vectorization and tasking, were not applied:

```
f90 -O3,negmsgs -rm -o trap.serial trap.serial.f
```

Excerpts from the listing file, trap.serial.lst:

```
32.  1--<       do i = 0 , n-1
33.  1
34.  1              ! Right side of current trapezoid. Left side of next.
35.  1              side(MOD(i,2)) = f(a+(i+1)*h)
36.  1
37.  1              integral = integral + h*(side(0) + side(1)) / 2.0
38.  1-->       enddo

f90-6287 f90: VECTOR File = trap.serial.f, Line = 32
A loop starting at line 32 was not vectorized because it contains a call to
function "F" on line 35.
```

Ahhhh... We knew that! Function and subroutine calls inhibit vectorization. Recompile with inlining to eliminate the function call. As described in Quick-Tip #207, if the function were defined in a separate source file, we'd use "-Oinlinefrom=<FNM>". In this case, use "-Oinline4":

```
f90 -O3,inline4 -rm -o trap.serial trap.serial.f
```
The result:
```
31.  I----<>       side(1) = f(a+0*h)
32.  Vp----<       do i = 0 , n-1
33.  Vp
34.  Vp                ! Right side of current trapezoid. Left side of next.
35.  Vp I-<>           side(MOD(i,2)) = f(a+(i+1)*h)
36.  Vp
37.  Vp                integral = integral + h*(side(0) + side(1)) / 2.0
38.  Vp---->       enddo
```
"Vp" indicates the loop was "partially vectorized," which is encouraging. How much did this help?
```
CHILKOOT$ hpm ./trap.serial
With n = 50000000  trapezoids, our estimate
of the integral from  0.E+0  to  1.  =  3.141592425474428

Group 0:  CPU seconds   :    8.02257      CP executing     :     4011283880

Million inst/sec (MIPS) :     215.56      Instructions     :     1729334273
Avg. clock periods/inst :       2.32
% CP holding issue      :      44.27      CP holding issue :     1775631714
Inst.buffer fetches/sec :       0.00M     Inst.buf. fetches:           9477
Floating adds/sec       :      31.16M     F.P. adds        :      250000513
Floating multiplies/sec :      37.39M     F.P. multiplies  :      300000430
Floating reciprocal/sec :       6.23M     F.P. reciprocals :       50000002
Cache hits/sec          :      19.53M     Cache hits       :      156646453
CPU mem. references/sec :      31.99M     CPU references   :      256671541

Floating ops/CPU second :      74.79M
```

We got a 3-fold improvement, but 75 MFLOPS is still disappointing. Recompile again with "negative messages" to get guidance from the compiler:

```
f90 -O3,negmsgs,inline4 -rm -o trap.serial trap.serial.f
```

And the listing file shows:

```
f90-1204 f90: INLINE File = trap.serial.f, Line = 31
The call to F was inlined.

f90-6209 f90: VECTOR File = trap.serial.f, Line = 32
A loop starting at line 32 was partially vectorized.

f90-6511 f90: TASKING File = trap.serial.f, Line = 32
A loop starting at line 32 was not tasked because a recurrence was
found on "SIDE" between lines 35 and 37.
```

OF COURSE! There's a dependency in this loop. The value of "side" must be computed before "integral". This is probably inhibiting vectorization as well as parallelization.
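To see the recurrence concretely, here is a Python mirror of the loop (our own sketch, purely for illustration): each pass overwrites one slot of the two-element side buffer that the sum in the same pass and the next pass both read, so iteration i cannot begin until iteration i-1 has finished.

```python
import math

def f(x):
    # Same integrand as the Fortran function F.
    return 4.0 / (x * x + 1.0)

def trap_with_recurrence(a, b, n):
    # Mirrors the Fortran: side(MOD(i,2)) is written each iteration
    # and read back as side(0) + side(1), both here and in the next
    # iteration -- a loop-carried dependency that serializes the loop.
    h = (b - a) / n
    side = [0.0, f(a)]          # side(0), side(1); side(1) = f(a+0*h)
    integral = 0.0
    for i in range(n):
        side[i % 2] = f(a + (i + 1) * h)
        integral += h * (side[0] + side[1]) / 2.0
    return integral

print(trap_with_recurrence(0.0, 1.0, 100000))
```

The answer is still correct (it converges to pi, since the integral of 4/(1+x^2) from 0 to 1 is 4 arctan(1)); only the order of evaluation is constrained, which is exactly what the compiler is complaining about.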

It was clever to reuse the value of "side" for two adjacent trapezoids, but let's go back to the simplest coding of trapezoidal integration, and see what happens. Replacing the loop with this:

```
integral = 0
do i = 0 , n-1
integral = integral + h*( f(a+i*h) + f(a+(i+1)*h) )/2.0
enddo
```

should remove all dependencies. Recompile with:

```
f90 -O3,negmsgs,inline4 -rm -o trap.serial trap.serial.f
```

and we see this in the loopmark listing file:

```
25.                 integral = 0
26.  V------<       do i = 0 , n-1
27.  V I I-<>           integral = integral + h*( f(a+i*h) + f(a+(i+1)*h) )/2.0
28.  V------>       enddo

f90-6204 f90: VECTOR File = trap.serial.f, Line = 26
A loop starting at line 26 was vectorized.

f90-1204 f90: INLINE File = trap.serial.f, Line = 27
The call to F was inlined.
```

The fully vectorized version runs at 1600 MFLOPS and completes in 0.66 CPU seconds, a roughly 12-fold speedup in CPU seconds over the partially vectorized version:

```
CHILKOOT$ hpm ./trap.serial
With n = 50000000  trapezoids, our estimate
of the integral from  0.E+0  to  1.  =  3.141592650025629

Group 0:  CPU seconds   :    0.65617      CP executing     :      328084890

Million inst/sec (MIPS) :      48.57      Instructions     :       31873220
Avg. clock periods/inst :      10.29
% CP holding issue      :      88.36      CP holding issue :      289879773
Inst.buffer fetches/sec :       0.01M     Inst.buf. fetches:           9461
Floating adds/sec       :     609.60M     F.P. adds        :      400000584
Floating multiplies/sec :     838.20M     F.P. multiplies  :      550000426
Floating reciprocal/sec :     152.40M     F.P. reciprocals :      100000001
Cache hits/sec          :       0.61M     Cache hits       :         397575
CPU mem. references/sec :       0.64M     CPU references   :         421673

Floating ops/CPU second :    1600.20M
```

Can you add anything to this discussion? Feel free to comment.
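One small addendum: the "remove the recurrence, then operate on the whole array at once" idea carries over directly to array languages. Here's a hedged NumPy sketch of the fully vectorized trapezoid sum (function and variable names are ours, not from the quiz code):

```python
import math
import numpy as np

def trap_vectorized(a, b, n):
    # Evaluate the integrand at all n+1 grid points at once, then
    # combine with composite trapezoidal weights: h/2 at the two
    # endpoints, h at every interior point. No loop-carried state.
    x = np.linspace(a, b, n + 1)
    fx = 4.0 / (x * x + 1.0)
    h = (b - a) / n
    return h * (fx[0] / 2.0 + fx[1:-1].sum() + fx[-1] / 2.0)

print(trap_vectorized(0.0, 1.0, 1_000_000))
```

Since the integral of 4/(1+x^2) from 0 to 1 is 4 arctan(1) = pi, the printed value should agree with the Cray output above to many digits.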

### Next Newsletter

For you Santas out there, your cards from North Pole are on the way.

We're taking Dec 28th off, and will produce the next newsletter on Jan 4. Also, we're updating our technical reading list and plan to print it in the next issue. If you'd like to recommend a book, let us know.

A safe and happy holiday to everyone!

### Quick-Tip Q & A

```
A:[[ As I migrate my code between Crays, IBMs, and SGIs, I assume
[[ I can just stick with the default optimization levels.  Is this a
[[ good assumption?

Nope.  Okay on Crays and IBMs, but on SGIs, default optimization is NO
optimization.  Try -O2 on the SGIs for starters.  Also, see the Quiz
answer, above.

If you're going into production, the compiler is your friend.  It can
really pay to analyze your code.

Q: What are your "New Years 'Computing' Resolutions" ???

For example, "I resolve to learn python, change all my
passwords, and ???"

(Anonymity will be preserved when we list these in the Jan 4th
issue.)
```

[[ Answers, Questions, and Tips Graciously Accepted ]]

Current Editors:

- Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
- Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678

Arctic Region Supercomputing Center, University of Alaska Fairbanks, PO Box 756020, Fairbanks AK 99775-6020
Archives:
Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.