## Contents

### Amdahl's Law

This is depressing. But read it before you buy that 500-node cluster!

When migrating a code to a parallel or vector system, the maximum speedup to expect can be quantified using Amdahl's Law.

A simple application of it is to assume that each portion of a code runs at one of two discrete speeds: fast or slow. The "fast" portions are, for instance, the vectorized or parallelized ones; the "slow" portions are the scalar or serial ones.

Amdahl's Law predicts speedup as a function of the percentage of the code which is "fast" and the ratio of the "fast" speed to the "slow" speed.

The limit to speedup is fairly obvious. Say the code does 95% of its work "fast," and that "fast", in the limit, means instantaneous. The remaining 5% of the work will still remain to be done, and will drudge along at the "slow" speed.
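Worked through for this example, taking the single-processor run time as 1 and letting the "fast" 95% shrink to nothing, the bound is:

$$
S_{\max} = \frac{1}{1 - 0.95} = \frac{1}{0.05} = 20
$$

So no amount of added speed in the "fast" portion can make this code more than 20 times faster overall.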

Thus, in the case of a parallel code, Amdahl's Law traces the diminishing returns gained by throwing more processors at a problem. A simple (if unrealistic) assumption is that the ratio of the "fast" speed to the "slow" speed equals the number of processors. A sample result:

Maximum speedup for a code which is 95% parallelized:

              Maximum
              Overall
    NPES      Speedup
    -----     -------
    1            1.00
    10           6.90
    50          14.49
    100         16.81
    1000        19.63
    10000       19.96

This is depressing. What if you could squeeze out more parallelism?

Maximum speedup for a code which is 99% parallelized:

              Maximum
              Overall
    NPES      Speedup
    -----     -------
    1            1.00
    10           9.17
    100         50.25
    1000        90.99
    5000        98.06
    10000       99.02

Again, depressing... But this doesn't necessarily argue against buying more processors.

Peter Pacheco, in Parallel Programming with MPI (ISBN 1-55860-339-5), notes that Amdahl's Law assumes a fixed problem size. However, an important motivation for buying bigger machines is not so much to solve the same problem faster as to solve larger instances of the problem. Much of the work done on big data sets is the parallel work; thus, as the problem grows, the parallel fraction can also grow, and the asymptote defined by Amdahl's Law can improve.
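To make that argument concrete, here is a minimal sketch in C of a toy model. The model is an assumption made for illustration, not something taken from Pacheco's book: the serial work is a fixed cost, and the parallel work grows linearly with the problem size. The function names and parameters are hypothetical.

```c
#include <assert.h>

/* Toy model (an assumption for illustration): serial work is a fixed
   cost, while parallel work grows linearly with the problem size n.
   Returns the resulting parallel fraction, between 0 and 1. */
double parallel_fraction(double serial_work, double parallel_work_per_unit,
                         double n)
{
    double parallel_work = parallel_work_per_unit * n;
    return parallel_work / (serial_work + parallel_work);
}

/* Amdahl asymptote: the speedup limit as the processor count grows
   without bound, for a parallel fraction f strictly less than 1. */
double max_speedup(double f)
{
    assert(f >= 0.0 && f < 1.0);
    return 1.0 / (1.0 - f);
}
```

Under this model, with 5 units of serial work and 95 units of parallel work per unit of problem size, `max_speedup(parallel_fraction(5.0, 95.0, 1.0))` gives the familiar limit of about 20, while a problem five times larger is about 99% parallel and has a limit of about 96.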

These tables certainly argue, however, against using ever-increasing numbers of processors on the same fixed-size problem. There is likely an optimum number which balances the performance limit shown by Amdahl's Law, overhead (such as communication costs), and the cost in dollars of additional processors.
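As a rough sketch of how such an optimum can arise, the C code below adds an overhead term to Amdahl's Law. The overhead model is purely an assumption for illustration: each processor beyond the first is taken to add a fixed cost `c`, expressed as a fraction of the single-processor run time. Real communication costs rarely behave this simply, and the function names are hypothetical.

```c
#include <assert.h>

/* Speedup on n processors when run time, relative to one processor, is
   modeled as (1 - f) + f/n + c*(n - 1). The last term is an assumed
   overhead that grows linearly with the processor count. */
double speedup_with_overhead(double f, double c, int n)
{
    assert(n >= 1);
    return 1.0 / ((1.0 - f) + f / n + c * (n - 1));
}

/* Scan 1..max_npes and return the processor count with the best
   modeled speedup (the smallest such count if there is a tie). */
int best_npes(double f, double c, int max_npes)
{
    int best = 1;
    for (int n = 2; n <= max_npes; n++) {
        if (speedup_with_overhead(f, c, n) > speedup_with_overhead(f, c, best))
            best = n;
    }
    return best;
}
```

For a 95% parallel code with `c = 0.0001`, `best_npes(0.95, 0.0001, 10000)` lands on 97 processors, with a modeled speedup of about 14.4. At 10000 processors the same model predicts a run slower than on one processor.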

In the case of a vector code, the maximum ratio of "fast" to "slow" is fixed by the architecture and is the ratio of vector speed to scalar speed. On the SV1, the ratio is 20. A simple (if, again, unrealistic) assumption is that all vectorized sections of a code will run 20x faster than the scalar sections. In this situation, we can use Amdahl's Law to examine the benefit of increased vectorization, for instance:

Maximum speedup where vector speed is 20X scalar speed:

                   Maximum
    % of code      Overall
    vectorized     Speedup
    ----------     -------
     20.0             1.23
     60.0             2.33
     80.0             4.17
     90.0             6.90
     92.0             7.94
     94.0             9.35
     96.0            11.36
     98.0            14.49
    100.0            20.00

Here's a comment on vectorization from Cray's Online Document, "Optimizing Application Code on UNICOS Systems, 004-2192-003" (available at http://www.arsc.edu:40/ ):

It is not always easy to reach 70% to 80% vectorization in a program, and vectorizing beyond this level becomes increasingly difficult, usually requiring major changes to the algorithm. Many users stop their vectorization efforts once the vectorized code is running 2 to 4 times faster than scalar code.
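That "2 to 4 times" range is consistent with Amdahl's Law if we assume, as in the table above, a vector-to-scalar ratio of 20. (The quoted Cray document does not state a ratio, so this is only an illustration.) At 70% and 80% vectorization:

$$
S_{70\%} = \frac{1}{0.30 + 0.70/20} = \frac{1}{0.335} \approx 2.99
\qquad
S_{80\%} = \frac{1}{0.20 + 0.80/20} = \frac{1}{0.24} \approx 4.17
$$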

By now, I'm sure you're dying to see Amdahl's Law itself. Here's a formulation for parallel code:

$$
S = \frac{1}{(1 - F) + F / S_p}
$$

Where:

• $S$ is the overall speedup
• $F$ is the fraction of the code that is parallelized
• $S_p$ is the speedup achieved in the parallel sections of the code
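As a minimal sketch, the formula translates directly into C. The names used here (`amdahl_speedup`, `print_speedup_table`, `f`, `sp`) are illustrative choices, not part of any ARSC or Cray tool. The table helper makes the same simple assumption as the parallel tables above: that the speedup of the parallel sections equals the number of processors.

```c
#include <assert.h>
#include <stdio.h>

/* Amdahl's Law: overall speedup for a code whose "fast" fraction is f
   (between 0 and 1) when the fast sections run sp times faster than
   the slow sections. */
double amdahl_speedup(double f, double sp)
{
    assert(f >= 0.0 && f <= 1.0 && sp > 0.0);
    return 1.0 / ((1.0 - f) + f / sp);
}

/* Print a speedup table for parallel fraction f, one row per entry in
   npes[], assuming (unrealistically) that sp equals the processor count. */
void print_speedup_table(double f, const int *npes, int n)
{
    printf("NPES      Speedup\n");
    printf("-----     -------\n");
    for (int i = 0; i < n; i++)
        printf("%-9d %7.2f\n", npes[i], amdahl_speedup(f, (double)npes[i]));
}
```

The same function covers the vector case: `amdahl_speedup(0.98, 20.0)` corresponds to the 98% row of the vectorization table.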

But don't get out your calculator: there's an unlikely UNICOS command, "amlaw," that will compute maximum speedups for you. For instance:

    CHILKOOT$ amlaw 100000000 99

    100000000 CPUs, 99.00% Parallelism:
    Max Theoretical Speedup is 100.00

    CHILKOOT$

Also, for Cray PVP users, the "atexpert" tool predicts your code's speedup if you were to run it on multiple dedicated CPUs. It provides a variety of statistics and recommendations, including a graph showing ideal speedup (as predicted by Amdahl's Law) and your code's predicted actual speedup. The difference between the two curves is the "overhead", which is a topic for another article!

### UAF Courses

The following two spring semester courses may be of particular interest to ARSC Users:

MATH 660: Advanced Mathematical Modelling, Spring 2001

Instructor: D. L. Hicks, a Visiting Professor in DMS/ARSC with experience as a researcher/consultant at various national laboratories (e.g., SNLA, AFWL, INEEL) engaging in research in mathematical models, computational science, and supercomputing.

Meeting Place: Chapman 107
Meeting Time: TR 9:45am-11:15am
email: dlhicks@Liebrock-Hicks.com
telephone: 457-5817

Description: The models to be discussed are of fundamental importance in computational science and are the dominant driving forces behind supercomputing. Models of material dynamics, finance, economics, ecology, biology, and so forth, will be presented. This course will be especially valuable to students interested in computational X programs, where, for example, X = applied mathematics, biology, chemistry, engineering, physics, or science. Dr. Hicks invites those interested to come and share insights into this fascinating area.

CS693: Parallel Programming for Scientists.

Instructor: Guy Robinson, MPP Specialist and Scientific Liaison at ARSC, with a background in scientific computing at the European Center for Parallel Computing, Vienna, Austria; the Northeast Parallel Architectures Center, Syracuse, NY; and the European Center for Medium Range Weather Forecasts, UK.

Meeting Time: R 07:00P-08:00P
Meeting Place: Butrovich 007
email: robinson@arsc.edu
telephone: 474-6386

Description: The basics of scientific parallel computing will be taught, with specific attention given to programming using MPI and OpenMP, the two established paradigms. After completing this course, students will be able to design, debug, and assess the performance of simple parallel codes.

### Holiday Programming Contest

Fun for the holidays! Prizes!

Recent articles in this newsletter have focused on SV1 performance. Here's a challenge for those who might want to exercise their minds over the holiday period.

Achieve over 1 GFLOP/S on a single SV1 processor.

No contest would be complete without rules:
1. Your entry must arrive by 5pm Alaska time on January 8, 2001. Send it to "hpc_users@arsc.edu".

2. GFLOP/S will be measured, for the entire code, using hpm.

3. Your program must complete its run in under 2 minutes of CPU-time, again, as measured by hpm.

4. Your program must have fewer than 200 lines of source code. Documentation and white space won't count against this.

5. You must provide the code, the command for compilation, the command for running it, output from a sample run, and any necessary discussion. It must compile and run on chilkoot, where we'll do the timings.

6. Your entry will be judged under various objective and subjective criteria and the judges' decision will be final.

    • The code which gets the most MFLOP/S is likely to win.
    • Credit will be given for code producing a useful output.
    • Credit will be given for programming style and illustration of optimization techniques and issues. We want to learn from this!

7. Excerpts from any entry may be published in the newsletter. Please point out the most useful or interesting aspects of your solution.

8. A prize will be awarded for every entry which follows the rules and exceeds 1 GFLOP/S, but the winner will get the BEST prize.

9. If you submit your entry on punch cards, pregnant chads will NOT be considered.

10. If, for some reason, these rules need adjusting, we'll do so in the web edition:

/arsc/support/news/hpcnews/hpcnews210/index.xml

You might take a last look before submitting your entry.

### Quick-Tip Q & A

A:[[ My MPI code doesn't know until runtime how many messages the PEs
[[ will be exchanging.  Given this problem, how can I match a receive
[[ to every send, as required by MPI?

This question implies that a loop iterates over all the sends and
receives, and that the number of iterations is fixed at compile
time. Here are two alternative algorithms to consider:

1) If the sender computes the number of sends, NMSGS, before entering
the loop, it could inform the receiver of this number in a
preliminary exchange.  Then they could iterate together in lock step
over the NMSGS sends/recvs.

2) Assuming the program would be correct even if the receiver were made
to accept messages with any tag, you could define some unused tag as
a flag to mark the end of the sends. E.g.:

#define TAG_ALLDONE   9999

The sender would send one extra message, using TAG_ALLDONE, to
terminate the exchange. On receiving the message with this tag, the
receiver would bail out of its receive loop.  E.g.,

    for (;;) {
        MPI_Recv(data, data_sz, MPI_FLOAT, source, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);

        if (status.MPI_TAG == TAG_ALLDONE)
            break;

        /* process normal message */
        ....
    }
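
The first alternative, the preliminary exchange of NMSGS, can be sketched in the same style. This is a minimal sketch under stated assumptions, not a definitive implementation: the tag values `TAG_COUNT` and `TAG_DATA` and the function names `send_all` and `recv_all` are hypothetical, one fixed-size buffer is reused for every message, and the code needs an MPI library and launcher to build and run.

```c
#include <mpi.h>

#define TAG_COUNT 1   /* hypothetical tag for the preliminary count message */
#define TAG_DATA  2   /* hypothetical tag for the ordinary data messages */

/* Sender: announce how many messages will follow, then send that many. */
void send_all(float *data, int data_sz, int nmsgs, int dest)
{
    MPI_Send(&nmsgs, 1, MPI_INT, dest, TAG_COUNT, MPI_COMM_WORLD);

    for (int i = 0; i < nmsgs; i++) {
        /* fill data[] for message i here */
        MPI_Send(data, data_sz, MPI_FLOAT, dest, TAG_DATA, MPI_COMM_WORLD);
    }
}

/* Receiver: learn the count first, then post exactly that many receives. */
void recv_all(float *data, int data_sz, int source)
{
    int nmsgs;
    MPI_Status status;

    MPI_Recv(&nmsgs, 1, MPI_INT, source, TAG_COUNT, MPI_COMM_WORLD, &status);

    for (int i = 0; i < nmsgs; i++) {
        MPI_Recv(data, data_sz, MPI_FLOAT, source, TAG_DATA,
                 MPI_COMM_WORLD, &status);
        /* process message i here */
    }
}
```

Compared with the TAG_ALLDONE approach, this costs one extra message up front rather than one at the end, and it lets the receiver keep matching on a specific tag instead of MPI_ANY_TAG.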

Q: I use "mpirun" to launch jobs on my cluster and it's a hassle.
The command lines get elaborate and long, and I make typos and forget
various options and flags.  Any suggestions?

[[ Answers, Questions, and Tips Graciously Accepted ]]

Current Editors:

Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678

Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.