## ARSC HPC Users' Newsletter 287, February 27, 2004

### ARSC Events

Engineering Open House:
 When: Saturday, February 28, from 9am to 5pm
 Where: Duckering Building, ARSC Access Lab

ARSC's Duckering access lab will be open to the public with demonstrations of ARSC research as well as virtual reality demonstrations on the ImmersaDesk.

Discovery Tuesday:
 When: March 2, 1pm
 Where: ARSC Discovery Lab, 375C Rasmuson Library

The subject of ARSC's monthly presentation in the Discovery Lab is "Visualizing Chemistry: Tools for the Discovery Lab". It will be presented by Roger Edberg, ARSC Visualization Specialist, and Tom Marr, President's Research Professor in Bioinformatics.

### Vectorizing a Recurrence, Part II

As discussed in the last issue, the vectorization of a loop can be inhibited by a variety of factors, including recurrence. A recurrence occurs when a calculation in one iteration depends on a result computed in an earlier iteration.

Here's a reformulation of the example from the last issue, which produces a table of factorials:

```
RD = 1
F(0:RD-1) = 1
do i = RD,N
F(i) = F(i-RD) * i
enddo
```

As noted, the SX-6 has a special instruction which allows this unvectorizable loop to "vectorize."

Forgetting about the intent of the loop, what happens if "RD" (what I'm calling the "recurrence distance") is allowed to increase? With RD==2, neither the SX-6 nor X1 can vectorize it. With RD>=3, however, the X1 vectorizes the loop. This is not a big surprise if you're familiar with the concept of "safe vector length." The surprise (to me, anyway) is that, except when RD==1, the SX-6 doesn't vectorize the loop.

What is "safe vector length" and how can it be used to vectorize a recurrence?

Returning to the example, the computation of any block of "RD" elements, F(i) .. F(i+RD-1), depends only on the results already computed for the previous block. This is enumerated below for RD==5, where F'(i) indicates that element "i" has been updated by the loop, and F(i) indicates that the element has not been updated yet and retains its original value:

```
F'(20)
F'(21)
F'(22)
F'(23)
F'(24)
F(25) = F'(20) * 25
F(26) = F'(21) * 26
F(27) = F'(22) * 27
F(28) = F'(23) * 28
F(29) = F'(24) * 29
```

Since F(20)..F(24) have already been computed, it appears we could process the entire block, F(25) through F(29), simultaneously and still get the correct result.
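The block argument can be checked numerically on any machine. The following is a minimal sketch in shell and awk, not code from the X1 or SX-6: the function name `check_block` and the use of 0 as a stand-in for a stale, not-yet-updated element are assumptions made for illustration. It computes the factorial-style recurrence F(i) = F(i-RD) * i twice, once element by element and once in blocks that read all of their inputs before storing any result, which mimics a vector load followed by a vector store.

```shell
#!/bin/sh
# check_block RD VL N
#   RD = recurrence distance, VL = block ("vector") length, N = last index.
#   Prints "match" if the blocked loop reproduces the serial loop,
#   "differ" otherwise.  (Hypothetical helper, for illustration only.)
check_block() {
    awk -v rd="$1" -v vl="$2" -v n="$3" 'BEGIN {
        # Initial values F(0..RD-1) = 1 in both copies.
        for (i = 0; i < rd; i++) { s[i] = 1; v[i] = 1 }
        # ASSUMPTION: 0 stands in for an element that is not updated yet.
        for (i = rd; i <= n; i++) v[i] = 0

        # Serial reference: one element at a time.
        for (i = rd; i <= n; i++) s[i] = s[i - rd] * i

        # Blocked version: read every input of the block, then store.
        for (b = rd; b <= n; b += vl) {
            hi = (b + vl - 1 < n) ? b + vl - 1 : n
            for (i = b; i <= hi; i++) t[i] = v[i - rd]   # "vector load"
            for (i = b; i <= hi; i++) v[i] = t[i] * i    # "vector store"
        }

        ok = 1
        for (i = rd; i <= n; i++) if (s[i] != v[i]) ok = 0
        print (ok ? "match" : "differ")
    }'
}

check_block 5 5 40    # block length equal to the recurrence distance
check_block 5 8 40    # block length larger than the recurrence distance
```

With a block length of 5 the two versions agree. With a block length of 8 the blocked version reads elements that have not been updated yet, so the results differ.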

A single vector instruction on the X1 produces a block of 64 "simultaneous" results, so if we made RD==64 (or more), the loop should be able to vectorize. This is exactly what happens, but it gets even better.

The Cray compilers are able to process blocks of fewer than 64 elements, and thus even the loop with RD==5 would vectorize. The term "safe vector length" means the size of the block, for a given loop, which is safe to process simultaneously. The compiler tries to determine the safe vector length for a recurrence like the one above automatically. If "RD" were defined in the program as a parameter, then the safe vector length for the above loop would be known at compile time.

If RD were computed, or read in from a file, then it wouldn't be known until run-time. In this case, the compiler inserts instructions to compute the safe vector length at run-time. Alternatively, if the programmer knows in advance what the safe vector length should be, he or she can insert an "IVDEP" directive, like the following, into the code:

```
!DIR$ IVDEP SAFEVL=5
```

(I've searched the SX-6 manuals and not found a concept similar to "safe vector length," and tested various recurrence loops on the SX-6 which vectorize on the X1 and they don't vectorize. If I've missed it, let me know!)

Here's a program to see the effect of vectorizing recurrences using safe vector length:

```
      program veclen
      implicit none
      integer(kind=8),parameter :: RD=RECURDIST
      integer(kind=8),parameter :: N=100000000
      real(kind=8), dimension(0:N + RD) :: F
      integer(kind=8) :: i, j

      call random_number (F(0:RD))  ! F(0..RD) are read before set

      do j = 0, 10
        do i = RD + 1, N + RD
          F(i) = (F(i-RD) + F(i-RD-1)) / 2.0
        enddo
      enddo

      do i = 0,16
        write (*,"(I12,'  ', E)") 2**i, F(2**i)
      enddo
      end
```

Here's an X1 command to compile it, and set RD=3:

```
ftn -Omsgs -Onegmsgs -o veclen -eZ -F -D RECURDIST=3 veclen.f
```

In this command, "-Omsgs" asks the compiler to tell how it optimized the loops, "-Onegmsgs" asks it to tell why it couldn't perform various optimizations.

When RECURDIST=1 or 2, the ftn compiler gives us this "negmsg" (the inner loop "do i = RD + 1, N + RD" is line 11):

```
A loop starting at line 11 was not vectorized because a recurrence
was found on "F" at line 12.
```
When RECURDIST=3, the compiler gives us this "msg" (it gives a similar "msg" for any RECURDIST up to 63):
```
A loop starting at line 11 was vectorized with a vector
length of 3.
```
When RECURDIST=64 or greater, the compiler gives us this "msg":
```
A loop starting at line 11 was vectorized.
```

This last message tells us that the loop was unconditionally vectorized, using the maximum X1 vector length (the size of the vector registers) of 64.

Here's a script to automate the process of recompiling and running the code. (The X1 command, "pat_hwpc," dumps performance information for each run, like setting F_PROGINF=DETAIL on the SX-6 or running "hpm" on the SV1ex).

```
#!/bin/ksh

for RD in 1 2 3 4 5 6 7 8 10 20 30 40 50 100 200 300 400 500 1000
do
  echo "========================="
  echo "Recompiling with RD=${RD}"
  ftn -o veclen -eZ -F -D RECURDIST=${RD} veclen.f
  pat_hwpc ./veclen
done
```

And, having run the script, here's a table of results. In this table, each line is one run of the program, where:

```
RD:      Value of RECURDIST used
CPUT:    CPU time, in seconds (from pat_hwpc)
MFLOPS:  MFLOPS, total for the run (from pat_hwpc)
VLEN:    Average vector length used by all vector instructions (from pat_hwpc)
VINST:   Total number of vector instructions in the run (from pat_hwpc)

                    X1
-------------------------------------------
  RD     CPUT   MFLOPS    VLEN        VINST

   1   311.11      7.1   61.19        33470
   2   263.63      8.3   61.58        33470
   3    76.73     28.7    3.00   1833366827
   4    58.79     37.4    4.00   1375033453
   5    47.17     46.6    5.00   1100033457
   6    39.90     55.1    6.00    916700142
   7    34.31     64.1    7.00    785747782
   8    30.11     73.1    8.00    687533453
  10    24.24     90.8   10.00    550033457
  20    12.77    172.3   20.00    275033457
  30     8.72    252.3   30.01    183366827
  40     7.04    312.6   40.01    137533457
  50     5.86    375.3   50.00    110033457
 100     4.29    513.1   64.00     85970964
 200     2.12   1037.2   64.00     85970978
 300     1.81   1212.6   64.00     85970985
 400     1.52   1444.7   64.00     85970999
 500     1.49   1476.3   64.00     85971006
1000     1.49   1476.9   64.00     85971062
```

The number of operations done in each run is nearly identical, so looking at the CPUT column, the benefit of longer vector length is clear. Since each VINST performs VLEN actual operations, the inverse relationship between VLEN and VINST is as expected.
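The inverse relationship can be checked with a little arithmetic. The following is a minimal sketch, assuming that each strip of the vectorized inner loop issues about five vector instructions; that count is inferred from the X1 numbers, not taken from compiler output, and `vinst_estimate` is a name invented here. The trip counts come from the test program: N = 100000000 inner iterations, repeated 11 times by "do j = 0, 10".

```shell
#!/bin/sh
# vinst_estimate VLEN
#   Estimated number of vector instructions for a run that uses the
#   given vector length.  (Hypothetical helper, for illustration only.)
vinst_estimate() {
    awk -v vlen="$1" 'BEGIN {
        n = 100000000     # inner-loop trip count N from veclen.f
        outer = 11        # "do j = 0, 10" runs 11 times
        per_strip = 5     # ASSUMPTION: vector instructions per strip
        printf "%.0f\n", per_strip * outer * n / vlen
    }'
}

for vlen in 3 4 5 64
do
  echo "VLEN=${vlen} estimated VINST=$(vinst_estimate "${vlen}")"
done
```

The estimates (1833333333, 1375000000, 1100000000 and 85937500) fall within about 0.04% of the measured VINST values for RD = 3, 4, 5 and for the runs with VLEN = 64. The leftover difference of roughly 33,000 instructions is close to the VINST count reported for the unvectorized RD=1 and RD=2 runs, which suggests it is overhead outside the inner loop.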

One might ask why MFLOPS tops out at about 1500 on this 12.8 GFLOPS multi-streaming processor. Part of the answer is clear from another of the "negmsgs" returned by the compiler:

```
A loop starting at line 11 was not multi-streamed because a
recurrence was found on "F" at line 12.
```

While the compiler can vectorize the recurrence, it can't multi-stream it. Thus, the loop is confined to only one of the four 3.2 GFLOPS single-streaming processors which together comprise one (multi-streaming) processor.

For comparison, here are SX-6 results for the same program and a similar script. The SX-6 compiler command is:

```
f90 -Ep -D RECURDIST=${RD} -Wf"-pvctl infomsg" -o veclen veclen.f
```

Performance numbers are taken from output of F_PROGINF=DETAIL.

```
                   SX-6
-------------------------------------------
  RD     CPUT   MFLOPS    VLEN        VINST

   1    38.68     56.9  185.88          186
   2    28.97     76.1  185.92          186
   3    28.51     77.5  186.60          184
   4    23.13     95.1  186.02          186
   5    23.14     95.1  186.70          184
   6    24.28     90.6  186.74          184
   7    23.39     94.1  186.79          184
   8    23.29     95.3  186.22          186
  10    23.73     92.7  186.94          184
  20    23.42     93.9  187.43          184
  30    24.04     91.5  187.92          184
  40    21.38    102.9  188.41          184
  50    21.38    102.9  188.90          184
 100    21.05    104.6  191.34          184
 200    22.20     99.1  186.47          195
 300    21.21    103.8  182.11          206
 400    21.09    104.3  178.20          217
 500    21.26    103.5  182.35          217
1000    20.95    105.0  172.74          261
```

Note that, unlike in the factorial example, the SX-6 doesn't vectorize the recurrence in this test program, even when RD==1. The table shows no sign of vectorization and no dramatic improvement as RD increases.
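To put a number on the difference, the CPU times at RD=1 and RD=1000 can be compared for each machine. This is only a small sketch using values copied from the two tables; the helper name `speedup` is invented for the illustration.

```shell
#!/bin/sh
# speedup SLOW FAST
#   Ratio of two CPU times, printed to one decimal place.
#   (Hypothetical helper, for illustration only.)
speedup() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", slow / fast }'
}

# CPUT values for RD=1 and RD=1000, taken from the tables above.
echo "X1   RD=1 -> RD=1000: $(speedup 311.11 1.49)x"
echo "SX-6 RD=1 -> RD=1000: $(speedup 38.68 20.95)x"
```

The X1 run time improves by a factor of about 209, while the SX-6 run time improves by less than a factor of 2.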

### Programming Environment Upgrade on Klondike

On 2/25/2004, we upgraded the default programming environment on the X1 from PE 5.0 to PE 5.1. At this point:

```
PrgEnv.old:            points to the former PrgEnv (5.0)

PrgEnv [the default]:  points to PE 5.1

PrgEnv.new:            points to the current PrgEnv (5.1),
                       but can be updated with little
                       notice and no internal review
                       as Cray releases new versions of
                       compilers and other PE components.
```
Anyone needing to conduct tests using the old PE can switch back with the command:
```

module switch PrgEnv PrgEnv.old
```

For more on programming environments and "module" commands read "news prgenv", "man module", or contact consult@arsc.edu.

### PEvers Utility Available on Klondike

Thanks to John Metzner of Cray Inc. for porting the "PEvers" tool to the X1. This shows you all available versions of the programming environment products, and most importantly, shows which is the default.

The default PE is what you get with the following command, which is included in every user's .profile or .login shell startup file:

```
```

Here's a portion of the output of PEvers, giving information on the ftn compiler:

```
KLONDIKE:baring$ PEvers
The following Programming Environment Packages are installed:
=============================================================
/opt/ctl/cftn
4.3.0.0
4.3.0.1
4.3.0.2
4.3.0.3
5.0.0.0
5.0.0.1
5.0.0.2
5.0.0.3
5.0.0.4
5.1.0.0
5.1.0.3
5.1.0.5

The current default version is 5.1.0.3.

=============================================================
```

PEvers is available on the X1, T3E, and SV1ex at ARSC.

### Quick-Tip Q & A

```
A:[[ A word processor I use on another operating system is always
[[ changing what I type.  For instance "..." becomes one character which
[[ looks like three dots spaced a little differently.  Text export of a
[[ file containing these things doesn't fix them.  How do I get rid of
[[ these when I ftp the file to my Unix box, where my "..." now looks
[[ like "\311" ?

#
# Thanks to Andrew Markiel
#

1) Somewhere in the menus is an "AutoCorrect" menu. Turn off all of the
auto-correction features.

2) Use OpenOffice (openoffice.org), which is a free open-source
cross-platform version of Office. It'll still AutoCorrect your text
(unless you turn it off), but you don't have to convert the file in

3) Use JEdit (www.jedit.org), which is a free open-source cross-platform
Java-based source-code editor (which also works OK for text). It does a
better job of avoiding AI (artificial ignorance).

#
# Thanks to Greg Newby:
#

Changing '...' to an ellipsis character is a form of auto-correction,
just like your word processor probably changes 'hte' to 'the'.  You
should be able to turn this off for any substitutions that you would
rather not have.  Other common and annoying substitutions include
replacing '(c)' with a circle-C, and automatically superscripting 'tm'.
Turning these substitutions off will save you trouble later.

If you can't prevent the strange characters, try the "recode" command
(ftp://ftp.gnu.org/gnu/recode).  This takes a file and changes it from
one character set to another.

Something along these lines would work (but check first with a backup
file; add "-sqf" to force one-way transformations):
recode cp1252..latin1 filename.txt
or      recode cp1252..ascii filename.txt
or      recode cp1252..dos filename.txt

(there are different input and output character sets; use "recode -l"
for a listing).

#
# From the editor
#

If you've got a file containing unwanted codes here's another way to
work them out:

1) look at the file (or part of it) using "od -c".  This shows the
octal codes for the non-printing characters:

%    od -c file.txt
0000000  \r   t   h   i   s       i   s       a       l   i   t   t   l
0000020   e       d   e   m   o       o   f       t   h   e     311
0000040   f   e   a   t   u   r   e   s       o   f  \r   t   h   e
0000060   a   u   t   o   c   o   r   e   c   t     311       f   u   n
0000100   c   t   i   o   n       o   f       t   h   i   s       w   o
0000120   r   d       p   r   o   c   e   s   s   o   r     252   ,
0000140   h   e   r   e   .      \r  \r  \n
0000151

2) Decide on suitable replacements for the non-printing codes.  For
instance, change octal 311 to "...", 252 to "(tm)", and carriage
returns to newlines.

3) The following perl command will do it, printing the results to
stdout, like this:

%    perl -p -e 's/\r/\n/g; s/\252/\(tm\)/g; s/\311/.../g;'  file.txt

this is a little demo of the ... features of
the autocorect ... function of this word processor (tm), here.

Q: Is there a way to invalidate my kerberos ticket before I trot off
to lunch?  It seems a little risky to leave valid tickets sitting on
my workstation when I'm not around.
```

[[ Answers, Questions, and Tips Graciously Accepted ]]

Current Editors:
 Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
 Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678
 Arctic Region Supercomputing Center
 University of Alaska Fairbanks
 PO Box 756020
 Fairbanks AK 99775-6020