ARSC T3E Users' Newsletter 136, February 06, 1998
Beowulf Seminar
Thomas Sterling of JPL/Caltech will be the featured guest at a seminar on Beowulf, to be held at ARSC on March 5th and 6th. More information to follow.
ARSC Spring Training Schedule
February 11:  Introduction to ARSC: Is there a supercomputer in your future?
February 18:  Introduction to the YMP
February 25:  Introduction to Parallel Programming
March 4:      Introduction to NQS
March 11:     Introduction to CRL
March 25:     Scientific Visualization at ARSC
For details and registration forms, see: http://www.arsc.edu/user/Classes.html .
Monte Carlo Methods: Introduction
This is the first in a four-part series, which will introduce the following aspects of Monte Carlo methods: history and mathematical theory (contributed by Dr. Giray Okten); a sample parallel implementation (Tom Baring); and scientific applications and advanced MPI support (Guy Robinson).
[ Many thanks to Dr. Giray Okten, Visiting Professor of Mathematics, University of Alaska, Fairbanks, for contributing the first two articles in this series. ]
Monte Carlo methods were developed mainly in the 1940s by mathematicians and scientists, including von Neumann, Fermi, Ulam, and Metropolis, working on the development of nuclear weapons at Los Alamos during World War II. It is believed that the term "Monte Carlo" was coined by von Neumann and Ulam as a code word for these methods, suggesting their "probabilistic" nature. Indeed, these methods were first applied to problems with an apparently probabilistic structure.
The first documented application of the ideas behind Monte Carlo methods appears in a paper by G. Comte de Buffon in 1777. He described an experiment in which a needle of length "L" is thrown randomly onto a floor ruled with parallel straight lines a distance "D" apart. He then suggested estimating the probability "P" that the needle will intersect a line by throwing the needle many times and calculating the ratio of the number of throws intersecting a line to the total number of throws. He also carried out the mathematical analysis and calculated the exact probability.
Later, Laplace suggested that this idea could be used to evaluate "PI" from the throws of the needle. This is referred to as "Buffon's Needle Experiment," and proceeds as follows:
For L <= D, the exact probability, P, that the needle intersects one of the parallel lines is:
P = (2 * L) / (PI * D).
If we drop the needle (or simulate dropping it) N times, and count R intersections, we obtain an experimental estimate for P,
P = (R / N),
and thus, an estimate for PI:
PI = (2 * L * N) / (R * D).
As explained below, the estimate improves as the number of throws increases.
(Visit the web site:

  http://stud2.tuwien.ac.at/~e9527412/Buffon.html

if you would like to try this experiment online and see the estimate for PI for various numbers of throws!)
Monte Carlo methods can be used not only in naturally stochastic problems, but also in deterministic problems. In general, any quantity that can be written as the expected value of a random variable defined on a probability space can be estimated by these methods, by calculating the average value of the random variable.
In less mathematical language, we can restate this in the following way. Suppose you have a problem and you want to estimate a certain quantity. Simulate the problem using random numbers (we will discuss random numbers later) and obtain an estimate for the quantity of interest. Run this simulation a number of times and take the average of the different estimates obtained as the outcomes of these simulations. This average will be a good estimate if the number of times you run the simulation (the sample size) is large.
The reason we can use averages to estimate the quantity of interest comes from the strong law of large numbers, which states that the sequence of sample means of a random variable converges to its expected value almost surely, provided the expected value is finite. In addition, by the central limit theorem, the error of this approximation satisfies the probabilistic bound O(1/sqrt(N)), where N is the sample size.
A simple application is numerical integration, where one uses the averages of function values evaluated at random coordinates to estimate the definite integral of the function, say, on the unit cube. By "random" numbers, we mean independent and identically distributed realizations of the random variable with the uniform distribution on the unit cube.
The concepts of random variable, independence, and identically distributed are well-defined concepts of probability theory. However, when we carry these concepts into the physical world, especially in an attempt to generate random numbers, we face several difficulties, confusion, and an ongoing debate. When we think of a random process, we often think of events whose outcomes cannot be determined, such as the scattering of an electron by an atom, or simply the outcome of flipping a fair coin. However, the fact that we cannot determine the outcome of flipping a fair coin is not a property of the physical process itself, but merely an indication of our understanding of the process, making it difficult to define uncertainty, or randomness, in the physical world. In spite of this difficulty (or maybe because of it), several mathematicians and philosophers, including Kolmogorov, Martin-Löf, Chaitin, and Popper, have suggested various definitions of random numbers.
Whatever our understanding of a random event or a random sequence of numbers might be, we go ahead and try to find algorithms, mainly using arithmetical operations, that generate numbers that look random. These imitations are called pseudorandom numbers, and the success of the imitation is judged by statistical tests, often called tests of randomness. Some of the methods that generate pseudorandom numbers, as well as some statistical tests, can be found at:
There are also other interesting links on this page.
"Any one who considers arithmetical methods of producing random digits is, of course, in a state of sin. For, as has been pointed out several times, there is no such thing as a random number -- there are only methods to produce random numbers, and a strict arithmetical procedure is of course not such a method. (It is true that a problem we suspect of being solvable by random methods may be solvable by some rigorously defined sequence, but this is a deeper mathematical question than we can go into now.)" [von Neumann]
Perhaps the deeper mathematical question von Neumann mentioned was solved, at least partly, by the theory of uniform distribution (a branch of number theory), which claims that the reason our random methods work has nothing to do with their randomness, but with their uniformity. The uniformity of a sequence is a number-theoretic concept: it is defined without using probabilistic concepts. This leads us to the notion of quasi-Monte Carlo methods, or number-theoretic methods.
Next week: "Monte Carlo Methods: Quasi-Monte Carlo and Low-Discrepancy Sequences"
Data Types and Sizes
One common problem encountered when porting code between different systems is the size of the data representations used. Language standards and definitions rarely specify the number of bytes that make up each data type. Indeed, given the rapid pace at which processors change and memory sizes increase, this is not a language's concern. However, a language must provide a mechanism by which the user can specify, or at least determine, the number of bytes in a particular data type, and this is typically where confusion occurs. The sizes of various data types on some common systems are listed below:
                    Fortran            C              Number of bytes
                    =================  ===========    ===============
  YMP/J90/C90
                    INTEGER            int                  8
                                       long                 8
                    REAL               float                8
                                       double               8
                    DOUBLE PRECISION   long double         16

  T3E
                    INTEGER            int                  8
                                       long                 8
                    REAL*4             float                4
                    REAL               double               8
                    DOUBLE PRECISION   long double     NOT AVAILABLE

  SGI Workstations
                    INTEGER            int                  4
                                       long                 4
                    REAL*4             float                4
                    REAL               double               8
                    DOUBLE PRECISION   long double     NOT AVAILABLE (double assumed)
When is the size of the type important?
 - If the code passes data between C and Fortran, the machine representation (including size in bytes) must be the same. On the YMP/C90/J90 systems, code may exchange data between C and Fortran routines as Fortran type REAL and C type float or double, yet this same code will get into trouble on the T3E, where C floats are only 4 bytes long.

 - If the code reads an unformatted data file: if 4-byte floats were written, then 4-byte floats must be read.

 - If the code requires 64 bits of numeric precision, for internal stability, precision of results, accuracy of results, or all three.

 - If the code is optimized for efficient memory usage by specifying 4-byte floats. In some cases, doubling the memory requirements (by doubling the precision) would exceed available memory.
Users occasionally ask about the sizes of various types on ARSC systems. The most satisfying answer is obtained by running the tests oneself. The codes below report the sizes (or kinds) of common types in Fortran and C, and can be run on any system.
C code

  /**********************************************************************/
  #include <stdio.h>

  int main(void)
  {
      printf(" sizes are int         %d\n", (int) sizeof(int));
      printf(" sizes are long        %d\n", (int) sizeof(long));
      printf(" sizes are float       %d\n", (int) sizeof(float));
      printf(" sizes are double      %d\n", (int) sizeof(double));
      printf(" sizes are long double %d\n", (int) sizeof(long double));
      return 0;
  }
  /**********************************************************************/

Fortran Code

  !
        program prog
        implicit none
        integer i
        real r
        double precision d
        logical j
  !     kind() returns the kind number, which on many systems
  !     corresponds to the size in bytes
        write(6,*) ' integers are         ', kind(i)
        write(6,*) ' reals are            ', kind(r)
        write(6,*) ' double precision are ', kind(d)
        write(6,*) ' logicals are         ', kind(j)
        end
  !
Fortran users might like to inspect the size of logicals on various systems to see how much space can be saved by using these in a program!
Side note:
There is often a compiler flag which allows the default size of selected types to be changed. For example, with the f90 compiler on the T3E, users can use -i 64 to specify that all integers are 64 bits, while -i 32 specifies 32 bits. This is most useful for programs with many large integer arrays, where there would otherwise be a significant increase in storage when moving from a workstation to a YMP/C90/J90/T3E. However, users must be careful that 32 bits provide sufficient resolution.
E-registers
[ One of our readers requested some hardcore information on E-registers. Having perused the literature, he claimed that, while often touted, E-registers were inadequately explained. This article, we hope, shines more light into this "black box."

The following is an excerpt from the article, "Performance of the CRAY T3E Multiprocessor," by Ed Anderson, Jeff Brooks, Charles Grassl, and Steve Scott of SGI/Cray, which appears on the SC97 CD-ROM (under Ed Anderson's talk) and is available on ARSC's ftp server, at:

  ftp://ftp.arsc.edu/pub/mpp/docs/T3E_processor_perf.ps.Z

The T3E benchmarker's guide has a little E-register discussion as well:

  ftp://ftp.arsc.edu/pub/mpp/docs/bmguide.ps.Z

Thanks to Jeff Brooks for his help. ]
3.3 E-registers
As mentioned in Section 2, all remote communication and synchronization in the CRAY T3E is done between E-registers and memory. E-registers provide two primary benefits over a more conventional load/store mechanism for accessing global memory: they extend the physical address space of the microprocessor to allow a very large, directly addressable memory, and they dramatically increase the degree of pipelining attainable for global memory requests. They also provide efficient non-unit-stride memory access and gather/scatter capabilities.
3.3.1 Description
E-registers employ a novel address translation mechanism. The microprocessor implements a cacheable memory space and a noncacheable I/O space, distinguished by the most significant bit of the physical address. Local memory loads and stores use cacheable memory space. Address translation takes place on the processor in the usual fashion, and physical addresses are passed to the memory. Memory-mapped registers, including the E-registers, use noncacheable I/O space.
There are two primary types of operations that can be performed on E-registers:

 - Direct loads and stores between E-registers and processor registers.

 - Global E-register operations.
Direct loads/stores are used to store operands into E-registers and load results from E-registers. Global E-register operations are used to transfer data to/from global (meaning remote or local) memory. The global operations to read memory into E-registers or write E-registers to memory are called Gets and Puts, respectively. Gets and Puts are triggered via I/O space stores, in which the address supplies the command and target/source E-register, and the data bus supplies the global memory address. Virtual-to-physical address translation thus occurs outside the processor in the system logic. Complete details are provided elsewhere [7].
Access to E-registers is implicitly synchronized by a set of state flags, one per E-register. A Get operation marks the target E-register(s) empty until the requested data arrives from memory, at which time they are marked full. A load from an E-register will stall, if the E-register is empty, until the data arrives. A Put from an E-register will also stall, if the E-register is empty, until the data becomes available. A global memory copy routine might perform Gets from the source area of memory into a block of E-registers and subsequently Put the data from the E-registers to the target memory area. The implicit state-flag synchronization protects against the read-after-write hazard in the E-registers.
Since there are a large number of E-registers (512 user + 128 system), Gets and Puts may be highly pipelined. A large number of Gets can be issued, for example, before the first result is accessed. The processor bus interface allows up to four properly aligned Get or Put commands to be issued in a two-cycle bus transaction, allowing 256 bytes' worth of Gets or Puts to be issued in 26.7 ns. This issue bandwidth is far greater than the sustainable data transfer bandwidth, so the processor is not a bottleneck. Data in a memory-to-memory transfer using E-registers does not cross the processor bus; it flows from memory into E-registers and out to memory again.
In addition to providing a highly pipelined memory interface, the E-registers provide special support for single-word load bandwidth. A variant of Gets and Puts moves a series of eight 64-bit words separated by an arbitrary stride between memory and E-registers. Thus, row accesses in Fortran, for example, can be fetched into contiguous E-registers using strided Gets. The resulting blocks of E-registers can then be loaded "broadside" into the processor, making significantly more efficient use of the bus than would be possible with normal cache-line fills, in which 7/8 of the loaded data would not be used. Single-word Gets and Puts are also provided, which can be used for gather/scatter operations.
When E-register operations are used to transfer data to/from local memory, the cache is bypassed. Non-cached data access may be preferred for copying blocks of data that are too large to fit in the cache, and for any large-stride or irregular access patterns that would bring in unwanted data if loaded a cache line at a time. The use of E-register operations for local data access can be suggested to the compiler by means of the loop-level CACHE_BYPASS directive, or it can be done directly by calls to Shmem library routines, specifying the local processor number instead of a remote processor number in the argument list.
Book Review: "Scientific Visualization: Overviews, Methodologies, Techniques"
"Scientific Visualization: Overviews, Methodologies, Techniques"
Editors: Gregory M. Nielson, Hans Hagen, Heinrich Müller
IEEE Computer Society (1997)
577 pages
ISBN 0-8186-7777-5
This book comprises 21 papers on state-of-the-art scientific visualization topics, written by 45 contributing authors. Contents are organized in three sections:
Part 1 - Overviews and Surveys
Part 2 - Frameworks and Methodologies
Part 3 - Techniques and Algorithms
Part 1 contains overview and survey papers on some important visualization topics. The first paper gives a concise historical overview of methods for multivariate, multidimensional data, from the pencil-and-paper era to the present. The authors (Pak Chung Wong and R. Daniel Bergeron) divide visualization history into 4 stages: (1) pre-1976, Searching; (2) 1977-1985, Awakening; (3) 1986-1991, Discovery; and (4) 1992-present, Elaboration and Assessment. The late 1980s were a period of huge growth in the field, fueled by advances in algorithms and software, and in supercomputing, computer graphics, and data-storage hardware. The remaining chapters in part 1 cover immersive investigation techniques, grid generation methods, computational steering, and methods for large-scale unsteady fluid flows.
Chapter 3, "Survey of Grid Generation Methodologies and Scientific Visualization Efforts" (Bernd Hamann and Robert J. Moorhead II), was of particular interest. The authors describe visualization work underway at ERC-CFS at Mississippi State University. They document the development and use of several interesting methods and tools for oceanographic data (OVIRT, SCIRT, AGP). They also discuss data compression issues for oceanographic data sets, describing a wavelet-based representation scheme that is fast, efficient, and has good data reconstruction qualities. The authors also discuss interactive, concurrent visualization techniques, including a variation of the FAST scheme adapted for their CFS codes.
Part 2 contains papers on formal foundations: integrated volume rendering and data analysis in wavelet space, intelligent design of color visualizations, engineering perceptually effective methods for abstract data, comparative visualization of flow features, systematic analysis for design, and controlled interpolation.
Part 3 contains in-depth review articles on both established and emerging algorithms and techniques. Topics include: feature visualization, flow surface probes for vector fields, particle tracing on 3D curvilinear grids, adaptive generation of volume contours, deformation tensor fields, a primer on reaction-diffusion patterns, relativistic strings, hierarchical representation of large data sets using wavelets, tools for triangulations and tetrahedrizations, and tools for computing tangent curves and topological graphs. Feature extraction and analysis and data compression/reconstruction are two important emerging techniques, particularly in the treatment of gigabyte and terabyte data sets.
In summary, the book is an excellent general introduction to state-of-the-art visualization methods for practicing scientists, as well as an excellent reference for programmers and computer scientists.
QuickTip Q & A
A: {{ Say you submit a single, 8-hour batch request using qsub, and the
      qsub script launches a series of short jobs. Is there a way to
      prevent one of the short jobs from using more than its share of
      the 8 hours? }}

   Assuming that each job is entitled to about the same amount of time,
   use NQS. In your qsub script, the option

     #QSUB -l mpp_t=hh:mm:ss

   specifies the limit on the total execution time of the entire NQS
   request (which is the sum of the times of the included processes).
   The option

     #QSUB -l p_mpp_t=hh:mm:ss

   however, limits the time of each process.

   The following example assumes that a normal run of "myprog" takes
   something less than 30 minutes. The user expects that none of the
   four processes started will reach the 30-minute per-process time
   limit, and that the entire request will complete in under 2 hours.
   However, he/she may have inherited the code, and may worry that it
   doesn't handle exceptions well and that one of the data sets might
   drop it into an infinite loop. Were this to happen, NQS would
   interrupt the offending process after 30 minutes, thus leaving the
   other processes with their share of the 2 hours.

     ######################################
     #QSUB -q mpp
     #QSUB -l mpp_p=8
     #QSUB -l mpp_t=2:00:00
     #QSUB -l p_mpp_t=30:00

     cd ${TMPDIR}/progs
     mpprun -n8 ./myprog data_set.1
     mpprun -n8 ./myprog data_set.2
     mpprun -n8 ./myprog data_set.3
     mpprun -n8 ./myprog data_set.4
     ######################################

Q: You want to compile your parallel application and then run it on N
   T3E application PEs. What's the lazy man's (or woman's) way to
   determine whether adequate memory exists on those N PEs to run the
   program? (Assume static allocation of all memory, and that you're
   compiling/running on the same system.)
[ Answers, questions, and tips graciously accepted. ]
Current Editors:

  Ed Kornkven     ARSC HPC Specialist            ph: 907-450-8669
  Kate Hedstrom   ARSC Oceanographic Specialist  ph: 907-450-8678

  Arctic Region Supercomputing Center
  University of Alaska Fairbanks
  PO Box 756020
  Fairbanks AK 99775-6020

Email Subscriptions:

  Subscribe to (or unsubscribe from) the email edition of the ARSC HPC
  Users' Newsletter.
Back issues of the ASCII email edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.