Single-Processor Tuning

3 Single-Processor Tuning

Now that we understand the architecture of the Origin2000 system, it is time to address "tuning," i.e., how do we get our programs to run fast on this architecture?

We can break tuning into 5 steps:

Just as in tuning, we'll now take things one step at a time.

3.1 Step 1: Get the Right Answers

If you don't have to get the right answer, you can make a program run arbitrarily fast, so the very first rule in tuning is that the program has to generate the correct results. While this may seem obvious, it is quite easy to forget to check the answers along the way as one is making performance improvements. A lot of frustration can be avoided by making only one change at a time and verifying that the program is still correct after each change.

3.1.1 Porting Issues

In addition to problems introduced by performance changes, correctness issues may be raised simply in porting a code from one system to another. If the program is written in a high level language and adheres to the language standard, it will generally port with no difficulties. This is particularly true for standard-compliant Fortran-77 programs, but there are some exceptions for C programs.

Irix 6.x is a 64-bit operating system. While programs may be compiled and run as if it were a 32-bit operating system, you often want to take advantage of the larger address space and compile in 64-bit mode. In this mode, both pointers and long integers are 64 bits in size, while int's remain 32 bits. If a program assumes that pointers and int's are both 32 bits long, run-time faults and/or incorrect answers can result. If this occurs, there are two remedies: either correct the program so that types of different sizes are not haphazardly intermixed, or ignore the coding problems and compile the program to run in 32-bit mode. The second solution is certainly expedient, and is even a perfectly reasonable approach for programs which will never grow beyond the confines of a 32-bit address space. But if one wishes a program to be able to exploit the advantages of the 64-bit operating system, correcting the size problems is the proper solution.

One selects which mode to compile in via the flags -32, -n32 and -64. The "mode" is more often called the ABI, or Application Binary Interface (see abi(5) for a detailed explanation of ABIs). -32 is the default ABI since it provides the highest level of portability. However, programs compiled using this ABI not only cannot take advantage of many of the extra capabilities of the 64-bit operating system, they also are limited to the MIPS I and MIPS II instruction sets. If one wishes to use the capabilities of the MIPS III or MIPS IV instructions sets, then one needs to compile with either -n32 or -64.

With the -n32 ABI, pointers and long ints are still 32 bits in size, so one retains the portability of the -32 ABI. -n32 differs from -32 in that instead of generating programs with just the MIPS I or MIPS II ISAs, it allows programs to be generated using either the MIPS III or MIPS IV ISAs, as selected by the -mips3 and -mips4 flags, respectively. With the -64 flag, pointers and long ints are 64 bits in length, so full advantage may be taken of the 64-bit operating system. As with -n32, either the MIPS III or MIPS IV ISAs may be chosen.

On Origin2000 (and all other non-R8000) systems, the default ABI and ISA are -32 -mips2; on R8000-based systems, the defaults are -64 -mips4. Because the defaults vary by system type, it is highly recommended that the desired ABI and ISA be explicitly declared for each compilation. The clearest way to do this is to specify these options directly in a makefile (see make(1) and pmake(1)) so they are used by all compilation commands:

ABI     = -n32
ISA     = -mips4
CFLAGS  = $(ABI) $(ISA) -Ofast

        .
        .
        .

.c.o:
        $(CC) $(CFLAGS) -c $<

An alternative approach is to to make use of a defaults specification file. This is a file which tells the compiler which ABI and ISA it should use if none have been specified in the compilation command. The system adminstrator may create the file /etc/compiler.defaults to set the defaults for a particular system. Individual users may create their own compiler.defaults files and then tell the compiler where to look for them by setting the environment variable COMPILER_DEFAULTS_PATH; the first such file found in this colon-separated path list will take precedence over the system-wide file. If a default specification file has been provided, its contents will override the compiler's built-in defaults (but not the options specified explicitly in the compilation command).

The contents of the compiler.defaults file must be the the following specifier:

-DEFAULT:[abi={32|n32|64}]:[isa={mips1|mips2|mips3|mips4}]:[proc={r4k|r5k|r8k|r10k}]

which allows one to indicate the default ABI and ISA, as well as which processor to compile for (more on this option later).

One final method is available for specifying the ABI. One can set the environment variable SGI_ABI to -32, -n32 or -64 to specify the default ABI. If this environment variable is set, it will override the setting in the defaults specification file; a command line setting, however, will take precedence over this environment variable.

Unless it is a requirement that the binary can be run on an R4000-based system (which supports MIPS III, but not MIPS IV), use one of the following ABI/ISA combinations

% cc -64 -mips4 ...
% cc -n32 -mips4 ...

for C, and similarly for the other languages. The same ABI (but not necessarily ISA) must be used for all objects comprising a single executable. Different executables may use different ABIs, however. For more information on these compiler flags, as well as the defaults specification file, see the man pages for cc(1), f77(1), CC(1), f90(1), ld(1).

3.1.1.1 Porting Issues (continued)

A program can fail to port for other reasons. Since different compilers handle standards violations differently, non-standard-compliant programs often generate different results when rehosted. An often seen example of this is a Fortran program which assumes that all variables are initialized to 0. While this initialization occurred on most older machines, it is not done in general on newer, stack-based architectures. When a program making such an assumption is rehosted to a new machine, it will almost certainly generate incorrect results.

To detect this error, one can compile with the -trapuv flag. This flag initializes all unitialized variables in a program with the value 0xFFFA5A5A. When this value is used as a floating point variable, it is treated as a floating point NaN and it will cause a floating point trap. When it is used as a pointer, an address or segmentation violation will most likely occur. By running the program with a debugger, such as dbx(1), the use of such an uninitialized variable can be pinpointed, and the program can then be fixed to do the proper initialization.

While fixing the program is the "right thing to do," it does require some effort by the programmer. An alternative, which requires almost no effort, is to use the -static flag. This flag causes all variables to be allocated from a static data area rather than the stack, and these variables will then be initialized to 0. There is a cost to the easy way out, however. Use of the -static flag generally hurts program performance; a 20% penalty is not uncommon. Another trick which can sometimes help in rehosting a program is to compile with the lowest optimization level, -O0 (i.e., no optimization), but this, too will significantly hurt performance.

Other types of porting failures can occur as well. The results generated on one computer may disagree with those from another computer because of differences in the floating point format, or compiler differences which cause the roundoff error to change. One needs to understand the particular program in order to determine if these differences are significant. In addition, some programs may fail to even compile on a new system if extensions are not supported or are provided with a different syntax; Fortran pointers provide a good example of this. In such situations, the code must be modified to be compatible with the new compiler. Similarly, if a program makes calls to a particular vendor's libraries, these calls will need to be replaced with equivalent entry points from the new vendor's libraries.

Finally, programs sometimes have mistakes in them which by pure luck had previously not caused problems. But when compiled on a new system, errors finally appear. Writing beyond the end of an array can show this behavior. Such mistakes need to be corrected. For f77 compilations, one can use the -C (or, equivalently, -check_bounds) flag to instruct the Fortran compiler to generate code for runtime subscript range checking. If an out-of-bounds array reference is made, the program will abort and display an error message indicating the illegal reference which was made and which line of code it occurred on:

Subscript out of range on file line 6, procedure MAIN_.
Attempt to access the 0-th element of variable ?.
*** Execution Terminated (99) ***

3.1.2 Symbolic Debuggers

When tracking down the source of incorrect results, a symbolic debugger is an invaluable tool. Two such debuggers are available for the Origin2000 system. dbx is the standard Unix debugger which has been available for many years. It uses a simple, teletype style interface and may be used to debug sequential programs or multi-processor programs, although the MP debugging is somewhat awkward. cvd, the WorkShop debugger, uses a graphical user interface and provides greater fuctionality and efficiency than dbx; it is also much more convenient for debugging multi-processor programs. For information on how to use these products, consult the dbx(1) reference page and the Developer Magic: Debugger User's Guide.

3.2 Step 2: Use Existing Tuned Code

The quickest and easiest way to improve the performance of a program is to link it with libraries already tuned for the target hardware. In addition to the standard libraries, such as libc, whose hardware-specific versions are automatically linked in with programs compiled and run on the Origin2000 system, there are two other libraries which may optionally be used which can provide substantial performance benefits.

3.2.1 libfastm

libfastm provides highly optimized versions of a subset of the routines found in libm, the standard math library. Optimized scalar versions of sin, cos, tan, exp, log, and pow for both single- and double-precision are provided. To link in libfastm, append -lfastm to the end of the link line:

cc -o program <objects> -lfastm

Separate versions of libfastm exist for the R10000, R8000 and R5000. In order to link with the correct version, the desired chip should be explicitly requested with the -r10000, -r8000 or -r5000 flags, respectively. When used on the compile line, these flags also instruct the compiler to generate code scheduled for the specified chip. For Origin2000 systems, the -r10000 flag should be used when compiling and linking.

In addition the scalar routines just mentioned, the following vector intrinsics are now part of the standard math library: vacos, vasin, vatan, vcos, vexp, vlog, vsin, vsqrt and vtan, plus the single precision versions whose names are derived from those of these double precision versions by appending an f. This library is linked in automatically by Fortran; the -lm flag should be used when linking with ld or the C compiler:

cc -o program <objects> -lm

More information on the math intrinsics can be found in math(3M) and libfastm(3M), and an description of when the compiler will use the vector math intrinsics automatically is presented later.

3.2.2 `complib.sgimath`

The other library which provides optimized code is complib.sgimath. This is a library of mathematical routines for carrying out linear algebra operations, FFTs and convolutions. It contains optimized implementations of the public domain libraries LINPACK, EISPACK, LAPACK, FFTPACK, and the Level-1, -2 and -3 BLAS, plus proprietary sparse solvers tuned specifically for Silicon Graphics, Inc. platforms.

Many of the complib routines have been parallelized, so there are separate sequential and parallel versions of the library. To link in the sequential version of complib.sgimath, add -lcomplib.sgimath to the link line:

f77 -o program <objects> -lcomplib.sgimath -lfastm

To link in the parallel version, use -lcomplib.sgimath_mp; in addition, the -mp flag must be used when linking in the parallel code:

f77 -mp -o program <objects> -lcomplib.sgimath_mp -lfastm

Since complib.sgimath uses some routines found in libfastm, libfastm should be linked in after complib.sgimath.

3.2.3 Documentation

For more information about these libraries, see the man pages for libfastm(3M), math(3M), complib(3F), blas(3F), lapack(3F), fft(3F), conv(3F), solvers(3F), psldu(3F), psldlt(3F), and pskysl(3F).

3.3 Step 3: Find Out Where to Tune

When confronted with a program composed of hundreds of modules and thousands of lines of code, it would require a very heroic, and very inefficient, effort to tune the entire program. Tuning needs to be concentrated on those few sections of the code where the work will pay off with the biggest gains in performance. These sections of code are identified with the help of a profiler.

3.3.1 Profiling Tools

The hardware counters in the R10000 CPU make it possible to profile the behavior of a program in many ways without modifying the code. The software tools are:

Perfex runs a program and reports exact counts of any two selected events from the R10000 counters. Alternatively, it time-multiplexes all 32 countable events and reports extrapolated totals of each. Perfex is useful for identifying what problem (for example, secondary data cache misses) is hurting the performance of your program the most.
Speedshop (actually, the ssrun command) runs a program while sampling the state of the program counter and stack, and writing the sample data to a file for later analysis. You select the time-base for the sampling and the particular type of information to be sampled. Speedshop is useful for locating where in your program the performance problems occur.
Prof analyzes a Speedshop data file and displays it in any of a variety of formats.
Dprof, like Speedshop, samples a program while it is executing, but dprof records memory access information as a histogram file. It is used to identify which data structures in the program are involved in performance problems.

You use these tools to find out what constrains the program, and which parts of the it consume the most time. Through the use of a combination of these tools, it is possible to identify most performance problems.

3.3.2 Hardware Counter Registers

The profiling tools depend for most of their features on the R10000's Performance Counter Registers. These on-chip registers can be programmed to count hardware events as they happen, for example machine cycles, instructions, branch predictions, floating point instructions, or cache misses.

There are only two Performance Counter registers. Each can be programmed to count machine cycles or one of 15 other events, for a total of 32 events that can be counted (30 of which are distinct). The specific events are summarized in the following table which can always be obtained by using the command perfex -h. A detailed discussion of the events is contained in this sidebar on counter definitions.

Event
Number Counter 0 Event Event
Number Counter 1 Event

0 Cycles 16 Cycles

1 Instructions issued to functional units 17 Instructions graduated

2 Memory data access (load, prefetch, sync, cacheop) issued 18 Memory data loads graduated

3 Memory stores issued 19 Memory data stores graduated

4 Store-conditionals issued 20 Store-conditionals graduated

5 Store-conditionals failed 21 Floating-point instructions graduated

6 Branches decoded 22 Quadwords written back from L1 cache

7 Quadwords written back from L2 cache 23 TLB refill exceptions

8 Correctable ECC errors on L2 cache 24 Branches mispredicted

9 L1 cache misses (instruction) 25 L1 cache misses (data)

10 L2 cache misses (instruction) 26 L2 cache misses (data)

11 L2 cache way mispredicted (instruction) 27 L2 cache way mispredicted (data)

12 External intervention requests 28 External intervention request hits in L2 cache

13 External invalidate requests 29 External invalidate request hits in L2 cache

14 Instructions done (formerly, virtual coherency condition) 30 Stores, or prefetches with store hint, to CleanExclusive L2 cache blocks

15 Instructions graduated 31 Stores, or prefetches with store hint, to Shared L2 cache blocks.

Event Number	Counter 0 Event	Event Number	Counter 1 Event
0	Cycles	16	Cycles
1	Instructions issued to functional units	17	Instructions graduated
2	Memory data access (load, prefetch, sync, cacheop) issued	18	Memory data loads graduated
3	Memory stores issued	19	Memory data stores graduated
4	Store-conditionals issued	20	Store-conditionals graduated
5	Store-conditionals failed	21	Floating-point instructions graduated
6	Branches decoded	22	Quadwords written back from L1 cache
7	Quadwords written back from L2 cache	23	TLB refill exceptions
8	Correctable ECC errors on L2 cache	24	Branches mispredicted
9	L1 cache misses (instruction)	25	L1 cache misses (data)
10	L2 cache misses (instruction)	26	L2 cache misses (data)
11	L2 cache way mispredicted (instruction)	27	L2 cache way mispredicted (data)
12	External intervention requests	28	External intervention request hits in L2 cache
13	External invalidate requests	29	External invalidate request hits in L2 cache
14	Instructions done (formerly, virtual coherency condition)	30	Stores, or prefetches with store hint, to CleanExclusive L2 cache blocks
15	Instructions graduated	31	Stores, or prefetches with store hint, to Shared L2 cache blocks.

The counters are 64-bit integers. When a counter overflows, a hardware trap occurs. The kernel can preload a counter with (2^64 - n) to cause a trap after n counts occur. The profiling tools use this capability. For example, the command ssrun -gi_hwc programs the graduated instruction counter (event 17) to overflow every 32K counts. Each time the counter overflows, ssrun samples the program counter and stack state of the subject program.

The reference page r10k_counters(5) gives detailed information on how the counters can be accessed through the /proc interface. This is the interface used by the profiling tools. The interface hides the division of events between only two registers, and allows the software to view the counters as a single set of 32, 64-bit counters. The operating system time-multiplexes the active counters between the events being counted. This requres sampling and scaling, and can introduce some error when more than two events are counted.

In general, it is better to access the counters through the profiling tools. A program that uses the counter interface directly cannot be profiled using perfex, nor using ssrun for any experiment that depends on counters. When a program must access counter values directly, the simplest interface is through libperfex, documented in the libperfex(3) reference page.

3.3.3 How to Do Performance Analysis Using perfex

The simplest profiling tool is perfex, documented in the perfex(1) reference page. It runs a subject program and records data about the run, similar to timex(1):

% perfex [options] command [arguments]

The subject program and its arguments are given. perfex sets up the counter interface and forks the subject program. When the program ends, perfex writes counter data to standard output.

perfex gathers its information with no modifications to your existing program. While this is very convenient, the data obtained come from the entire run of the program. If you only want to profile a particular section of the program, you will need to use the library interface to perfex, libperfex(3). To use this interface, you insert into your program's source code a call to initiate counting and another to terminate it; a third call is used to print the counts gathered. The program must then be linked with the libperfex library:

% f77|cc -o program <objects> -lperfex

Since you can use SpeedShop to see where in a program various event counts come from, we will not describe libperfex in detail. More information can be found in its reference page.

3.3.3.1 Absolute Counts of One or Two Events

Using perfex options you specify what is to be counted. You can specify one or two countable events. In this case, the counts are absolute and accurate. For example, the command

% perfex -e 15 -e 21 adi2

runs the subject program and reports the exact counts of graduated instructions and graduated floating point instructions. You use this mode when you want to explore specific points of program behavior.

3.3.3.2 Statistical Counts of All Events

When you specify option -a (all events), perfex multiplexes all 32 events over the program run. Each count is active 1/16 of the time, and then scaled by 16. The resulting counts have some statistical error. The error is small (and the counts are sufficiently repeatable) provided that the subject program runs in a stable execution mode for a number of seconds. When the program runs for a short time, or shifts between radically different regimes of instruction or data use, the counts are less dependable and less repeatable. Nevertheless, perfex -a usually gives a good overview of program operation.

Here is the perfex command line and output applied to a sample program called adi2:

% perfex -a -x adi2
WARNING: Multiplexing events to project totals--inaccuracy possible.
 Time:      7.990 seconds
 Checksum:  5.6160428338E+06
 0 Cycles......................................................   1645481936
 1 Issued instructions.........................................    677976352
 2 Issued loads................................................    111412576
 3 Issued stores...............................................     45085648
 4 Issued store conditionals...................................            0
 5 Failed store conditionals...................................            0
 6 Decoded branches............................................     52196528
 7 Quadwords written back from scache..........................     61794304
 8 Correctable scache data array ECC errors....................            0
 9 Primary instruction cache misses............................         8560
10 Secondary instruction cache misses..........................          304
11 Instruction misprediction from scache way prediction table..          272
12 External interventions......................................         6144
13 External invalidations......................................        10032
14 Virtual coherency conditions................................            0
15 Graduated instructions......................................    371427616
16 Cycles......................................................   1645481936
17 Graduated instructions......................................    400535904
18 Graduated loads.............................................     90474112
19 Graduated stores............................................     34776112
20 Graduated store conditionals................................            0
21 Graduated floating point instructions.......................     28292480
22 Quadwords written back from primary data cache..............     32386400
23 TLB misses..................................................      5687456
24 Mispredicted branches.......................................       410064
25 Primary data cache misses...................................     16330160
26 Secondary data cache misses.................................      7708944
27 Data misprediction from scache way prediction table.........       663648
28 External intervention hits in scache........................         6144
29 External invalidation hits in scache........................         6864
30 Store/prefetch exclusive to clean block in scache...........      7582256
31 Store/prefetch exclusive to shared block in scache..........         8144

The -x option requests that perfex also gather counts for kernel code that handles exceptions, so the work done by the OS to handle TLB misses is included in these counts.

3.3.3.3 Analytical Output with perfex -y

The raw event counts are interesting but it is more useful to convert them to elapsed time. Some time estimates are simple, for example dividing the cycle count by the machine clock rate gives elapsed run time (1645481936/195MHZ = 8.44 seconds). Other events are not so simple, and can only be stated in terms of a range of times. For example, the time to handle a primary cache miss varies depending on whether the needed data are in the secondary cache, or in memory, or in the cache of another CPU. Analysis of this kind can be requested using perfex -a -x -y.

When you use -a , -x and -y, perfex collects and displays all event counts, but it also displays a report of estimated times based on the counts. Here is an example, again of the program adi2:

% perfex -a -x -y adi2
WARNING: Multiplexing events to project totals--inaccuracy possible.
 Time:      7.996 seconds
 Checksum:  5.6160428338E+06

                                                                    Based on 196 MHz IP27
                                                                                  Typical      Minimum      Maximum
   Event Counter Name                                          Counter Value   Time (sec)   Time (sec)   Time (sec)
===================================================================================================================
 0 Cycles......................................................   1639802080     8.366337     8.366337     8.366337
16 Cycles......................................................   1639802080     8.366337     8.366337     8.366337
26 Secondary data cache misses.................................      7736432     2.920580     1.909429     3.248837
23 TLB misses..................................................      5693808     1.978017     1.978017     1.978017
 7 Quadwords written back from scache..........................     61712384     1.973562     1.305834     1.973562
25 Primary data cache misses...................................     16368384     0.752445     0.235504     0.752445
22 Quadwords written back from primary data cache..............     32385280     0.636139     0.518825     0.735278
 2 Issued loads................................................    109918560     0.560809     0.560809     0.560809
18 Graduated loads.............................................     88890736     0.453524     0.453524     0.453524
 6 Decoded branches............................................     52497360     0.267844     0.267844     0.267844
 3 Issued stores...............................................     43923616     0.224100     0.224100     0.224100
19 Graduated stores............................................     33430240     0.170562     0.170562     0.170562
21 Graduated floating point instructions.......................     28371152     0.144751     0.072375     7.527040
30 Store/prefetch exclusive to clean block in scache...........      7545984     0.038500     0.038500     0.038500
24 Mispredicted branches.......................................       417440     0.003024     0.001363     0.011118
 9 Primary instruction cache misses............................         8272     0.000761     0.000238     0.000761
10 Secondary instruction cache misses..........................          768     0.000290     0.000190     0.000323
31 Store/prefetch exclusive to shared block in scache..........        15168     0.000077     0.000077     0.000077
 1 Issued instructions.........................................    673476960     0.000000     0.000000     3.436107
 4 Issued store conditionals...................................            0     0.000000     0.000000     0.000000
 5 Failed store conditionals...................................            0     0.000000     0.000000     0.000000
 8 Correctable scache data array ECC errors....................            0     0.000000     0.000000     0.000000
11 Instruction misprediction from scache way prediction table..          432     0.000000     0.000000     0.000002
12 External interventions......................................         6288     0.000000     0.000000     0.000000
13 External invalidations......................................         9360     0.000000     0.000000     0.000000
14 Virtual coherency conditions................................            0     0.000000     0.000000     0.000000
15 Graduated instructions......................................    364303776     0.000000     0.000000     1.858693
17 Graduated instructions......................................    392675440     0.000000     0.000000     2.003446
20 Graduated store conditionals................................            0     0.000000     0.000000     0.000000
27 Data misprediction from scache way prediction table.........       679120     0.000000     0.000000     0.003465
28 External intervention hits in scache........................         6288     0.000000     0.000000     0.000000
29 External invalidation hits in scache........................         5952     0.000000     0.000000     0.000000

Statistics
=========================================================================================
Graduated instructions/cycle................................................     0.222163
Graduated floating point instructions/cycle.................................     0.017302
Graduated loads & stores/cycle..............................................     0.074595
Graduated loads & stores/floating point instruction.........................     5.422486
Mispredicted branches/Decoded branches......................................     0.007952
Graduated loads/Issued loads................................................     0.808696
Graduated stores/Issued stores..............................................     0.761099
Data mispredict/Data scache hits............................................     0.078675
Instruction mispredict/Instruction scache hits..............................     0.057569
L1 Cache Line Reuse.........................................................     6.473003
L2 Cache Line Reuse.........................................................     1.115754
L1 Data Cache Hit Rate......................................................     0.866185
L2 Data Cache Hit Rate......................................................     0.527355
Time accessing memory/Total time............................................     0.750045
L1--L2 bandwidth used (MB/s, average per process)...........................   124.541093
Memory bandwidth used (MB/s, average per process)...........................   236.383187
MFLOPS (average per process)................................................     3.391108

"Maximum," "minimum," and "typical" time cost estimates are reported. Each is obtained by consulting an internal table which holds the "maximum," "minimum," and "typical" costs for each event, and multiplying this cost by the count for the event. Event costs are usually measured in terms of machine cycles, and so the cost of an event generally depends on the clock speed of the processor, which is also reported in the output.

The "maximum" value contained in the table corresponds to the worst case cost of a single occurrence of the event. Sometimes this can be a very pessimistic estimate. For example, the maximum cost for graduated floating point instructions assumes that all such instructions are double precision reciprocal square roots, since that is the most costly R10000 floating point instruction.

Due to the latency-hiding capabilities of the R10000, the "minimum" cost of virtually any event could be zero since most events can be overlapped with other operations. To avoid simply reporting minimum costs of 0, which would be of no practical use, the "minimum" time reported by perfex -y corresponds to the best case cost of a single occurrence of the event. The "best case" cost is obtained by running the maximum number of simultaneous occurrences of that event and averaging the cost. For example, two floating point instructions can complete per cycle, so the best case cost is 0.5 cycles per floating point instruction.

The "typical" cost falls somewhere between "minimum" and maximum" and is meant to correspond to the cost one would expect to see in average programs.

perfex -y prints the event counts and associated cost estimates sorted from most costly to least costly. While resembling a profiling output, this is not a true profile. The event costs reported are only estimates. Furthermore, since events do overlap with each other, the sum of the estimated times will usually exceed the program's run time. This output should only be used to identify which events are responsible for significant portions of the program's run time, and to get a rough idea of what those costs might be.

For the example above, it is clear that the program is spending a significant fraction of its time in handling secondary cache and TLB misses. To make a significant improvement in the runtine of this program, the tuning measures need to be concentrated on reducing those cache misses.

In addition the event counts and cost estimates, perfex -y also reports a number of statistics derived from the typical costs. The meaning of many of the statistics is self-evident, for example, Graduated instructions/cycle. Below we list those statistics whose definitions require more explanation:

Data mispredict/Data scache hits: This is the ratio of the counts for "Data misprediction from scache way prediction table" and "Secondary data cache misses."
Instruction mispredict/Instruction scache hits: This is the ratio of the counts for "Instruction misprediction from scache way prediction table" and "Secondary instruction cache misses."
L1 Cache Line Reuse: The is the number of times, on average, that a primary data cache line is used after it has been moved into the cache. It is calculated as "graduated loads" plus "graduated stores" minus "primary data cache misses," all divided by "primary data cache misses."
L2 Cache Line Reuse: The is the number of times, on average, that a secondary data cache line is used after it has been moved into the cache. It is calculated as "primary data cache misses" minus "secondary data cache misses," all divided by "secondary data cache misses."
L1 Data Cache Hit Rate: This is the fraction of data accesses which are satisfied from a cache line already resident in the primary data cache. It is calculated as 1.0 - ("primary data cache misses" divided by the sum of "graduated loads" and "graduated stores").
L2 Data Cache Hit Rate: This is the fraction of data accesses which are satisfied from a cache line already resident in the secondary data cache. It is calculated as 1.0 - ("secondary data cache misses" divided by; "primary data cache misses").
Time accessing memory/Total time: This is the sum of the typical costs of "graduated loads," "graduated stores," "primary data cache misses," "secondary data cache misses," and "TLB misses," divided by the total program run time. The total program run time is calculated by multiplying "cycles" by the time per cycle (inverse of the processor's clock speed).
L1--L2 bandwidth used (MB/s, average per process): This is the amount of data moved between the primary and secondary data caches, divided by the total program run time. The anmount of data moved is calculated as the sum of the number of "primary data cache misses" multiplied by the primary cache line size and the number of "quadwords written back from primary data cache" multiplied by the size of a quadword (16 bytes). For multiprocess programs, the resulting figure is a per-process average since the; counts measured by perfex are aggregates of the counts for all the threads. One needs to multiply by the number of threads to get the total program bandwidth.
Memory bandwidth used (MB/s, average per process): This is the amount of data moved between the secondary data cache and main memory, divided by the total program run time. The anmount of data moved is calculated as the sum of the number of "secondary data cache misses" multiplied by the secondary cache line size and; the number of "quadwords written back from secondary data cache" multiplied by the size of a quadword (16 bytes). For multi-processor programs, the resulting figure is a per-process average since the counts measured by perfex are aggregates of the counts for all the threads. One needs to multiply by the number of threads to get the total program bandwidth.
MFLOPS (MB/s, average per process): This is the ratio of the "graduated floating point instructions" and the total program run time. Note that while a multiply-add carries out two floating point operations, it only counts as one; instruction, so this statistic may underestimate the number of floating point operations per second. For multi-processor programs, the resulting figure is a per-process average since the counts measured by perfex are aggregates of the counts for all the threads. One; needs to multiply by the number of threads to get the total program rate.

These statistics can give you a quick way to identify performance problems in your program. For example, the cache hit rate statistics tell you how cache-friendly your program is. Since a secondary cache miss is much more expensive than a cache hit, the L2 Data Cache Hit Rate needs to be quite close to 1 for the program to not be paying a large penalty for the cache misses. Values above about 0.96 indicate good cache performance. Note that for the above example, the rate is 0.53, further confirmation of the cache problems in this program.

3.3.3.4 Analysis with Local Tools

There is nothing magical about the output of perfex, and you are free to write your analysis tools. For example, this awk script processes the output of perfex -a and reorders it, inserting derived ratios after the absolute count lines. You can write other simple scripts that extract the counter values of particular interest in a given situation.

3.3.4 Using SpeedShop

The purpose of profiling is to find out exactly where a program is spending its time; that is, in precisely which procedures or lines of code. Then you can concentrate your efforts on the (usually small) areas of code where there is the most to be gained.

You do profiling using the SpeedShop package. It supports these methods of profiling:

Sampling, in which the subject program is frequently interrupted; the program counter (PC) and stack are recorded on each interruption. The more frequently the PC is found in a particular procedure, the more execution time that procedure costs.
SpeedShop can sample based on a variety of time bases: the system timer, or any of several of the R10000 Performance Counters.
Ideal counting, in which a copy of the subject program binary is modified with trap instructions at the end of each basic block. During execution, the exact number of uses of each basic block is counted.
Exception trace, not really a profiling method, records only floating-point exceptions and their locations.

The SpeedShop package has three parts:

ssrun performs experiments and collects data.
prof processes data and prepares reports.
the ssapi interface allows you to insert caliper points into a program so as to profile specific sections of code or phases of execution.

These programs are documented in the following reference pages, which you should print and save for frequent reference:

speedshop(1) documents the types of experiments, as well as a number of environment variables you can set.
ssrun(1) documents specific options of ssrun. You need both speedshop(1) and ssrun(1) to run an experiment.
prof(1) documents the report types and the options you use to get them.
ssapi(3) documents the three library calls you can use.

3.3.4.1 PC Sampling Profiling

The accuracy of sampling depends on the time base that sets the sampling interval. In each case, the time base is the independent variable, and the program state is the dependent variable. You select from the sampling methods summarized in this table:

ssrun
Option Time Base Comments

-usertime 30ms timer Fairly coarse resolution; experiment runs quickly and output file is small; some bugs noted in speedshop(1).

-pcsamp[x]
-fpcsamp[x] 10ms timer
1ms timer Moderately coarse resolution; functions that cause cache misses or page faults are emphasized. Suffix x for 32-bit counts.

-gi_hwc
-fgi_hwc 32771 insts
6553 insts Fine-grain resolution based on graduated instructions. Emphasizes functions that burn a lot of instructions.

-cy_hwc
-fcy_hwc 16411 clocks
3779 clocks Fine-grain resolution based on elapsed cycles. Emphasizes functions with cache misses and mispredicted branches.

-ic_hwc
-fic_hwc 2053 icache miss
419 icache miss Emphasizes code that doesn't fit in L1 cache.

-isc_hwc
-fisc_hwc 131 scache miss
29 scache miss Emphasizes code that doesn't fit in L2 cache. Should be coarse-grained measure.

-dc_hwc
-fdc_hwc 2053 dcache miss
419 dcache miss Emphasizes code that causes L1 cache data misses.

-dsc_hwc
-fdsc_hwc 131 scache miss
29 scache miss Emphasizes code that causes L2 cache data misses.

-tlb_hwc
-ftlb_hwc 257 TLB miss
53 TLB miss Emphasizes code that causes page faults.

-gfp_hwc
-fgfp_hwc 32771 fp insts
6553 fp insts Emphasizes code that performs heavy FP calculation.

-prof_hwc user-set Hardware counter and overflow value from environment variables.

ssrun Option	Time Base	Comments
-usertime	30ms timer	Fairly coarse resolution; experiment runs quickly and output file is small; some bugs noted in speedshop(1).
-pcsamp[x] -fpcsamp[x]	10ms timer 1ms timer	Moderately coarse resolution; functions that cause cache misses or page faults are emphasized. Suffix x for 32-bit counts.
-gi_hwc -fgi_hwc	32771 insts 6553 insts	Fine-grain resolution based on graduated instructions. Emphasizes functions that burn a lot of instructions.
-cy_hwc -fcy_hwc	16411 clocks 3779 clocks	Fine-grain resolution based on elapsed cycles. Emphasizes functions with cache misses and mispredicted branches.
-ic_hwc -fic_hwc	2053 icache miss 419 icache miss	Emphasizes code that doesn't fit in L1 cache.
-isc_hwc -fisc_hwc	131 scache miss 29 scache miss	Emphasizes code that doesn't fit in L2 cache. Should be coarse-grained measure.
-dc_hwc -fdc_hwc	2053 dcache miss 419 dcache miss	Emphasizes code that causes L1 cache data misses.
-dsc_hwc -fdsc_hwc	131 scache miss 29 scache miss	Emphasizes code that causes L2 cache data misses.
-tlb_hwc -ftlb_hwc	257 TLB miss 53 TLB miss	Emphasizes code that causes page faults.
-gfp_hwc -fgfp_hwc	32771 fp insts 6553 fp insts	Emphasizes code that performs heavy FP calculation.
-prof_hwc	user-set	Hardware counter and overflow value from environment variables.

Each time base finds the program PC more often in the code that consumes the most units of that time base:

The time bases that reflect actual elapsed time (-usertime, -pcsamp, -cy_hwc) find the PC more often in the code where the program spends elapsed time. The time may be spent in that code because it is executed a lot, or it might be spent there because those instructions are processed slowly owing to cache misses, memory contention, or failed branch prediction. Use these to get an overview of the program and to find major trouble spots.
The time bases that reflect instruction counts (-gi_hwc, -gfp_hwc) find the PC more often in the code that actually performs the most instructions. Use these to find the code that could benefit most from algorithmic changes.
The time bases that reflect data access (-dc_hwc, -sc_hwc, -tlb_hwc) find the PC more often in the code that has to wait for its data to be brought in from another memory level. Use these to find memory access problems.
The time bases that reflect code access (-ic_hwc, -isc_hwc) find the PC more often in the code that has to be fetched from memory when it is called. Use these to pinpoint functions that could be reorganized for better locality, or to see when automatic inlining has gone too far.

It is easy to perform an experiment. Here is the application of an experiment to program adi2:

% ssrun -fpcsamp adi2
 Time:      7.619 seconds
 Checksum:  5.6160428338E+06

The output file of samples is left in a file with the default name of ./command.experiment.pid. In this case the name was adi2.fpcsamp.4885. However, it is often more convenient to dictate the name of the output file. You can do this by putting the desired filename and directory in environment variables. Using this csh script you can run an experiment, passing the output directory and filename on the command line, for example

% ssruno -d /var/tmp -o adi2.cy -cy_hwc adi2
ssrun -cy_hwc adi2 ..................................
 Time:      9.644 seconds
 Checksum:  5.6160428338E+06
.................................. ssrun ends.
-rw-r--r--    1 guest    guest      18480 Dec 17 16:25 /var/tmp/adi2.cy

3.3.4.2 Using `prof`

Regardless of which time base you use for sampling, you display the result using prof. By default, prof displays a list of procedures ordered from the one with the most samples to the least:

% prof adi2.fpcsamp.4885
-------------------------------------------------------------------------------
Profile listing generated Sat Jan  4 10:28:11 1997
    with:       prof adi2.fpcsamp.4885 
-------------------------------------------------------------------------------

samples   time    CPU    FPU   Clock   N-cpu  S-interval Countsize
   8574   8.6s R10000 R10010 196.0MHz   1      1.0ms     2(bytes)

Each sample covers 4 bytes for every  1.0ms ( 0.01% of 8.5740s)


-------------------------------------------------------------------------------
  -p[rocedures] using pc-sampling.
  Sorted in descending order by the number of samples in each procedure.
  Unexecuted procedures are excluded.
-------------------------------------------------------------------------------

samples   time(%)      cum time(%)      procedure (dso:file)

   6688   6.7s( 78.0)  6.7s( 78.0)         zsweep (adi2:adi2.f)
    671  0.67s(  7.8)  7.4s( 85.8)         xsweep (adi2:adi2.f)
    662  0.66s(  7.7)    8s( 93.6)         ysweep (adi2:adi2.f)
    208  0.21s(  2.4)  8.2s( 96.0)       fake_adi (adi2:adi2.f)
    178  0.18s(  2.1)  8.4s( 98.1)         irand_ (/usr/lib32/libftn.so:../../libF77/rand_.c)
    166  0.17s(  1.9)  8.6s(100.0)          rand_ (/usr/lib32/libftn.so:../../libF77/rand_.c)
      1 0.001s(  0.0)  8.6s(100.0)      __oserror (/usr/lib32/libc.so.1:oserror.c)

   8574   8.6s(100.0)  8.6s(100.0)          TOTAL

From this profile, it is clear that our attention needs to be focused on the routine zsweep, since it consumes almost 80% of the runtime of this program.

For finer detail, use the -heavy option. This supplements the list of procedures with a list of individual source line numbers, ordered by frequency:

-------------------------------------------------------------------------------
  -h[eavy] using pc-sampling.
     Sorted in descending order by the number of samples in each line.
     Lines with no samples are excluded.
-------------------------------------------------------------------------------

samples   time(%)      cum time(%)      procedure (file:line)

   3405   3.4s( 39.7)  3.4s( 39.7)         zsweep (adi2.f:122)
   3226   3.2s( 37.6)  6.6s( 77.3)         zsweep (adi2.f:126)
    425  0.42s(  5.0)  7.1s( 82.3)         xsweep (adi2.f:80)
    387  0.39s(  4.5)  7.4s( 86.8)         ysweep (adi2.f:101)
    273  0.27s(  3.2)  7.7s( 90.0)         ysweep (adi2.f:105)
    246  0.25s(  2.9)    8s( 92.9)         xsweep (adi2.f:84)
    167  0.17s(  1.9)  8.1s( 94.8)         irand_ (../../libF77/rand_.c:62)
    163  0.16s(  1.9)  8.3s( 96.7)       fake_adi (adi2.f:18)
    160  0.16s(  1.9)  8.5s( 98.6)          rand_ (../../libF77/rand_.c:69)
     45 0.045s(  0.5)  8.5s( 99.1)       fake_adi (adi2.f:59)
     32 0.032s(  0.4)  8.5s( 99.5)         zsweep (adi2.f:113)
     21 0.021s(  0.2)  8.5s( 99.7)         zsweep (adi2.f:121)
     11 0.011s(  0.1)  8.6s( 99.8)         irand_ (../../libF77/rand_.c:63)
      6 0.006s(  0.1)  8.6s( 99.9)          rand_ (../../libF77/rand_.c:67)
      4 0.004s(  0.0)  8.6s(100.0)         zsweep (adi2.f:125)
      1 0.001s(  0.0)  8.6s(100.0)         ysweep (adi2.f:104)
      1 0.001s(  0.0)  8.6s(100.0)         ysweep (adi2.f:100)
      1 0.001s(  0.0)  8.6s(100.0)      __oserror (oserror.c:127)

   8574   8.6s(100.0)  8.6s(100.0)          TOTAL

From this listing it is clear that lines 122 and 126 warrant further inspection. Even finer detail can be obtained with the -source option, which lists the source code and disassembled machine code, indicating sample hits on specific instructions.

3.3.4.3 Ideal Time Profiling

The other type of profiling is called ideal time, or basic block, profiling. Basic block is a compiler term for a section of code which has only one entrance and one exit. Any program can be decomposed into basic blocks.

To obtain an ideal profile, ssrun makes a copy of the executable program and modifies the copy to contain code that records the entry to each basic block. Not only the executable itself, but also all DSOs that it links to are copied and instrumented. The instrumented executable and libraries are statically linked and run:

% ssrun -ideal adi2
Beginning libraries
        /usr/lib32/libssrt.so
        /usr/lib32/libss.so
        /usr/lib32/libfastm.so
        /usr/lib32/libftn.so
        /usr/lib32/libm.so
        /usr/lib32/libc.so.1
Ending libraries, beginning "adi2"
 Time:      8.291 seconds
 Checksum:  5.6160428338E+06

The number of times each basic block was encountered is recorded. The output data file is displayed using prof, just as for a sampled run. The report ranks source and library procedures from most-used to least:

% prof adi2.ideal.4920

  Prof run at: Sat Jan  4 10:34:06 1997
 Command line: prof adi2.ideal.4920 


          285898739: Total number of cycles
           1.45867s: Total execution time
          285898739: Total number of instructions executed
              1.000: Ratio of cycles / instruction
                196: Clock rate in MHz
             R10000: Target processor modelled


---------------------------------------------------------
Procedures sorted in descending order of cycles executed.
Unexecuted procedures are not listed. Procedures
beginning with *DF* are dummy functions and represent
init, fini and stub sections.
---------------------------------------------------------

        cycles(%)  cum %     secs    instrns    calls procedure(dso:file)

  68026368(23.79)  23.79     0.35   68026368 32768 xsweep(adi2.pixie:adi2.f)
  68026368(23.79)  47.59     0.35   68026368 32768 ysweep(adi2.pixie:adi2.f)
  68026368(23.79)  71.38     0.35   68026368 32768 zsweep(adi2.pixie:adi2.f)
  35651584(12.47)  83.85     0.18   35651584 2097152 rand_(./libftn.so.pixn32:../../libF77/rand_.c)
  27262976( 9.54)  93.39     0.14   27262976 2097152 irand_(./libftn.so.pixn32:../../libF77/rand_.c)
  18874113( 6.60)  99.99     0.10   18874113     1 fake_adi(adi2.pixie:adi2.f)
     11508( 0.00)  99.99     0.00      11508     5 memset(./libc.so.1.pixn32:/slayer_xlv0/ficussg-nov05/work/irix/lib/libc/libc_n32_M4/strings/bzero.s)
      3101( 0.00)  99.99     0.00       3101    55 __flsbuf(./libc.so.1.pixn32:_flsbuf.c)
      2446( 0.00) 100.00     0.00       2446    42 x_putc(./libftn.so.pixn32:../../libI77/wsfe.c)
      1234( 0.00) 100.00     0.00       1234     2 x_wEND(./libftn.so.pixn32:../../libI77/wsfe.c)
      1047( 0.00) 100.00     0.00       1047     1 f_exit(./libftn.so.pixn32:../../libI77/close.c)
      1005( 0.00) 100.00     0.00       1005     5 fflush(./libc.so.1.pixn32:flush.c)
       639( 0.00) 100.00     0.00        639     4 do_fio64_mp(./libftn.so.pixn32:../../libI77/fmt.c)
       566( 0.00) 100.00     0.00        566     3 wrt_AP(./libftn.so.pixn32:../../libI77/wrtfmt.c)
       495( 0.00) 100.00     0.00        495     6 map_luno(./libftn.so.pixn32:../../libI77/util.c)
       458( 0.00) 100.00     0.00        458    14 op_gen(./libftn.so.pixn32:../../libI77/fmt.c)
       440( 0.00) 100.00     0.00        440     9 gt_num(./libftn.so.pixn32:../../libI77/fmt.c)
       414( 0.00) 100.00     0.00        414     1 getenv(./libc.so.1.pixn32:getenv.c)
                                    .
                                    .
                                    .

The -heavy option adds a list of source lines, sorted by their consumption of ideal instruction cycles.

An ideal profile shows exactly and repeatedly which statements are most executed, and gives you a very exact view of the algorithmic complexity of the program. An ideal profile does not necessarily reflect where a program spends its time, since it cannot take cache and TLB misses into account. Consequently, the results of the ideal profile are startlingly different than that of the PC sampling profile. These ideal results indicate that zsweep should take exactly the same amount of runtime as ysweep and xsweep. These differences can be used to infer where cache performance issues exist. On machines without the R10000's hardware profiling registers, such comparisons are the only profiling method available to locate cache problems.

3.3.4.3.1 Operation Counts

Since ideal profiling counts the instructions executed by the program, it can provide all sorts of interesting information about the program. Already printed in the standard prof output are counts of how many times each subroutine is called. In addition, you may use the -op option to prof to get a listing detailing the counts of all instructions in the program. In particular, this will provide an exact count of the floating point operations executed:

% prof -op adi2.ideal.4920

  Prof run at: Wed Jan 15 14:42:54 1997
 Command line: prof -op adi2.ideal.4920


          285898739: Total number of cycles
           1.45867s: Total execution time
          285898739: Total number of instructions executed
              1.000: Ratio of cycles / instruction
                196: Clock rate in MHz
             R10000: Target processor modelled


---------------------------------------------------------
pixstats summary
---------------------------------------------------------
           56590456: Floating point operations (38.796 Mflops @ 196 MHz)
          105500230: Integer operations (72.3265 M intops @ 196 MHz)

                             .
                             .
                             .

Note that this is different than what one obtains using perfex. The R10000 counter #21 counts floating point instructions, not floating point operations. As a result, in a program which executes a lot of multiply-add instructions --- each of which carries out 2 floating point operations --- perfex's MFLOPS statistic can be off by a factor of almost 2. Since prof -op records all instructions executed, it counts each multiply-add instruction as two floating point operations, thus providing the correct tally. The Mlops figure it calculates, however, is based on the ideal time, so to calculate floating point performance, you should divide the number of floating point operations counted by prof -op by wall clock time.

Either method of profiling, PC sampling or ideal, can be applied to multi-processor runs just as easily as it is applied to single-processor runs: each thread of an application maintains its own histogram, and the histograms may be printed individually or merged in any combination and printed as one profile.

3.3.4.4 Gprof

One of the limitations of the prof output for either PC sampling or ideal time is that the information reported contains no information about the call hierarchy. That is, if the routine zsweep in the above example were called from two different locations in the program, we would have no idea how much time results from the call at each location; we would only know the total time spent in zsweep. If we knew that, say, the first location was responsible for the majority of the time, this could affect how we tune the program. For example, we might try inlining the call into the first location, but not bother with the second. Or, if we wanted to parallelize the program, knowing that the first location is where the majority of the time is spent, we might consider parallelizing the calls to zsweep there rather than trying to parallelize the zsweep routine itself.

SpeedShop provides two methods of obtaining hierarchical profiling information. The first method is called gprof and it is used in conjunction with the ideal time profile. To obtain the gprof information for the above example, we simply add the flag -gprof to the prof command:

% prof -gprof adi2.ideal.4920

  Prof run at: Wed Jan 15 16:52:09 1997
 Command line: prof -gprof adi2.ideal.4920


          285898739: Total number of cycles
           1.45867s: Total execution time
          285898739: Total number of instructions executed
              1.000: Ratio of cycles / instruction
                196: Clock rate in MHz
             R10000: Target processor modelled


---------------------------------------------------------
Procedures sorted in descending order of cycles executed.
Unexecuted procedures are not listed. Procedures
beginning with *DF* are dummy functions and represent
init, fini and stub sections.
---------------------------------------------------------

        cycles(%)  cum %     secs    instrns    calls procedure(dso:file)

  68026368(23.79)  23.79     0.35   68026368 32768 xsweep(adi2.pixie:adi2.f)
  68026368(23.79)  47.59     0.35   68026368 32768 ysweep(adi2.pixie:adi2.f)
  68026368(23.79)  71.38     0.35   68026368 32768 zsweep(adi2.pixie:adi2.f)
  35651584(12.47)  83.85     0.18   35651584 2097152 rand_(./libftn.so.pixn32:../../libF77/rand_.c)
  27262976( 9.54)  93.39     0.14   27262976 2097152 irand_(./libftn.so.pixn32:../../libF77/rand_.c)
                                                  .
                                                  .
                                                  .

          All times are in milliseconds.
--------------------------------------------------------------------------------
          NOTE: any functions which are not part of the call
                graph are listed at the end of the gprof listing
--------------------------------------------------------------------------------

                                    self             kids        called/total    parents
 index          cycles(%)          self(%)          kids(%)  called+self     name     index
                                    self             kids        called/total    children

   [1]  285895481(100.00%)         57( 0.00%)  285895424(100.00%)     0       __start [1]
                                      50          285895369         1/1         main [2]
                                       3                  0         1/1         __istart [112]
                                       2                  0         1/1         __readenv_sigfpe [113]

--------------------------------------------------------------------------------

                                      50          285895369         1/1         __start [1]
   [2]  285895419(100.00%)         50( 0.00%)  285895369(100.00%)     1       main [2]
                                18874113          267020445         1/1         fake_adi [3]
                                     205                606         5/5         signal [44]

--------------------------------------------------------------------------------

                                18874113          267020445         1/1         main [2]
   [3]  285894558(100.00%)   18874113( 6.60%)  267020445(93.40%)     1       fake_adi [3]
                                68026368                  0     32768/32768     zsweep [4]
                                68026368                  0     32768/32768     ysweep [5]
                                68026368                  0     32768/32768     xsweep [6]
                                35651584           27262976     2097152/2097152     rand_ [7]
                                      28              13486         2/2         s_wsfe64 [9]
                                      26               5368         2/2         e_wsfe [17]
                                      22               2610         1/1         do_fioxr4v [25]
                                      22               2610         1/1         do_fioxr8v [24]
                                      23               2428         1/1         s_stop [28]
                                     114                 44         2/2         dtime_ [68]

--------------------------------------------------------------------------------

                                68026368                  0     32768/32768     fake_adi [3]
   [4]   68026368(23.79%)   68026368(100.00%)          0( 0.00%) 32768       zsweep [4]

--------------------------------------------------------------------------------

                                68026368                  0     32768/32768     fake_adi [3]
   [5]   68026368(23.79%)   68026368(100.00%)          0( 0.00%) 32768       ysweep [5]

--------------------------------------------------------------------------------

                                68026368                  0     32768/32768     fake_adi [3]
   [6]   68026368(23.79%)   68026368(100.00%)          0( 0.00%) 32768       xsweep [6]

--------------------------------------------------------------------------------
                                                  .
                                                  .
                                                  .

This produces the usual ideal time profiling output, but following that is the hierarchical information. There is a block of information for each subroutine in the program. A number, shown in brackets (e.g., [1]), is assigned to each routine so that the information pertaining to it can easily be located in the output. Let's look in detail at the block of information provided; we'll use fake_adi [3] as an example.

The line beginning with the number [3] shows, from left to right:

The number of cycles consumed by this routine and the routines it calls (its descendants);
The number of cycles spent inside the routine, but not in any of its descendants;
The number of cycles spent in its descendants;
The total number of times the routine was called;
The name of the routine, fake_adi [3].

Above this line are lines showing which routines fake_adi [3] was called from. In this case, it is only called from one place, main [2], so there is just one line (in general, there would be one line for each routine which calls fake_adi [3]). This line shows:

The proportion of the cycles spent inside fake_adi [3] due to the call from main [2];
The proportion of time spent in fake_adi [3]'s descendsants due to the call from main [2];
How many calls there are to fake_adi [3] in main [2] / the total number of calls to fake_adi [3] from all places in the program.

Since fake_adi [3] is only called once, all the time in it and its descendants is due to this one call.

Below the line beginning with the number [3] are all the decendants of fake_adi [3]. For each of the descendants we see:

The proportion of the descendant's cycles spent inside it;
The proportion of the descendant's cycles spent in its decendant's (i.e., fake_adi [3]'s grandchildren);
How many calls to the descendant there are in fake_adi [3] / the total number of calls to the descendant from all places in the program.

This block of information allows us to determine not just which subroutines, but which paths in the program, are responsible for the majority of time. The only limitation to this is that gprof reports ideal time, so cache misses are not represented.

3.3.4.5 Usertime

Since gprof only reports ideal time, it would be nice to have a way to get hierarchical profiling information which accurately accounts for all time in the program, the way PC sampling does. Well, there is a way, and it is called usertime profiling. For this type of profiling, the program is sampled during the run. At each sample, not only is the location of the program counter noted, but the entire call stack is traced to record which routines have been called in order to get to this point in the program. From this a hierarchical profile is constructed. Since unwinding the call stack is a reasonably expensive operation, the sampling period is for usertime profiling is relatively long: 30 milliseconds.

usertime profiling is performed with the following command:

% ssrun -usertime adi2

The output is written to a file called ./adi2.usertime.pid, where pid is the process ID for this run of adi2. The profile is displayed using prof just as for PC sampling and ideal time profiling:

% prof adi2.usertime.19572

The output is as follows:

-------------------------------------------------------------------------------
Profile listing generated Wed Jan 15 16:57:10 1997
    with:       prof adi2.usertime.19572
-------------------------------------------------------------------------------

        Total Time (secs)     : 9.99
        Total Samples         : 333
        Stack backtrace failed: 0
        Sample interval (ms)  : 30
        CPU                   : R10000
        FPU                   : R10010
        Clock                 : 196.0MHz
        Number of CPUs        : 1


-------------------------------------------------------------------------------

index  %Samples     self descendents  total        name
[1]    100.0%    0.00        9.99     333          __start
[2]    100.0%    0.00        9.99     333          main
[3]    100.0%    0.09        9.90     333          fake_adi
[4]     80.8%    8.07        0.00     269          zsweep
[5]      7.5%    0.75        0.00      25          xsweep
[6]      6.9%    0.69        0.00      23          ysweep
[7]      3.9%    0.12        0.27      13          rand_
[8]      2.7%    0.27        0.00       9          irand_

The information is less detailed than that provided by gprof, but when combined with what we know from gprof, we can get a complete hierarchical profile for all routines which have run long enough to be sampled.

3.3.4.6 Hardware Counter Profiling

As described above, PC sampling can be performed using a variety of time bases for the sampling interval. We have already seen examples of [f]pcsamp, which uses a constant time base of either 1 or 10 milliseconds, and usertime, which uses a constant time base of 30 milliseconds. The other time bases which may be used are the R10000 counter registers.

These counters can be programmed to generate an interrupt after counting an event a specified number of times; this is called the overflow value. This interrupt is then used to generate a profile. The most commonly used counters have been given names so that they may be used directly with the ssrun command. These are:

[f]gi_hwc: graduate instructions
[f]cy_hwc: cycle counter
[f]ic_hwc: primary instruction cache misses
[f]isc_hwc: secondary instruction cache misses
[f]dc_hwc: primary data cache misses
[f]dsc_hwc: secondary datacache misses
[f]tlb_hwc: TLB misses
[f]gfp_hwc: graduated floating point instructions

The f-variants sample at a faster rate by setting the overflow to a lower value. The overflow values for these experiment types were givenin the table above.

In addition to these named counters, a catch-all experiment type, prof_hwc, has been provided so that counters which have not been explicitly named may also be sampled. The counter number and overflow value are specified via environment variables as follows:

% setenv _SPEEDSHOP_HWC_COUNTER_NUMBER   <cntr>
% setenv _SPEEDSHOP_HWC_COUNTER_OVERFLOW <ovfl>

As an example, let's sample TLB misses in adi2:

% ssrun -tlb_hwc adi2
 Time:      7.887 seconds
 Checksum:  5.6160428338E+06
% prof adi2.tlb_hwc.18520
-------------------------------------------------------------------------------
Profile listing generated Wed Jan 15 19:27:52 1997
    with:       prof adi2.tlb_hwc.18520 
-------------------------------------------------------------------------------

        Counter               : TLB misses
        Counter overflow value: 257
        Total numer of ovfls  : 21181
        CPU                   : R10000
        FPU                   : R10010
        Clock                 : 195.0MHz
        Number of CPUs        : 1


-------------------------------------------------------------------------------
  -p[rocedures] using counter overflow.
  Sorted in descending order by the number of overflows in each procedure.
  Unexecuted procedures are excluded.
-------------------------------------------------------------------------------

  overflows(%)  cum overflows(%)      procedure (dso:file)

  21169( 99.9)      21169( 99.9)         zsweep (adi2:adi2.f)
      4(  0.0)      21173(100.0)         ysweep (adi2:adi2.f)
      4(  0.0)      21177(100.0)         xsweep (adi2:adi2.f)
      4(  0.0)      21181(100.0)       fake_adi (adi2:adi2.f)

  21181(100.0)                      TOTAL

The profile reports the number of overflows of the TLB miss counter which occurred in each subroutine. To get the total number of TLB misses, which is what perfex would report, we need to multiply by the overflow value, 257.

This profile confirms that virtually all the TLB misses occur in the routine zsweep.

3.3.4.7 Exception Profiling

Floating point exceptions that are not specifically handled by the program (using the handle_sigfpes library procedure, documented in the handle_sigfpes(3C) reference pages) cause hardware traps nevertheless, and then are ignored in software. A high number of ignored exceptions can cause a program to run slowly for no apparent reason.

The ssrun experiment -fpe creates a trace file that records all floating point exceptions. You use prof to display the profile as usual. The report shows the procedures and lines that caused exceptions. In effect, this is a sampling profile in which the sampling time base is the occurrence of an exception.

You can use exception profiling to verify that a program does not have significant exceptions. When you know that, you may safely set a higher level of exception handling for the compiler using -TENV:X which can sometimes provide a small gain in performance.

When a program does generate a significant number of exceptions, this is almost invariably due to floating point underflows (a large number of divide-by-zero or invalid operations are likely to be fatal to the progam!). A large number of underflows will cause performance problems since, unless they are flushed to zero in hardware, they will be handled by software in the operating system kernel. So a lot of underflows will generate a lot of traps to the kernel to finish the calculations. While the results will be correct, excess system time will result. On R8000-based systems, the default is to flush underflows to zero, but on R10000-, R5000- and R4400-based systems, the default is to let the kernel handle the gradual underflow.

If gradual underflow is not required for numerical accuracy, underflows can be flushed to zero. This can be done by two methods, both of which are described in detail in handle_sigfpes(3C):

Link in the floating point exception library, libfpe, and use the TRAP_FPE environment variable or the handle_sigfpes subroutine call to indicate that underflows be flushed to 0. The environment variable is very easy to use:
```
  % ld -o program ... -lfpe
  % setenv TRAP_FPE "UNDERFL=ZERO"
  % program ...
```
Important Note: If you have compiled the program with a higher than default value for the -TENV:X flag, you should not use libfpe to enable any exceptions which the compiler now assumes are not enabled. This could very easily result in the program generating mysterious floating point exceptions. For example, if you compile with -TENV:X=3, the compiler assumes that floating point divide-by-zero exceptions are turned off. As a result, it may replace an innocent construct such as
```
      if (x .ne. 0.0) b = 1.0/x
```
with the following speculative code
```
      tmp = 1.0/x
      if (x .ne. 0.0) b = tmp
```
which may divide by zero. If the divide-by-zero exception is disabled, this speculative code cannot cause problems since if tmp is ever infinity, it is not stored in b. However, if libfpe has been used to enable the divide-by-zero exception, your program will abort if the calculation of tmp divides by zero, and it could be pretty painful to figure out what happened.

Use the set_fpc_csr() call from within the program to explicitly set the hardware bit controlling whether underflows are flushed to 0. This is done as follows cf. the handle_sigfpes(3C) man page:

  #include <sys/fpu.h>

  void
  flush_to_zero(int on_off)
  {
  union fpc_csr   n;

       n.fc_word = get_fpc_csr();
       if ( on_off == 0 ) {
            n.fc_struct.flush = 0;
       } else {
            n.fc_struct.flush = 1;
       }
       set_fpc_csr(n.fc_word);
  }

The hardware bit which controls how underflows are handled is located in a part of the R10000 floating point unit called the floating point status register (FSR). In addition to setting the way underflows are treated, this register is also used to select which IEEE rounding mode is used and which IEEE exceptions are enabled (i.e., not ignored). Furthermore, it contains the floating point condition bits, which are set by floating point compare instructions, and cause bits, which are set when any floating point exception occurs (whether or not the exception is ignored). This register is documented in the MIPS R10000 Microprocessor User's Manual.

The fpc_csr union, defined in <sys/fpu.h>, mirrors the floating point status register. One can set/clear bits in the FSR using this union. All one needs do is retrieve the current value of the FSR using get_fpc_csr(), set/clear the appropriate bit(s) in the fpc_csr union, and then call set_fpc_csr() to write the updated values into the register.

The names of the bits are given in the union definition:

/*
 * Structure and constant definisions for the floating-point control
 * control and status register (fpc_csr).
 */
#ifdef _LANGUAGE_C
union fpc_csr {
        unsigned int fc_word;
        struct {
#ifdef _MIPSEB
                unsigned fcc : 7;       /* only used for mips4 */
                unsigned flush : 1;
                unsigned condition : 1;
                unsigned reserved1 : 5;

                unsigned ex_unimplemented : 1;
                unsigned ex_invalid : 1;
                unsigned ex_divide0 : 1;
                unsigned ex_overflow : 1;
                unsigned ex_underflow : 1;
                unsigned ex_inexact : 1;

                unsigned en_invalid : 1;
                unsigned en_divide0 : 1;
                unsigned en_overflow : 1;
                unsigned en_underflow : 1;
                unsigned en_inexact : 1;

                unsigned se_invalid : 1;
                unsigned se_divide0 : 1;
                unsigned se_overflow : 1;
                unsigned se_underflow : 1;
                unsigned se_inexact : 1;

                unsigned rounding_mode : 2;
        } fc_struct;
};
#endif /* _LANGUAGE_C */

The only bits which control actions are the flush bit (flush), the enable bits (en_*) and the rounding mode bits (rounding_mode).

Setting the flush bit to 1 means underflows are flushed to zero in hardware. Setting it to 0 means that underflows are not flushed to zero: they will generate an unimplemented instruction exception which tells the OS to perform gradual underflow in software.

The enable bits turn on various IEEE exceptions if set to 1; they disable them if set to 0. The names pretty much explain what they do (except that underflows are potentially pre-empted by the flush bit).

Finally, the 4 IEEE rounding modes are:

Round-to-nearest representable value
Round towards 0
Round to Plus Infinity
Round to Minus Infinity

Any floating point option can be controlled by writing a routine like the one above which sets/ clears the appropriate FSR bits.

3.3.5 Address Space Profiling

Speedshop and perfex profile the execution path, but there is a second dimension to program behavior: data references.

The dprof tool, like ssrun, executes a program and samples the program while it runs. Unlike ssrun, dprof does not sample the PC and stack. It samples the current operand address of the interrupted instruction, and accumulates a histogram of access to data addresses, grouped in virtual page units. The following is an example of running dprof on adi2:

% dprof -hwpc -out adi2.dprof adi2
 Time:     27.289 seconds
 Checksum:  5.6160428338E+06
% ls -l adi2.dprof
-rw-r--r--    1 guest      40118 Dec 18 18:54 adi2.dprof
% cat adi2.dprof
--------------------------------------------------
   address     thread       reads          writes
--------------------------------------------------
0xfffeff4000     0            168              11
0xfffeff8000     0            228               7
0xfffeffc000     0            156              11
0xffff000000     0            126               7
0xffff004000     0            153               9
0xffff008000     0             71              12
0xffff00c000     0             40               3
0xffff010000     0            116              10
0xffff014000     0            219              38
0xffff018000     0            225              73
0xffff01c000     0            193              53
0xffff020000     0            127              41
0xffff024000     0            246              31
0xffff028000     0            147              18
0xffff02c000     0             50              23
0xffff030000     0             87              47
0xffff034000     0            134              36
0xffff038000     0            165              37
0xffff03c000     0            136              37
0xffff040000     0             98              32
...
0xfffffe8000     0            389              46
0xfffffec000     0            481              50
0xffffff0000     0            262              36
0xffffff4000     0             35               8

Each line contains a count of all references from one program thread (one IRIX process, but this program has only a single process) to one virtual memory page. (The addresses at the left increment by 0x04000, or 16KB.)

dprof operates by statistical sampling; it does not record all references. The time base is either the interval timer (option -itimer) or the R10000 event cycle counter (option -hwpc). The default interrupt frequency using the cycle counter samples only about 1 instruction in 10000. This produces a coarse sample that is not likely to be repeatable from run to run. However, even this sampling rate slows program execution by almost 3 times.

You can obtain a more detailed sample by specifying a shorter overflow count (option -ovfl), but this will extend program execution time proportionately. The coarse histogram is useful for showing which pages are used. For example, you can plot the data as a histogram. Using gnuplot (which is available on the SGI Freeware CDROM), a simple plot of total access density is obtained as follows:

% /usr/freeware/bin/gnuplot

        G N U P L O T
        unix version 3.5 
        patchlevel 3.50.1.17, 27 Aug 93
        last modified Fri Aug 27 05:21:33 GMT 1993 

        Copyright(C) 1986 - 1993   Thomas Williams, Colin Kelley

        Send comments and requests for help to info-gnuplot@dartmouth.edu
        Send bugs, suggestions and mods to bug-gnuplot@dartmouth.edu

Terminal type set to 'x11'
gnuplot> plot "<tail +4 adi2.dprof|awk '{print NR,$3+$4,$2}'" with box

For this is a single-threaded application, the information provided by dprof is not of great benefit. But for multi-threaded applications, it can reveal quite a lot. Since access to the local memory is faster than access to remote memories on an Origin system, the best performance on parallel programs which sustain a lot of cache misses will be achieved if you can arrange it so that each thread primarily accesses its local memory. dprof gives you the ability to analyze which regions of virtual memory each thread of a program accesses. Since dprof's utility is primarily for multi-threaded applications, we will postpone a more detailed look at how it can be used until later.

3.4 Step 4: Let the Compiler Do the Work

Once the amount of code to be tuned has been narrowed down, it's time to actually do the work of tuning. The most important tool available for accomplishing this is the compiler. Ideally, the compiler should be able to automatically make your program run its fastest. But this simply isn't possible because the compiler does not have enough information to always make optimal decisions.

As an example consider a program which obtains its input data from a file. The size of the data are not known until run time. For those data sets which are too big to fit in cache, the best performance will be achieved if the compiler has performed cache optimizations on the code. But for those data sets which do fit in cache, these optimizations should not be done since they will add overhead and make the program run somewhat slower. So what should the compiler do?

Most compilers are limited in the cache optimizations they can perform, so they optimize for in-cache code; this is the case for the MIPSpro 6.x compilers. The nice aspect of this choice is that it makes it fairly easy to verify that the compiler has properly optimized a program: The transformations it makes apply to relatively small sections of code --- typically loops --- and in the case of the MIPSpro 6.x compilers, "report cards" can be generated to show how well each loop has been optimized. Unfortunately, most data sets don't fit in cache, so the loop-by-loop report cards usually don't indicate the true performance of the program.

The MIPSpro 7.x compilers, however, can perform extensive cache optimizations. If you request the highest optimization level, the compiler will optimize for data which do not fit in cache. This means that the majority of programs will benefit from the cache optimizations. But those programs which operate on in-cache data, or in which the programmer has manually made cache optimzations, may have to pay a performance penalty. Fortunately, this penalty will typically be small compared to the advantage most programs gain from cache optimzations. So, given that the compiler must make a choice, this is on balance the right one. However, it does mean that if you understand what your program is doing, you may be able improve its performance by selectively enabling and disabling the compiler's cache optimizations.

This, of course, applies to the other optimizations as well, and is the subject of the following sections. Coming up, we will look at the types of optimizations the compiler is capable of performing and the switches which enable or disable them. For ease of exposition, we will describe the various optimizations in separate sections as if they are independent of each other. The major categories of optimizations covered will include

Software Pipelining
Inter-Procedural Analysis
Loop Nest optimizations

In reality, these optimizations are not independent of each other: some loop nest optimizations can enhance software pipelining, and inter-procedural analysis can give the loop nest optimizer more information from which to better do its job. As a result, there is perhaps no best order in which to describe them. The above order is convenient, however, since it starts with operations which are fairly localized (i.e., loops) and then progresses to more complicated and global transformations (i.e., cache optimizations).

This order, however, should not be construed as a ranking of their importance in helping your program achieve its best performance. To the contrary, the loop nest optimzations are probably the most important in this respect since they attempt to make your program more cache-friendly. Since these optimizations can be complicated and understanding them requires a knowledge of how caches work, many users will prefer to just let the compiler accomplish what it can automatically. However, for sophistocated users who wish greater control over cache optimizations, the compiler does allow great flexibility in how they are applied. This topic is significant enough that the loop nest optimizations, along with the principles of cache optimizations, will be discussed in a separate section following the other compiler optimzations.

3.4.1 Recommended Flags

Given that the various optimizations the compiler will perform can interact with each other, it is impossible to provide a simple formula specifying which optimizations should be attempted in which order to achieve the best results in all cases. Nevertheless, we can make some recommendations which work well in general. A good set of compiler flags to use are the following:

-n32 -mips4 -Ofast=ip27 -OPT:IEEE_arithmetic=3

These flags:

select the ABI and ISA (use -64 instead of -n32 if you need the full capabilities of the 64-bit operating system, such as the ability to address more than 2 GB of memory),
turn on the highest level of optimization (which includes loop nest optimizations and software pipelining),
tell the compiler to optimize the code for the Origin2000,
enable inter-procedural analysis, and
give the compiler permission to do other optimizations which favor speed over preserving precise numerical rounding rules.

In addition, when linking in math routines, be sure to use the fast math library:

-lfastm -lm

These flags can be used to compile and link in one step:

% cc -n32 -mips4 -Ofast=ip27 -OPT:IEEE_arithmetic=3 \
        -o program program.c -lfastm -lm

or separately:

% cc -n32 -mips4 -Ofast=ip27 -OPT:IEEE_arithmetic=3 -c program.c
% cc -n32 -mips4 -Ofast=ip27 -OPT:IEEE_arithmetic=3 \
        -o program program.o -lfastm -lm

Using this set of flags will get you a good deal of the way towards optimal performance. For some users, this will be more than adequate, so we will finish this section by first showing you how you can easily get into the habit of using these recommended flags, and after that we will describe them in a little more detail. In the following sections, we'll present material for users who want to know more than just the basic optimization flags.

3.4.1.1 Don't Rely on Defaults

In order to make using these flags convenient, it is recommended to get in the habit of defining makefile variables for them. Below is an example of a good starting point for a Makefile; for a simply-constructed program, this may be all you need:

#! /usr/sbin/smake
#
FC      = $(TOOLROOT)/usr/bin/f77
CC      = $(TOOLROOT)/usr/bin/cc
LD      = $(FC)
F77     = $(FC)
SWP     = swplist
RM      = /bin/rm -f
MP      =
ABI     = -n32
ISA     = -mips4
PROC    = ip27
ARCH    = $(MP) $(ABI) $(ISA)
OLEVEL  = -Ofast=$(PROC)
FOPTS   = -OPT:IEEE_arithmetic=3
COPTS   = -OPT:IEEE_arithmetic=3
FFLAGS  = $(ARCH) $(OLEVEL) $(FOPTS)
CFLAGS  = $(ARCH) $(OLEVEL) $(COPTS)
LIBS    = -lfastm
LDFLAGS = $(ARCH) $(OLEVEL)
PROF    =

FOBJS   = f1.o f2.o f3.o
COBJS   = c1.o c2.o c3.o
OBJS    = $(FOBJS) $(COBJS)
EXEC    = executable

$(EXEC):        $(OBJS)
        $(LD) -o $@ $(PROF) $(LDFLAGS) $(QV_LOPT) $(OBJS) $(LIBS)

clean:
        $(RM) $(EXEC) $(OBJS)

.SUFFIXES:
.SUFFIXES: .o .F .c .f .swp

.F.o:
        $(FC)  -c $(FFLAGS) $(QV_OPT) $(DEFINES) $<

.f.o:
        $(FC)  -c $(FFLAGS) $(QV_OPT) $(DEFINES) $<

.c.o:
        $(CC)  -c $(CFLAGS) $(QV_OPT) $(DEFINES) $<

.F.swp:
        $(SWP) -c $(FFLAGS) $(QV_OPT) $(DEFINES) -WK,-cmp=$*.m $<

.f.swp:
        $(SWP) -c $(FFLAGS) $(QV_OPT) $(DEFINES) -WK,-cmp=$*.m $<

.c.swp:
        $(SWP) -c $(CFLAGS) $(QV_OPT) $(DEFINES) $<

Such a Makefile can even be generated automatically with a simple script, which we call makemake. To use it, create a directory for your program and move the source files into the directory. Then, if makemake is in your path, you can create a Makefile for the program by simply typing

% makemake executable

where executable is the name you choose for the program. The script will automatically identify all the C and Fortran files in the directory and list them in the Makefile for you. To build the program, type

% make

and the recommended flags will automatically be used. A version suitable for debugging can be built with the following command:

% make OLEVEL=-g

To add extra compilation options, use the DEFINES variable. For example:

% make DEFINES="-DDEBUG -OPT:alias=restrict"

To remove the executable and objects:

% make clean

3.4.1.2 `-On`

The basic optimization flag is -On, where n is 0, 1, 2, 3, or fast. This flag controls which optimizations the compiler will attempt; the higher the optimization level, the more aggressive the optimizations that will be tried. This has an impact on compilation time: in general, the higher the optimization level, the longer the module will take to compile.

Program modules which consume a noticeable fraction of run time should be compiled with the highest level of optimization. This may be specified with either -O3 or -Ofast. Both flags enable software pipelining and loop nest optimizations. The difference between the two is that -Ofast includes additional optimizations such as inter-procedural analysis (-IPA), arithmetic rearrangements which might cause numerical roundoff differences (-OPT:roundoff=3), and interprets ceratin pointer variables as being independent of one another, which allows faster code to be generated (-OPT:alias=typed). In addition, -Ofast takes an optional argument to allow it to specify to which machine it should target the generated code. For an Origin2000, one should use -Ofast=ip27.

For modules which consume a smaller fraction of the program's runtime, the optimization level used is determined by how much time one is willing to allow for compilation. If compilation time is not an issue, it is easiest to use -O3 or -Ofast=ip27 for all modules. If it is desireable to reduce compilation time, a lower optimzation level may be used for non-performance-critical modules. -O2, or equivalently -O, performs extensive optimzations, but they are generally conservative. That is, they almost always improve runtime performance, and they won't cause numerical roundoff differences; sophistocated optimizations, such as software pipelining and loop nest optimzations, are not performed. -O2 takes longer to compile than -O1, so the latter may be used when there is a strong need to reduce compile time. At this level only very simple optimizations are performed.

The lowest optimzation level, -O0, should only be used when debugging your program. The purpose of this flag is to turn off all optimizations so that there is a direct correspondence between the original source code and the machine instructions the compiler generates; this allows a source-level debugger to show you exactly where in your program the currently executing instruction is. If you don't specify an optimization level, this is what you get, so be sure to explicitly specify the optimization level you want.

3.4.1.3 `-r10000,` `-TARG:proc=r10000 & -TARG:platform=ipxx`

While the R10000, R8000 and R5000 all support the MIPS IV instruction set architecture, their hardware implementations are quite different. Code scheduled for one chip will not necessarily run optimally on another. One can instruct the compiler to generate code optimized for a particular chip via the -r10000, -r8000 and -r5000 flags. When used while linking, these flags will cause libraries optimized for the specified chip, if available, to be linked in rather than generic versions; specially optimized versions of libfastm exist for both the R8000 and R10000. An alternative flag, -TARG:proc=(r4000|r5000|r8000|r10000), will generate code optimized for the specified chip without affecting which libraries are linked in. If -r10000 has been used, -TARG:proc=r10000 is unnecessary.

In addition to specifying the target chip, one can also specify the target processor board. This gives the compiler information about various hardware parameters, such as cache sizes, which can be used in some optimzations. One can specify the target platform with the -TARG:platform=ipxx flag. For Origin2000, -TARG:platform=ip27 should be used.

Note that if the -Ofast=ip27 flag has been used, one does not need to to use the -r10000, -TARG:proc=r10000 or -TARG:platform=ip27 flags.

3.4.1.4 `-OPT:IEEE_arithmetic`

This option controls how strictly the compiler should adhere to the IEEE 754 standard. There is flexibility here for a couple of reasons:

The MIPS IV ISA contains instructions such as reciprocal and reciprocal square root which do not meet the accuracy specified by IEEE 754 standard. While the inaccuracy is small (no more than 2 units in the last place, or 2 ulp), use of these instructions will mean that a program is not strictly IEEE-compliant.
Some standard optimizations produce different results from a strictly IEEE-compliant implementation. For example, converting
```
      do i = 1, n
         b(i) = a(i)/c
      enddo
```
to
```
      tmp = 1.0/c
      do i = 1, n
         b(i) = a(i)*tmp
      enddo
```
is not allowed by the IEEE standard since multiplying by the reciprocal (even a completely IEEE accurate representation of it) rather than dividing can produce results that differ in the least significant place.

For most programs, such strict adherence to the standard is not necessary. This flag provides a way for the user to inform the compiler to what degree it may relax strict IEEE compliance.

This flag permits three levels of conformance as described by the f77(1) man page:

-OPT:IEEE_arithmetic=n
    Specify the level of conformance to IEEE 754 floating point
    arithmetic roundoff and overflow behavior. At level 1 (the
    default), do no optimizations which produce less accurate
    results than required by IEEE 754. At level 2, allow the use
    of operations which may produce less accurate inexact results
    (but accurate exact results) on the target hardware.
    Examples are the recip and rsqrt operators for a MIPS IV
    target. At level 3, allow arbitrary mathematically valid
    transformations, even if they may produce inaccurate results
    for IEEE 754 specified operations, or may overflow or
    underflow for a valid operand range. An example is the
    conversion of x/y to x*recip(y) for MIPS IV targets. See
    also roundoff below.

In general, we recommend using the -OPT:IEEE_arithmetic=3 flag.

3.4.1.5 `-OPT:roundoff`

The order in which arithmetic operations are coded into a program dictates the order in which they are to be carried out. For example, the Fortran statement

      a = b + c + d + e

specifies that b shall first be added to c, then the result added to d, then the result of that added to e, and finally the result of this last addition shall be stored in a. Most of the time, adhering to such a strict order is more than is required to produce numerically valid results. In addition, it is likely to restrict the compiler from taking advantage of all the hardware resources available (i.e., pipelined floating point units).

The -OPT:roundoff flag allows the user to specify how strictly the compiler must adhere to such coding order semantics. Like the -OPT:IEEE_arithmetic flag above, there are several levels. The lowest level enforces strict compliance, while higher levels permit the compiler more freedom to efficiently use the available hardware at the expense of generating results with numerical roundoff different than that generated by the strict order. The four levels are described in the f77(1) man page as follows:

-OPT:roundoff=n
    Specify the level of acceptable departure from source
    language floating-point round-off and overflow semantics.  At
    level 0 (the default at optimization levels -O0 to -O2), do
    no optimizations which might affect the floating-point
    behavior.  At level 1, allow simple transformations which
    might cause limited round-off or overflow differences
    (compounding such transformations could have more extensive
    effects).  At level 2 (the default at optimization level -
    O3), allow more extensive transformations, such as the
    execution of reduction loop iterations in a different order.
    At level 3, any mathematically valid transformation is
    enabled.  Best performance in conjunction with software
    pipelining normally requires level 2 or above, since
    reassociation is required for many transformations to break
    recurrences in loops.  See also IEEE_arithmetic
    above.

In general, we recommend using the -OPT:roundoff=3 flag, but use of this flag is not always without consequences. The reassociations permitted by this flag will cause the roundoff characteristics of the program to change. Generally this is not a problem, but in some instances the program's results can be different enough from what had previously been obtained so as to be considered wrong. While such a situation may call into question the numerical stability of the program, the easiest course of action may be to drop the -OPT:roundoff=3 flag in order to generate the expected results.

The -Ofast flag enables -OPT:roundoff=3. If you are using -Ofast and wish to turn down the roundoff level, you will need to explicitly set the desired level:

% cc -n32 -mips4 -Ofast=ip27 -OPT:roundoff=2 ...

3.4.1.6 Aliasing Models

Using -Ofast implies the use of one additional flag, namely -OPT:alias=typed. This flag specifies the aliasing model to be typed. The aliasing model tells the compiler how to treat potential dependencies caused by dereferencing pointer variables. Without information from the user, the compiler needs to be cautious and assume any two pointers can point to the same location in memory. This caution removes many optimization opportunities, particularly for loops.

To fully understand the implications of this, we'll first need to see how the compiler optimizes loops. This is done via software pipelining, which is covered next. After that, we'll finish the discussion of aliasing models.

3.4.2 Software Pipelining

Loops contain parallelism. This ranges from, at minimum, instruction-level parallelism, up to parallelism sufficient to carry out different iterations of the loop on different processors. We will deal with multi-processor parallelism later; now we want to concentrate on instruction-level parallelism.

Consider the following loop which implements the vector operation DAXPY:

      do i = 1, n
         y(i) = y(i) + a*x(i)
      enddo

This adds a vector multiplied by a scalar to a second vector, overwriting the second vector. Each iteration of the loop requires the following instructions:

2 loads

1 store

1 multiply-add

2 address increments

1 loop-end test 1 branch

The R10000 employs a superscalar architecture capable of executing up to 4 instructions per cycle. These superscalar slots may be filled from the following menu:

1 load or store operation; and/or

1 ALU 1 instruction; and/or

1 ALU 2 instruction; and or

1 floating point add; and/or

1 floating point multiply

In addition, all instructions are pipelined and have associated pipeline latencies ranging from 1 to dozens of cycles. It is up to the compiler to figure out how to best schedule the instructions to fill up as many superscalar slots as possible and run the loop at top speed. Software pipelining is a compiler technique for optimally filling the superscalar slots.

Software pipelining does for loops what hardware pipelining does to speedup hardware instructions. In hardware pipelining, an instruction is decomposed into several atomic operations. Separate hardware is built to execute each atom of work. By overlapping instructions in time, all of the hardware can be run concurrently, each piece operating on an atom from a different instruction. The parallelism is not perfect: there is a pipeline startup phase in which one extra stage of the pipeline may run on each new cycle, and there is a pipeline drain phase, where one unit completes its work on each cycle. But if the number of instructions to carry out is large, these will be swamped by the time spent in the steady-state portion of the pipeline, so the performance will approach that of perfect parallelism.

In sofware pipelining, each loop iteration is broken up into instructions as was done above for DAXPY. The iterations are then overlapped in time in an attempt to keep all the functional units busy. Of course, not every loop can keep all the functional units busy; for example, a loop that does no floating point will not use the floating point adder or multiplier. And some units will be forced to remain idle on some cycles in order to observe the pipeline latencies or since only four instructions can be selected from the above menu. Nevertheless, the compiler can determine the minimum number of cycles required for the steady-state portion of the pipeline, and it will try to schedule the instructions so that this level of performance is approached as the number of iterations in a loop increases.

Do note, however, that in calculating the minimum number of cycles required for the loop, the compiler assumes the data are in-cache (more specifically, in the primary data cache). If the data are not in-cache, the loop will take longer to execute. In this case, the code can generally be made to run faster if the compiler generates some additional instructions to prefetch the data from memory into the cache well before they are needed. This is one of the cache optimizations we will describe later. But for the remainder of this discussion of software pipelining, we will only consider in-cache data. This makes it much easier to determine if the software pipeline steady-state has achieved the peak rate the loop is capable of. In the examples which follow, however, this will require disabling the cache optimizations since they are performed automatically when the recommended flags are used. We hope this does not cause any confusion.

3.4.2.1 Example: DAXPY

To see how effective software pipelining is, we return to the DAXPY example and consider its instruction schedule. We'll state in advance, though, that optimal performance for this loop is 1/3 of the peak floating point rate of the processor since three memory operations are done for each multiply-add, and the processor is capable of at most one memory instruction and one multiply-add per cycle.

Below is the simplest schedule one could come up with for an iteration of this loop:

        Cycle  0:       ld x    x++
        Cycle  1:       ld y
        Cycle  2:
        Cycle  3:                               madd
        Cycle  4:
        Cycle  5:
        Cycle  6:       st y    br      y++

In creating a schedule, several hardware restrictions must be observed:

only one load or store may be performed on each cycle,
a load of a floating point datum from the primary cache has a latency of three cycles,
since a madd instruction does a multiply followed by an add, the addend to the instruction may be loaded just 1 cycle before the madd since two cycles will be expended on the multiply before the addend is needed,
it takes four cycles to complete a madd if the result can be bypassed to another floating point operation, but five cycles if the output must be written to the register file as is required to store the result, and
the source operand of a store instruction is accessed two cycles after the address operand, so the result of a madd instruction may be stored 5-2 = 3 cycles after the madd begins.

The simple schedule for this loop achieves 2 floating point operations in 7 cycles, or 1/7 of peak (while a madd is a single instruction, it performs two floating point operations).

We can improve over this schedule by unrolling the loop so that it caclulates 2 vector elements per iteration:

      do i = 1, n-1, 2
         y(i+0) = y(i+0) + a*x(i+0)
         y(i+1) = y(i+1) + a*x(i+1)
      enddo
      do i = i, n
         y(i+0) = y(i+0) + a*x(i+0)
      enddo

The schedule for the steady-state part of the unrolled loop is

        Cycle  0:       ld x0
        Cycle  1:       ld x1
        Cycle  2:       ld y0   x+=4
        Cycle  3:       ld y1                   madd0
        Cycle  4:                               madd1
        Cycle  5:
        Cycle  6:       st y0
        Cycle  7:       st y1   y+=4    br

This loop achieves 4 floating point operations in 8 cycles, or 1/4 of peak; it's closer to the top rate of 1/3 of peak, but still not perfect.

3.4.2.2 The Software Pipelined Schedule for DAXPY

Here is a compiler-generated software pipelined schedule:

                        [ windup code ]

loop:
        Cycle  0:       ld y4
        Cycle  1:       st y0
        Cycle  2:       st y1
        Cycle  3:       st y2
        Cycle  4:       st y3
        Cycle  5:       ld x5
        Cycle  6:       ld y5
        Cycle  7:       ld x6
        Cycle  8:       ld x7                   madd4
        Cycle  9:       ld y6                   madd5
        Cycle 10:       ld y7   y+=4            madd6
        Cycle 11:       ld x0          bne out  madd7

        Cycle  0:       ld y0
        Cycle  1:       st y4
        Cycle  2:       st y5
        Cycle  3:       st y6
        Cycle  4:       st y7
        Cycle  5:       ld x1
        Cycle  6:       ld y1
        Cycle  7:       ld x2
        Cycle  8:       ld x3                   madd0
        Cycle  9:       ld y2                   madd1
        Cycle 10:       ld y3   y+=4            madd2
        Cycle 11:       ld x4          beq loop madd3
out:
                        [ winddown code ]

This loop achieves 8 floating point operations in 12 cycles, or 1/3 of peak, so its performance is optimal. Let's investigate how it works.

The software pipelined schedule consists of 2 blocks of 12 cycles. Each block is called a replication since the same code pattern repeats for each 12-cycle block. There are four madds and twelve memory operations per repliction (as well as a pointer increment and a branch), so each replication achieves the peak calculation rate. Iterations of the loop are spread out over the replications in a pipeline so that the superscalar slots are filled up.

The diagram below shows the memory operations and madds for twelve iterations of the DAXPY loop. Time, in cycles, runs horizontally across the diagram. During the first five iterations, the software pipeline is filled. This is the windup stage; for ease of presentation, it was omitted from the schedule above. Cycles 9 through 20 make up the first replication; cycles 21 through 32 the second replication; and so on. The compiler has unrolled the loop four times, so each replication completes four iterations. Each iteration is spread out over several cycles: 15 for the first of the four unrolled iterations, 10 for the second, and 9 for each of the third and fourth iterations. Since only one memory operation may be executed per cycle, these four iterations must be offset from each other, and it takes 18 cycles to complete all four of them. Two replications are required to contain the full 18 cycles of work for the four iterations. This is more than the minimum latency required to calculate four iterations (we saw above that this could be done in 14 cycles), but spreading the work out allows the iterations to be interleaved into a perfect pipeline.

A test is done at the end of each iteration to determine if all groups of four iterations have been started. If not, execution drops through to the next replication or branches back to the first replication at the beginning of the loop. If all groups of four iterations have begun, execution jumps out to code which will finish the in-progress iterations. This is the winddown code; for ease of presentation, it was omitted from the schedule above.

Due to the 18-cycle latency, a group of four iterations will begin in one replication, continue through the next, and wrap back around to the replication it started in. In the schedule above, numbers only appear for 8 iterations since this is sufficient to account for all operations in two replications. Higher numbered iterations are mapped to one of these 8 iterations using mod 8 arithmetic.

3.4.2.3 Software Pipelining Messages

In order to help the user determine how effective the compiler has been at software pipelining, the compiler can generate a report card for each loop. The user can request that these report cards be written to a listing file, or assembler output can be generated to see them.

As an example, once again consider the DAXPY loop:

      subroutine daxpy(n,a,x,y)
      integer n
      real*8 a, x(n), y(n)
c
      do i = 1, n
         y(i) = y(i) + a*x(i)
      enddo
c
      return
      end

The software pipelining report card can be generated in two ways:

Use the -S flag to generate an assembler file; the report card will be mixed in with the assembly code. (The flag -LNO:opt=0 is used to prevent certain cache optimzations from being performed which would obscure our current study of software pipelining; this flag will be described later.)
```
% f77 -n32 -mips4 -Ofast -OPT:IEEE_arithmetic=3 -LNO:opt=0 \
        -S daxpy.f 
```
Use the awk script swplist which generates the assember file, extracts the report cards, and inserts each into a source listing file just above the loop it pertains to; the listing file is given the same name as the file compiled, but with the extension .swp. swplist is used with the same options one uses to compile:
```
% swplist -n32 -mips4 -Ofast -OPT:IEEE_arithmetic=3 -LNO:opt=0 \
        -c daxpy.f 
```

The report card generated by either of these methods is

 #<swps>
 #<swps> Pipelined loop line 6 steady state
 #<swps>
 #<swps>    50 estimated iterations before pipelining
 #<swps>     2 unrollings before pipelining
 #<swps>     6 cycles per 2 iterations
 #<swps>     4 flops        ( 33% of peak) (madds count as 2)
 #<swps>     2 flops        ( 16% of peak) (madds count as 1)
 #<swps>     2 madds        ( 33% of peak)
 #<swps>     6 mem refs     (100% of peak)
 #<swps>     3 integer ops  ( 25% of peak)
 #<swps>    11 instructions ( 45% of peak)
 #<swps>     2 short trip threshold
 #<swps>     7 ireg registers used.
 #<swps>     6 fgr registers used.
 #<swps>

The purpose of the report card is to give us information as to how effectively the processor's hardware resources are being used in the schedule generated for the loop. There are lines showing how many instructions of each type there are per replication, and what percentage of each resource's peak utilization has been achieved. Let's look in detail at this information.

The report card is for the loop occurring at line 5 of the source file (the compiler reports a line number that is off by 1 in this case). The loop was unrolled 2 times and then software pipelined. The schedule generated by the compiler requires 6 cycles per replication, and each replication retires 2 iterations due to the unrolling.

The steady-state performance of the loop is 33% of the floating point peak. The "madds count as 2" line counts floating point operations. This indicates how fast the processor can run this loop on an absolute scale. Since the peak rate of the processor is 390 MFLOPS (2 floating point operations per cycle with a clock rate of 195 MHz), the steady-state portion of the loop runs at 130 MFLOPS.

Alternatively, the "madds count as 1" line counts floating point instructions. Since the floating point work in most loops does not consist entirely of madd instructions, this count indicates how much of the floating point capablities of the chip are being utilized in the loop given its particular mix of floating point instructions. A 100% utilization here, even if the floating point operation rate is less than 100% of peak, means that the chip is running as fast as it can.

For the R10000, which has an adder and a multiplier, this line does not provide much additional information beyond the "madds count as 2" line. In fact, the utilization percentage is even misleading in this case since all 2 of the floating point instructions in the replication are madds (cf. the next line in the report card), and they use both the adder and multiplier. Here, the instruction is counted as if it simply uses one of the units. On an R8000, for which there are two complete floating point units, each capable of add, multiply and madd instructions, such anomolies do not occur, and this line can provide more information. One can easily see cases in which the chip's floating point units are 100% utilized even though it is not sustaining two floating point operations per unit per cycle.

Each software pipeline replication contains 6 memory references, and this represents 100% utilization of this resource. Thus, this loop is load/store-bound. Sometimes that's the way it is, and no additional (non-memory) resources can make the loop run faster. But in other situations, loops can be re-written to reduce the number of memory operations, and in these cases performance can be improved. We'll see an example of this later.

Only three integer operations are used: a branch and two pointer increments. The total number of instructions for the replication is 11, so there are plenty of free superscalar slots.

The report card indicates that there is a "short trip threshold" of two. This means that for loops with fewer than two (groups of two) iterations, a separate, non-software-pipelined piece of code will be used. Since building up and draining the pipeline imposes some overhead, pipelining is avoided for very short loops.

Finally, we see that 7 integer registers (ireg) and 6 floating point registers (fgr) were required by the compiler to schedule the code. Although not an issue in this case, bigger and more complicated loops will require more registers, sometimes more than physically exist. In these cases, the compiler must either introduce some scratch memory locations to hold intermediate results, thus increasing the number of load and store operations for the loop, or stretch out the loop to allow greater reuse of the existing registers. Should these problems occur, splitting the loop into smaller pieces can often eliminate them.

This report card provides valuable information to the user. If a loop is running well, that will be validated by the information it contains. If the speed of the loop is disappointing, it will provide clues as to what should be done --- either via the use of additional compiler flags or by modifying the code --- to improve the performance.

3.4.2.4 Use -O3 to Enable Software Pipelining

The compiler will not software pipeline loops by default. In order to enable software pipelining, the highest level of optimization, -O3, must be used. This can be specified with either -Ofast, as above, or simply with -O3; the difference is that -Ofast is a superset of -O3, and also turns on other optimizations such as inter-procedural analysis. (Alternatively, the switch -SWP:=ON can be used with lower levels of optimization, but -O3 also enables loop nest and other optimizations which are desireable in conjunction with software pipelining, so this alternative is not recommended.)

The reason that software pipelining is not done by default is that it increases compilation time. While finding the number of cycles in an optimal schedule is easy, producing a schedule given a finite number of registers is a difficult task. Heuristic algorithms are employed to try to find a schedule. In some instances they will fail, and only a schedule with more cycles than optimal can be generated. The schedule which is found can vary depending on subtle code differences or the choice of compiler flags.

Some loops are well-suited to software pipelining. In general, vectorizable loops, such as the DAXPY example above, will software pipeline well with no special effort from the user. But getting some loops to software pipeline can be difficult. Coming up, we'll see that software pipelining cannot be done on loops containing function calls, and it can be impeded when loops are very large or contain recurrences. Fortunately, inter-procedural analysis, loop nest optimizations, and the use of compiler directives can often help in these situations.

For an in-order superscalar processor such as the R8000, it is essential that the key loops in a program be software pipelined in order to achieve good performance; the functional units will sit idle if concurrent work is not scheduled for them. With its out-of-order execution, the R10000 can be much more forgiving of the code generated by the compiler; even code scheduled for another processor can still run well. Nevertheless, there is a limit to the instruction reordering the processor can accomplish, so getting the most time consuming loops in a program to software pipeline is still key to achieving top performance.

3.4.2.5 Software Pipelining Failures

Not all loops will software pipeline. One of the most common causes for this is a subroutine or function call inside the loop. When a loop contains a call, the software pipelining report card is:

 #<swpf> Loop line 6 wasn't pipelined because:
 #<swpf>       Function call in the DoBody
 #<swpf>

If the subroutine contains loops of its own, then this software pipelining failure is generally inconsequential since it is the loops inside the subroutine that will need to be software pipelined. However, if there are no loops in the subroutine, finding a way to get the loop containing it to software pipeline may be important for achieving good performance.

Failures of this sort can be remedied by inlining the subroutine or functiion which appears in the loop. This could be done manually, but the compiler can often do this automatically via the inlining features of inter-procedural analysis. Inlining will be discussed later.

In addition, while-loops and loops which contain goto statements which jump out of the loop will not software pipeline. Due to the out-of-order execution ability of the R10000, such non-software pipelined loops can still run with acceptable speeds on Origin2000 systems. However, for in-order processors such as the R8000, such loops could incur a significant performance penalty, so if an alternate implementation which converts the loop to a standard do- or for-loop exists, it should be used.

3.4.3 Other Compiler Options and Directives

The -Ofast (or at least -O3) flag is the single most important compiler option to know about and use since it enables software pipelining, inter-procedural analysis (with -Ofast), loop nest and many other optimizations. But there are a handful of other options and directives which can also very useful since they can improve the software pipelined schedules the compiler generates and speed up non-software pipelined code as well. We will describe those next before delving into fine tuning inter-procedural analysis and loop nest optimizations.

3.4.3.1 `-OPT:alias=restrict and -OPT:alias=disjoint`

In order to optimize code effectively, the compiler needs to analyze the data dependencies in the program. For standard Fortran, this task is made easier since the language specifies that non-equivalenced arrays of different names do not alias each other, that is, the pieces of memory they represent do not overlap. When pointers are used --- and hence this is particularly applicable to C programs --- determining aliasing information can be impossible. Thus, in any ambiguous situation, the compiler needs to play it safe and assume aliasing is possible. Unfortunately, this can severly limit the types of optimizations the compiler can do, and it can cause many important loops to fail to software pipeline.

A programmer, however, often knows much more about the actual aliasing behavior than is apparant in the program. The -OPT:alias flag may be used to convey this extra information to the compiler. There are several different aliasing models which may be used, as described in the cc(1) man page:

-OPT:alias=name
    Specify the pointer aliasing model to be used.  If name is
    any, then the compiler will assume that any two memory
    references may be aliased unless it can determine otherwise
    (the default).  If name is typed, the ANSI rules for object
    reference types (Section 3.3) are assumed - essentially,
    pointers of distinct base types are assumed to point to
    distinct, non-overlapping objects. If name is no_typed,
    pointers to different base types may point to the same
    object.  If name is unnamed, pointers are also assumed never
    to point to named objects. If name is no_unnamed, pointer
    may point to named object.  If name is no_restrict, distinct
    pointer variables may point to overlapping storage.  If name
    is restrict, distinct pointer variables are assumed to point
    to distinct, non-overlapping objects.  If name is
    no_disjoint, distinct pointer expressions may point to
    overlapping storage.  If name is disjoint, distinct pointer
    expressions are assumed to point to distinct, non-
    overlapping objects.  The restrict and disjoint options are
    unsafe, and may cause existing C programs to fail in obscure
    ways, so it should be used with extreme care.

If it is the case that different pointers point to non-overlapping regions of memory, then the -OPT:alias=restrict flag should be used to allow the compiler to generate the best performing code. Note, though, that if this flag is used in a situation in which pointers can alias the same data, incorrect code can be generated. This is not a compiler problem: it is a case of the user providing the compiler with false information.

As an example, consider the following C implementation of DAXPY:

void
daxpy(int n, double a, double *x, double *y)

{
  int   i;

  for (i=0; i<n; i++) {
    y[i] += a * x[i];
  }
}

When compiled with

% cc -n32 -mips4 -Ofast=ip27 -OPT:IEEE_arithmetic=3 -LNO:opt=0 \
        -c daxpy.c

the software pipelining message is

 #<swps>
 #<swps> Pipelined loop line 8 steady state
 #<swps>
 #<swps>   100 estimated iterations before pipelining
 #<swps>       Not unrolled before pipelining
 #<swps>     7 cycles per iteration
 #<swps>     2 flops        ( 14% of peak) (madds count as 2)
 #<swps>     1 flop         (  7% of peak) (madds count as 1)
 #<swps>     1 madd         ( 14% of peak)
 #<swps>     3 mem refs     ( 42% of peak)
 #<swps>     3 integer ops  ( 21% of peak)
 #<swps>     7 instructions ( 25% of peak)
 #<swps>     1 short trip threshold
 #<swps>     4 ireg registers used.
 #<swps>     3 fgr registers used.
 #<swps>
 #<swps>     3 min cycles required for resources
 #<swps>     6 cycles required for recurrence(s)
 #<swps>     3 operations in largest recurrence
 #<swps>     7 cycles in minimum schedule found
 #<swps>

(The flag -LNO:opt=0 is used to prevent certain cache optimzations from being performed which would obscure our current study of software pipelining; this flag will be described later.)

Since the compiler cannot determine the data dependencies between x and y, it plays it safe and produces a schedule which is more than twice as slow as ideal.

Now consider what happens when -OPT:alias=restrict is used:

 #<swps>
 #<swps> Pipelined loop line 8 steady state
 #<swps>
 #<swps>    50 estimated iterations before pipelining
 #<swps>     2 unrollings before pipelining
 #<swps>     6 cycles per 2 iterations
 #<swps>     4 flops        ( 33% of peak) (madds count as 2)
 #<swps>     2 flops        ( 16% of peak) (madds count as 1)
 #<swps>     2 madds        ( 33% of peak)
 #<swps>     6 mem refs     (100% of peak)
 #<swps>     3 integer ops  ( 25% of peak)
 #<swps>    11 instructions ( 45% of peak)
 #<swps>     2 short trip threshold
 #<swps>     7 ireg registers used.
 #<swps>     6 fgr registers used.
 #<swps>

This flag has informed the compiler that the x and y arrays are non-overlapping, so it can generate an ideal schedule. As long as this subroutine has been specified so that this assumption is correct, the code will be valid, and it will run significantly faster.

If the code uses pointers to pointers, -OPT:alias=restrict is not sufficient to tell the compiler that data dependencies do not exist. If this is the case, then -OPT:alias=disjoint should be used to allow the compiler to ignore potential data dependencies. (Note that -OPT:alias=disjoint is a superset of -OPT:alias=restrict.) As with -OPT:alias=restrict, if these pointers really can point to the same memory location, then the use of this flag can cause incorrect results.

C code using multi-dimensional arrays is a good candidate for the -OPT:alias=disjoint flag. Consider the following nest of loops:

  for ( i=1; i< size_x1-1 ;i++)
    for ( j=1; j< size_x2-1 ;j++)
       for ( k=1; k< size_x3-1 ;k++)
        {
          out3 = weight3*(    field_in[i-1][j-1][k-1]
                            + field_in[i+1][j-1][k-1]
                            + field_in[i-1][j+1][k-1]
                            + field_in[i+1][j+1][k-1]
                            + field_in[i-1][j-1][k+1]
                            + field_in[i+1][j-1][k+1]
                            + field_in[i-1][j+1][k+1]
                            + field_in[i+1][j+1][k+1] );
          out2 = weight2*(    field_in[i  ][j-1][k-1]
                            + field_in[i-1][j  ][k-1]
                            + field_in[i+1][j  ][k-1]
                            + field_in[i  ][j+1][k-1]
                            + field_in[i-1][j-1][k  ]
                            + field_in[i+1][j-1][k  ]
                            + field_in[i-1][j+1][k  ]
                            + field_in[i+1][j+1][k  ]
                            + field_in[i  ][j-1][k+1]
                            + field_in[i-1][j  ][k+1]
                            + field_in[i+1][j  ][k+1]
                            + field_in[i  ][j+1][k+1] );
          out1 = weight1*(    field_in[i  ][j  ][k-1]
                            + field_in[i  ][j  ][k+1]
                            + field_in[i  ][j+1][k  ]
                            + field_in[i+1][j  ][k  ]
                            + field_in[i-1][j  ][k  ]
                            + field_in[i  ][j-1][k  ] );
          out0 = weight0*     field_in[i  ][j  ][k  ];
          field_out[i][j][k] = out0+out1+out2+out3;
        }

Here, the three-dimensional arrays are declared as

  double***  field_out;
  double***  field_in;

Without input from the user, the compiler cannot tell if there are data dependencies between different elements of one array, let alone both. Due to the multiple levels of indirection, even -OPT:alias=restrict does not provide sufficient information. As a result, compiling with

% cc -n32 -mips4 -Ofast=ip27 -OPT:IEEE_arithmetic=3:alias=restrict \
        -c stencil.c

produces the following software pipeline report card:

 #<swps>
 #<swps> Pipelined loop line 77 steady state
 #<swps>
 #<swps>   100 estimated iterations before pipelining
 #<swps>       Not unrolled before pipelining
 #<swps>    42 cycles per iteration
 #<swps>    30 flops        ( 35% of peak) (madds count as 2)
 #<swps>    27 flops        ( 32% of peak) (madds count as 1)
 #<swps>     3 madds        (  7% of peak)
 #<swps>    42 mem refs     (100% of peak)
 #<swps>    16 integer ops  ( 19% of peak)
 #<swps>    85 instructions ( 50% of peak)
 #<swps>     1 short trip threshold
 #<swps>    20 ireg registers used.
 #<swps>    12 fgr registers used.
 #<swps>

However, we can instruct the compiler that even multiple indirections are independent with the -OPT:alias=disjoint flag. Compiling with this aliasing model instead of the restrict model produces the following results:

 #<swps>
 #<swps> Pipelined loop line 77 steady state
 #<swps>
 #<swps>   100 estimated iterations before pipelining
 #<swps>       Not unrolled before pipelining
 #<swps>    28 cycles per iteration
 #<swps>    30 flops        ( 53% of peak) (madds count as 2)
 #<swps>    27 flops        ( 48% of peak) (madds count as 1)
 #<swps>     3 madds        ( 10% of peak)
 #<swps>    28 mem refs     (100% of peak)
 #<swps>    12 integer ops  ( 21% of peak)
 #<swps>    67 instructions ( 59% of peak)
 #<swps>     2 short trip threshold
 #<swps>    14 ireg registers used.
 #<swps>    15 fgr registers used.
 #<swps>

This is a 50% improvement in performance, and all it took was a flag to tell the compiler that the pointers can't alias each other.

3.4.3.2 The ivdep Directive

The -OPT:alias=[restrict|disjoint] flag is very important for C programs. When the information it conveys is true, it should definitely be used to produce better performing object code. However, there are some situations when even this flag is insufficient to produce the best performing code. Consider the loop in the following C function:

void idaxpy(

int     n,
double  a,
double  *x,
double  *y,
int     *index)

{
  int   i;

  for (i=0; i<n; i++) {
    y[index[i]] += a * x[index[i]];
  }
}

This is the DAXPY operation we have seen before, except now the vector elements are chosen indirectly through an index array. If the vectors x and y are non-overlapping, we should compile the function with -OPT:alias=restrict. Doing so, and using the other recommended flags (plus -LNO:opt=0 to prevent cache optimzations from obscuring some of the details we are interested in here),

% cc -n32 -mips4 -Ofast=ip27 -OPT:IEEE_arithmetic=3:alias=restrict \
        -LNO:opt=0 -c idaxpy.c

we obtain the following less than optimal results:

 #<swps>
 #<swps> Pipelined loop line 13 steady state
 #<swps>
 #<swps>    50 estimated iterations before pipelining
 #<swps>     2 unrollings before pipelining
 #<swps>    10 cycles per 2 iterations
 #<swps>     4 flops        ( 20% of peak) (madds count as 2)
 #<swps>     2 flops        ( 10% of peak) (madds count as 1)
 #<swps>     2 madds        ( 20% of peak)
 #<swps>     8 mem refs     ( 80% of peak)
 #<swps>     8 integer ops  ( 40% of peak)
 #<swps>    18 instructions ( 45% of peak)
 #<swps>     1 short trip threshold
 #<swps>    10 ireg registers used.
 #<swps>     4 fgr registers used.
 #<swps>
 #<swps>     8 min cycles required for resources
 #<swps>    10 cycles in minimum schedule found
 #<swps>

(When run at optimal speed, this loop should be load-store bound. In that case, the memory references should be at 100% of peak.)

The problem is that since the values of the index array are unknown, there could be data dependencies between loop iterations for the vector y. For example, if index[i] = 1 for all i, all iterations would update the same vector element. Note that this type of inter-interation dependency could just as easily arise in a Fortran routine:

      do i = 1, n
         y(index(i)) = y(index(i)) + a * x(index(i))
      enddo

If you know that the values of the array index are such that no dependencies will occur, you can inform the compiler of this via the ivdep compiler directive. For a Fortran program, the directive looks like

 c$dir ivdep
      do i = 1, n
         y(ind(i)) = y(ind(i)) + a * x(ind(i))
      end

In C, the directive is called a pragma, and its use is as follows:

#pragma ivdep
  for (i=0; i<n; i++) {
    y[index[i]] += a * x[index[i]];
  }

When this modified version of the program is compiled as before, the schedule is improved by 25%:

 #<swps>
 #<swps> Pipelined loop line 14 steady state
 #<swps>
 #<swps>    50 estimated iterations before pipelining
 #<swps>     2 unrollings before pipelining
 #<swps>     8 cycles per 2 iterations
 #<swps>     4 flops        ( 25% of peak) (madds count as 2)
 #<swps>     2 flops        ( 12% of peak) (madds count as 1)
 #<swps>     2 madds        ( 25% of peak)
 #<swps>     8 mem refs     (100% of peak)
 #<swps>     8 integer ops  ( 50% of peak)
 #<swps>    18 instructions ( 56% of peak)
 #<swps>     2 short trip threshold
 #<swps>    14 ireg registers used.
 #<swps>     4 fgr registers used.
 #<swps>

What the ivdep pragma has done is give the compiler permission to ignore the dependencies which are possible from one iteration to the next. As long as the dependencies don't actually occur, correct code will be generated. But if an ivdep directive is used when there really are dependencies, the compiler will generated incorrect code. Just as when using the -OPT:alias=[restrict|disjoint] flag, this would be a case of user error.

There are a few subtleties in how the ivdep directive works; we explain these now. First, we note that the directive applies only to inner loops. Second, it applies only to dependencies which are carried by the loop. Consider the following code segment:

Two types of dependencies are possible in this loop. If index(1,i) = index(1,i+k) or index(2,i) = index(2,i+k) for some value of k such that i+k is in the range 1 to n, then we say there is a loop-carried dependency. If index(1,i) = index(2,i) for some value of i, then there is a non-loop-carried dependency. What we saw above in the idaxpy example was a potential loop-carried dependency. The ivdep directive is used to inform the compiler that the dependency is not actually present since we know that the values of the index array are never equal. But if the compiler detects potential non-loop-carried dependencies, the ivdep directive will not instruct the compiler to ignore these potential dependencies. To do this, the behavior of the ivdep directive needs to be modified via the -OPT:liberal_ivdep=TRUE flag. When this flag is used, an ivdep directive means to ignore all dependencies in a loop.

With either variant of ivdep, the compiler will issue a warning if the directive has instructed it to break an obvious dependence. An example of this is the following:

  for (i=0; i<n; i++) {
     a[i] = a[i-1] + 3;
  }

In this loop the dependence is real, so ignoring it will almost certainly mean producing incorrect results. Nevertheless, an ivdep pragma will break the dependence, and the compiler will issue a warning and then schedule the code so that the iterations are carried out independently.

One more variant of ivdep exists and it is enabled with the flag -OPT:cray_ivdep=TRUE. This instructs the compiler to use the semantics that Cray Research specified when it originated this directive for its vectorizing compilers. That is, only "lexically backwards" loop-carried dependencies are broken. The idaxpy examples above is an instance of a lexically backwards dependence. The example below is not:

  for (i=0; i<=n; i++) {
     a[i] = a[i+1] + 3;
  }

Here, the dependence is from the load (a[i+1]) to the store (a[i]), and the load comes lexically before the store. In this case, a Cray semantics ivdep directive would not break the dependence.

3.4.3.3 C Loops

As was seen above in the sections on the -OPT:alias=restrict and -OPT:alias=disjoint flags and the ivdep directive, C code presents the compiler with some data dependency challenges. The extra information these provide allows one to achieve good performance from C despite the language semantics. But there are a few other things that can be done to improve the speed of programs written in C.

Global variables can sometimes cause performance problems since it is difficult for the compiler to determine whether they alias other variables. If loops involving global variables do not software pipeline as efficiently as they should, making local copies of the global variables will often cure the problem.

Another problem arises from using dereferenced pointers in loop tests. For example, the loop

  for (i=0; i<(*n); i++) {
    y[i] += a * x[i];
  }

will not software pipeline since the compiler cannot tell whether *n will change during the loop. If it is known that this value is loop-invariant, one can inform the compiler of this by creating a temporary scalar variable to use in the loop test:

  int   m = *n;

  for (i=0; i<m; i++) {
    y[i] += a * x[i];
  }

This loop will now software pipeline (and if -OPT:alias=[restrict|disjoint] is used, it will achieve optimal performance).

The following example shows a different case of loop-invariant pointers leading to aliasing problems:

void
sub1(

double  **dInputToHidden,
double  *HiddenDelta,
double  *Input,
double  **InputToHidden,
double  alpha,
double  eta,
int     NumberofInputs,
int     NumberofHidden)

{
  int   i;
  int   j;

  for (i=0; i<NumberofInputs; i++) {
    for (j=0; j<NumberofHidden; j++) {
      dInputToHidden[i][j] = alpha * dInputToHidden[i][j] +
                             eta * HiddenDelta[j] * Input[i];
      InputToHidden[i][j] += dInputToHidden[i][j];
    }
  }
}

This loop is similar to a DAXPY: it should be memory bound with about a 50% floating point utilization. But compiling with -OPT:alias=disjoint is not quite sufficient to achieve this level of performance:

 #<swps>
 #<swps> Pipelined loop line 19 steady state
 #<swps>
 #<swps>   100 estimated iterations before pipelining
 #<swps>       Not unrolled before pipelining
 #<swps>     7 cycles per iteration
 #<swps>     5 flops        ( 35% of peak) (madds count as 2)
 #<swps>     4 flops        ( 28% of peak) (madds count as 1)
 #<swps>     1 madd         ( 14% of peak)
 #<swps>     7 mem refs     (100% of peak)
 #<swps>     6 integer ops  ( 42% of peak)
 #<swps>    17 instructions ( 60% of peak)
 #<swps>     3 short trip threshold
 #<swps>    16 ireg registers used.
 #<swps>     8 fgr registers used.
 #<swps>

The problem is the pointers indexed by i: they are causing the compiler to see dependencies which aren't there. This problem can be fixed hoisting the invariants out of the inner loop:

void
sub2(

double  **dInputToHidden,
double  *HiddenDelta,
double  *Input,
double  **InputToHidden,
double  alpha,
double  eta,
int     NumberofInputs,
int     NumberofHidden)

{
  int   i;
  int   j;
  double        *dInputToHiddeni, *InputToHiddeni, Inputi;

  for (i=0; i<NumberofInputs; i++) {

    /* Hoisting loop invariants sometimes helps software pipelining.
     */
    dInputToHiddeni = dInputToHidden[i];
    InputToHiddeni = InputToHidden[i];
    Inputi = Input[i];

    for (j=0; j<NumberofHidden; j++) {
      dInputToHiddeni[j] = alpha * dInputToHiddeni[j] +
                             eta * HiddenDelta[j] * Inputi;
      InputToHiddeni[j] += dInputToHiddeni[j];
    }
  }
}

Written this way, the compiler can see that dInputToHidden[i], InputToHidden[i] and Input[i] will not change during execution of the j-loop, so it can schedule that loop more aggressively. This version, compiled with

% cc -n32 -mips4 -Ofast=ip27 -OPT:IEEE_arithmetic=3:alias=disjoint \
        -LNO:opt=0 -c cloop.c

produces

 #<swps>
 #<swps> Pipelined loop line 52 steady state
 #<swps>
 #<swps>   100 estimated iterations before pipelining
 #<swps>       Not unrolled before pipelining
 #<swps>     5 cycles per iteration
 #<swps>     5 flops        ( 50% of peak) (madds count as 2)
 #<swps>     4 flops        ( 40% of peak) (madds count as 1)
 #<swps>     1 madd         ( 20% of peak)
 #<swps>     5 mem refs     (100% of peak)
 #<swps>     4 integer ops  ( 40% of peak)
 #<swps>    13 instructions ( 65% of peak)
 #<swps>     3 short trip threshold
 #<swps>     8 ireg registers used.
 #<swps>    10 fgr registers used.

(The flag -LNO:opt=0 is used to prevent certain cache optimzations from being performed which would obscure our current study of software pipelining; this flag will be described later.)

3.4.3.4 `-TENV:X=n`

We saw earlier with the ivdep directive that if extra information is conveyed to the compiler, it can generate faster code. A useful bit of information which the compiler can sometimes exploit is: which exceptions may be ignored. An exception is an event which will require intervention by the operating system. Examples are floating point exceptions such as divide by 0, or a segmentation fault, which is a memory exception caused by an invalid memory reference. Any program can choose which exceptions it will pay attention to and which it will ignore. For example, floating point underflows will generate exceptions, but since their effect on the numerical results is usually negligible, they are typically ignored by flushing them to 0 (via hardware). Divide by 0 is generally enabled since it usually indicates a fatal error in the program.

How the compiler can take advantage of disabled exceptions is best seen through an example. Consider a program which is known to not make invalid memory references. It will run fine whether memory exceptions are enabled or disabled. But if they are disabled and the compiler knows this, it can generate code which may speculatively make references to potentially invalid memory locations, e.g., reading beyond the end of an array. Such code can run faster than non-speculative code on a superscalar architecture if it takes advantage of functional units which would otherwise be idle. In addition, the compiler can often reduce the amount of software pipeline winddown code when it can make such speculative memory operations; this allows more of the loop iterations to be spent in the steady state code, leading to greater performance.

Speculative code generated by the compiler results in instructions moving in front of a branch which previously had been protecting them from causing exceptions. For example, accesses to memory elements beyond the end of an array are normally protected by the test at the end of the loop. Speculative code can be even riskier. For instance, a code sequence such as

      if (x .ne. 0.0) b = 1.0/x

might be changed into

      tmp = 1.0/x
      if (x .ne. 0.0) b = tmp

Thus, division by 0 is no longer protected by the branch. Allowing such speculation is the only way to get some loops to software pipeline. On an in-order superscalar processor such as the R8000, speculation is critical to achieving top performance on loops containing branches. To allow this speculation to be done more freely by the compiler, most exceptions are disabled by default on R8000-based machines.

For the R10000, however, speculation is also done in the hardware. Instructions can be executed out of order, and branches are speculatively executed automatically. Any exceptions encountered via hardware speculation which would not normally occur in the program are dismissed without operating system intervention. Thus, most exceptions are enabled by default in R10000-based systems. Due to the hardware speculation in the R10000, software speculation by the compiler is less critical from a performance point of view. Moreover, it can cause problems since exceptions encountered on a code path generated by the compiler will not simply be dismissed. In the best case, the operating system will need to intervene; in the worst case, such operations can lead to NaNs entering the calculation stream or the program aborting.

For these reasons, when generating code for the R10000, the compiler by default only performs the safest speculative operations. Which operations it will speculate on are controlled by the -TENV:X=n flag, where n has the following meanings as described by the f77(1) man page:

-TENV:X=0..4
    Specify the level of enabled exceptions that will be assumed
    for purposes of performing speculative code motion (default
    level 1 at -O0..-O2, 2 at -O3).  In general, an instruction
    will not be speculated (i.e. moved above a branch by the
    optimizer) unless any exceptions it might cause are disabled
    by this option.  At level 0, no speculative code motion may
    be performed.  At level 1, safe speculative code motion may
    be performed, with IEEE-754 underflow and inexact exceptions
    disabled.  At level 2, all IEEE-754 exceptions are disabled
    except divide by zero.  At level 3, all IEEE-754 exceptions
    are disabled including divide by zero.  At level 4, memory
    exceptions may be disabled or ignored.

    NOTE: At levels above the default level 1, various hardware
    exceptions which are normally useful for debugging, or which
    are trapped and repaired by the hardware, may be disabled or
    ignored, with the potential of hiding obscure bugs.  The
    program should not explicitly manipulate the IEEE floating-
    point trap-enable flags in the hardware if this option is
    used.

The default is -TENV:X=1 for optimization levels of -O2 or below, and -TENV:X=2 for -O3 or -Ofast. Programs compiled with higher values of n can suffer from excessive numbers of exceptions as a result of the software speculation. In many instances, the only negative effect this will have is on performance, where it shows up as unusually high system time. In other cases, such as when divides by zero are performed, NaNs and Infinities can propogate through the calculations ruining the results. The solution to both these problems is to recompile with a lower value of n.

Although higher than default -TENV:X values can provide a modest increase in performance for programs which one has verified do not generate exceptions, we in general recommended using the default values.

Important Note: If you have compiled the program with a higher than default value for the -TENV:X flag, you should not use libfpe to enable any exceptions which the compiler now assumes are not enabled. This could very easily result in the program generating mysterious floating point exceptions. For example, if you compile with -TENV:X=3, the compiler assumes that floating point divide-by-zero exceptions are turned off. As a result, it may replace an innocent construct such as

if (x .ne. 0.0) b = 1.0/x

with the following speculative code

tmp = 1.0/x
if (x .ne. 0.0) b = tmp

which may divide by zero. If the divide-by-zero exception is disabled, this speculative code cannot cause problems since if tmp is ever infinity, it is not stored in b. However, if libfpe has been used to enable the divide-by-zero exception, your program will abort if the calculation of tmp divides by zero, and it could be pretty painful to figure out what happened.

3.4.4 Inter-Procedural Analysis

Most compiler optimizations work within a single procedure (function, subroutine, etc.) at a time. This helps keep the problems manageable, and is a key aspect of supporting separate compilation, because it allows the compiler to restrict attention to the current source file.

However, this intra-procedural focus also presents serious restrictions. By avoiding dependence on information from other procedures, an optimizer is forced to make worst-case assumptions about the possible effects of those procedures. For instance, at boundary points, including all procedure calls, it must typically save (and/or restore) the state of all variables to (from) memory.

By contrast, inter-procedural analysis (IPA) algorithms analyze more than a single procedure --- preferably the entire program --- at once. As a result, they perform optimizations across procedure boundaries, and these results are provided to subsequent optimization phases (such as loop nest optimizations or software pipelining) so that their optimizations benefit.

The optimizations performed by the MIPSpro 7.x compilers's IPA facility include:

Inlining: Calls to a procedure are replaced inline by a suitably modified copy of the called procedure's body, even if the callee is in a different source file.
Common block array padding: Global arrays in Fortran may be padded, i.e. their size increased in order to reduce cache conflicts.
Constant propagation: Formal parameters which always have a particular constant value can be replaced by the constant, allowing additional optimization. Global variables which are initialized to constant values and never modified can be replaced by the constant.
Dead function elimination: Functions which are never called can be removed from the program image, improving memory utilization.
Dead variable elimination: Variables which are never actually used can be eliminated, along with any code that initializes them.
Global name optimizations: Global names in shared code must normally be referenced via addresses in a global table, in case they are defined or preempted by another DSO (see dso(5)). If the compiler knows it is compiling a main program and that the name is defined in another of the source files comprising the main program, an absolute reference can be substituted, eliminating a memory reference.
Automatic -G number: IPA can optimize the data items which can be referenced by simple offsets from the GP register instead of depending on the user to provide an ideal value of the -G option (see cc(1) or f77(1)).

IPA is easy to use. All you need do is use an optimization level of -O2 or -O3 and add -IPA to both the compile and link steps when building your program:

% cc -IPA ... -c *.c
% cc -IPA -o program *.o

Alternatively, you may use -Ofast for both compile and link steps since this optimization flag enables IPA.

If there are modules which you would prefer to compile using an optimization level of -O1 or -O0, that's fine. They will link into the program with no problems, but no inter-procedural analysis will be performed on them.

IPA works by postponing much of the compilation process until the link step, when all of the program components can be analyzed together. Schematically, this looks like the following:

The compile step does initial processing of the source file, placing an intermediate representation of the procedures it contains into the output .o file instead of normal relocatable code. Such object files are called WHIRL objects to distinguish them from normal relocatable object files. This processing is really two phases: the normal front end language processing (fe), plus an IPA summary phase which collects local information which will be used by IPA later. Since most back end (be) options (i.e., any of the standard optimization flags, -O3, -OPT:IEEE_arithmetic=3, etc.) are transmitted via the IPA summary phase, they must be present on the command line for the compile step.

The link step, although it is still invoked by a single ld(1) command, becomes several steps. First, the linker invokes IPA, which analyzes and transforms all of the input WHIRL objects together, writing modified versions of the objects. Then it invokes the compiler back end on each of the modified objects, producing a normal relocatable object. Finally, it invokes the linker again for a normal linkage step, producing an executable program or DSO.

Unlike other IPA implementations, the MIPSpro compiler does not require that all of the components of a program be subjected to IPA analysis. The ld(1) command which invokes IPA may include object files (.o) and archives (.a) which have been compiled normally without IPA analysis, as well as referencing DSOs which, however they were compiled, cannot contribute detailed information because the DSO may be replaced with a different implementation after the program is built.

The temporary files created during the expanded link step are all created in a temporary subdirectory of the output program's directory; like other temporary files produced by the compilation process, they are normally removed on completion (unless the -keep option is specified).

When using IPA, the proportion of time spent in compiling and linking will change. You will typically find that the compile step is faster with IPA than without; this is because the majority of the optimizations have been postponed until the link step. Conversely, the link step will take longer than when IPA is not used since it is now responsible for most of the compilation work. In addition, the total build time is likely to increase for two reasons: First, although the IPA step may not be very expensive itself, it usually increases the size of the code by inlining, and therefore expands the time required by the rest of the compiler. More importantly, since IPA analysis can propagate information from one module into optimization decisions in arbitrary other modules, even minor changes to a single component module will cause most of the program compilation to be redone. As a result, IPA is best suited for use on programs with a stable source base. (To avoid using IPA during code development, and hence save some re-compilation time, in the recommended flags replace -Ofast with -O3 -r10000 -OPT:roundoff=3, and for code with pointers, also add the appropriate aliasing flag: -OPT:alias=[typed|restrict|disjoint].)

3.4.4.1 Inlining

One of the key features of IPA is inlining. This is the replacement of a call to a subroutine or function with the code from that routine. This can improve performance in several ways:

By eliminating the subroutine call, the overhead in making the call (e.g., saving and restoring registers) is also eliminated.
A larger amount of code is exposed to later optimizatio phases of the compiler. So, for example, loops which previously would not software pipeline now can.
It eliminates aliases and resolves otherwise unknown side effects.

But there are some disadvantages to inlining:

Inlining expands the size of the program code, and so increases compilation time.
Increasing the context that various optimization phases look at puts a greater demand on the processor's registers and makes register allocation harder.
The resulting object text is larger.

Thus, inlining is does not always improve program execution time. A a result, there are flags you can use to control which routines will be inlined.

There are also some restrictions to inlining. The following types of routines cannot be inlined:

Recursive routines.
Routines using static (SAVE in Fortran) local variables.
Routines with variable argument lists (varargs).
Calls in which the caller and callee have mismatched argument lists.

IPA supports two types of inlining: manual and automatic. The manual (also known as standalone) inliner is invoked with the -INLINE option group. It has been provided to support the C++ inline keyword, but can also be used for C and Fortran programs using the command line option. It only inlines the routines which you specify, and they must be located in the same source file as the routine into which they are inlined. The automatic inliner is invoked whenever -IPA (or -Ofast) is used. It uses heuristics to decide which routines will be beneficial to inline, so you don't need to specify these explicitly. It can inline routines from any files which have been compiled with -IPA. We'll now look at how both inliners operate.

The manual inliner is controlled with the -INLINE option group. You need to use a combination of options to specify which routines to inline, and there are 4 options for doing so:

-INLINE:all: This option inlines all routines except for those which have been specified to never be inlined.

-INLINE:none: This option specifies that no routines be inlined except those you have specified must be inlined.

-INLINE:must=<name>[,<name2> ...]: This option specifies that the listed routines be inlined.

-INLINE:never=<name>[,<name2> ...]: This option specifies that the listed routines not be inlined.

As an example, consider the following code:

      subroutine zfftp ( z, n, w, s )
      real*8 z(0:2*n-1), w(2*(n-1)), s(0:2*n-1)
      integer n
c
      integer iw, i, m, l, j, k
      real*8 w1r, w1i, w2r, w2i, w3r, w3i
c
      iw = 1
c
c     First iteration of outermost loop over l&m: n = 4*lm.
c
      i = 0
      m = 1
      l = n/4
      if (l .gt. 1) then
         do j = 0, l-1
            call load4(w(iw),iw,w1r,w1i,w2r,w2i,w3r,w3i)
            call radix4pa(z(2*(j+0*l)),z(2*(j+1*l)),
     &                    z(2*(j+2*l)),z(2*(j+3*l)),
     &                    s(2*(4*j+0)),s(2*(4*j+1)),
     &                    s(2*(4*j+2)),s(2*(4*j+3)),
     &                    w1r,w1i,w2r,w2i,w3r,w3i)
         enddo
         i = i + 1
         m = m * 4
         l = l / 4
      endif
c
c       .
c       .
c       .
c
      return
      end

The loop (do j ...) contains two subroutine calls. These subroutines contain the following code:

c
      subroutine load4 ( w, iw, w1r, w1i, w2r, w2i, w3r, w3i )
c
      real*8 w(0:5), w1r, w1i, w2r, w2i, w3r, w3i
      integer iw
c
      w1r = w(0)
      w1i = w(1)
      w2r = w(2)
      w2i = w(3)
      w3r = w(4)
      w3i = w(5)
c
      iw = iw + 6
c
      return
      end
c
c------------------------------------------------------------------------------
c
      subroutine radix4pa ( c0, c1, c2, c3, y0, y1, y2, y3,
     &                      w1r, w1i, w2r, w2i, w3r, w3i )
c
      real*8 c0(0:1), c1(0:1), c2(0:1), c3(0:1)
      real*8 y0(0:1), y1(0:1), y2(0:1), y3(0:1)
      real*8 w1r, w1i, w2r, w2i, w3r, w3i
c
      real*8 d0r, d0i, d1r, d1i, d2r, d2i, d3r, d3i
      real*8 t1r, t1i, t2r, t2i, t3r, t3i
c
      d0r = c0(0) + c2(0)
      d0i = c0(1) + c2(1)
      d1r = c0(0) - c2(0)
      d1i = c0(1) - c2(1)
      d2r = c1(0) + c3(0)
      d2i = c1(1) + c3(1)
      d3r = c3(1) - c1(1)
      d3i = c1(0) - c3(0)

      t1r = d1r + d3r
      t1i = d1i + d3i
      t2r = d0r - d2r
      t2i = d0i - d2i
      t3r = d1r - d3r
      t3i = d1i - d3i

      y0(0) = d0r + d2r
      y0(1) = d0i + d2i
      y1(0) = w1r*t1r - w1i*t1i
      y1(1) = w1r*t1i + w1i*t1r
      y2(0) = w2r*t2r - w2i*t2i
      y2(1) = w2r*t2i + w2i*t2r
      y3(0) = w3r*t3r - w3i*t3i
      y3(1) = w3r*t3i + w3i*t3r

      return
      end

Both subroutines are straight-line code; inlining them will reduce function call overhead; more importantly, inlining will allow the j-loop to software pipeline.

If all the code is contained in the file fft.f, we can inline the calls with the following compilation command:

% f77 -n32 -mips4 -O3 -r10000 -OPT:IEEE_arithmetic=3:roundoff=3 \
        -INLINE:must=load4,radix4pa -flist -c fft.f

(Here, we have used -O3 -r10000 -OPT:roundoff=3 instead of -Ofast since we don't want -IPA to automatically inline the subroutines!) The -flist flag is used so that the compiler will write out a Fortran file showing what transformations the compiler has done. This file is named fft.w2f.f (where w2f stands for WHIRL-to-Fortran). One can confirm that the subroutines were inlined by checking the w2f file. It contains the code

        l = (n / 4)
        IF(l .GT. 1) THEN
          DO j = 0, l + -1, 1
            w1i = w((j * 6) + 2)
            w2r = w((j * 6) + 3)
            w2i = w((j * 6) + 4)
            w3r = w((j * 6) + 5)
            w3i = w((j * 6) + 6)
            d0r = (z((j * 2) + 1) + z((((l * 2) + j) * 2) + 1))
            d0i = (z((j * 2) + 2) + z((((l * 2) + j) * 2) + 2))
            d1r = (z((j * 2) + 1) - z((((l * 2) + j) * 2) + 1))
            d1i = (z((j * 2) + 2) - z((((l * 2) + j) * 2) + 2))
            d2r = (z(((l + j) * 2) + 1) + z((((l * 3) + j) * 2) + 1))
            d2i = (z(((l + j) * 2) + 2) + z((((l * 3) + j) * 2) + 2))
            d3r = (z((((l * 3) + j) * 2) + 2) - z(((l + j) * 2) + 2))
            d3i = (z(((l + j) * 2) + 1) - z((((l * 3) + j) * 2) + 1))
            t1r = (d1r + d3r)
            t1i = (d1i + d3i)
            t2r = (d0r - d2r)
            t2i = (d0i - d2i)
            t3r = (d1r - d3r)
            t3i = (d1i - d3i)
            s((j * 8) + 1) = (d0r + d2r)
            s((j * 8) + 2) = (d0i + d2i)
            s((j * 8) + 3) = ((w((j * 6) + 1) * t1r) -(w1i * t1i))
            s((j * 8) + 4) = ((w((j * 6) + 1) * t1i) +(w1i * t1r))
            s((j * 8) + 5) = ((w2r * t2r) -(w2i * t2i))
            s((j * 8) + 6) = ((w2r * t2i) +(w2i * t2r))
            s((j * 8) + 7) = ((w3r * t3r) -(w3i * t3i))
            s((j * 8) + 8) = ((w3r * t3i) +(w3i * t3r))
          END DO
        ENDIF
        RETURN

which shows that the two calls were inlined into the loop.

There are two other ways to confirm that the routines have been inlined:

During compilation, you can use the option -INLINE:list to instruct the compiler to print out the name of each routine as it is being inlined.
You can check the performance to verify that the loop software pipelines (you cannot check the assembler file since the -S turns off IPA).

Now let's look at IPA's automatic inlining. As the name implies, this is automatic. The only constraints are that the compiler will consider inlining to no longer be beneficial if it increases the size of the program by 100% or more, or if it the extra code causes the -Olimit to be exceeded. (This is the maximum size, in basic blocks, of a routine that the compiler will optimize. Generally, one does not worry about this limit since if it is exceeded, the compiler will print a warning mesage instructing you to to recompile using the flag -Olimit=nnnn. It is described in the reference pages for cc(1) and f77(1).)

If one of these limits is exceeded, IPA will restrict the inlining rather than simply not doing any at all. It decides which routines will remain inlined based on their location in the call hierarchy. Routines at the lowest depth in the call graph are inlined first, followed by routines higher up in the hierarchy. This is illustrated below:

Here, the call tree for a small program is shown on the left. At the lowest depth, n=0, leaf routines (i.e., those which do not call any other routines) are inlined; in this case, the routine E is inlined into routines D and C. At the next level up, n=1, D is inlined into B and C is inlined into A. At the n=2 level, which is the top level for this program, B is inlined into A. IPA will stop automatic inlining at the level which causes the program size to double, or for which the -Olimit is exceeded.

You can increase or decrease the amount of automatic inlining via three -IPA options:

-IPA:space=n: This flag specifies the largest acceptable per centage increase in code size that may result from inlining. The default value is n=100.
-IPA:forcedepth=n: This flag speficies that all routines at depth <= n in the call graph must be inlined regardless of the space limit.
-IPA:inline=off turns off automatic inlining alltogether.

IPA's automatic inlining will inline the calls to load4 and radix4pa in the example above. But since most of IPA's actions are not performed until the link stage, you won't get a w2f file with which to verify this until the link has been completed:

% f77 -n32 -mips4 -Ofast -OPT:IEEE_arithmetic=3 -flist -c fft.f
% f77 -n32 -mips4 -Ofast -OPT:IEEE_arithmetic=3 -flist -o fft fft_main.f fft.o
//usr/lib32/cmplrs/be translates fft.I into fft.w2f.f, based on source fft.I

--- fft_main.I ---
//usr/lib32/cmplrs/be translates fft_main.I into fft_main.w2f.f, based on source fft_main.I
% f77 -n32 -mips4 -Ofast -OPT:IEEE_arithmetic=3 -flist -c fft2.f
% f77 -n32 -mips4 -Ofast -OPT:IEEE_arithmetic=3 -flist -o fft2 fft2.o
//usr/lib32/cmplrs/be translates fft2.I into fft2.w2f.f, based on source fft2.I

Here, fft_main.f is just a simple main routine which calls zfftp once with the fixed value of n=1024 and then prints out some results (so that the compiler doesn't optimize away the code!) In this case, not only are load4 and radix4p inlined into zfftp, but zfftp is also inlined into the main routine.

If for some reason we wanted to prevent zfftp from being inlined into the main routine, the -IPA:space and -IPA:forcedepth flags won't be able to help us here. All the subroutine calls in this tiny example are made just once, so the code is not expanded in size by the inlining. Thus, restricting the space expansion doesn't accomplish what we want. However, the -INLINE:must and -INLINE:never flags work in conjunction with the automatic inlining. So to prevent zfftp from being inlined, we can use the flag -INLINE:never=zfftp:

% f77 -n32 -mips4 -Ofast -OPT:IEEE_arithmetic=3 -flist \
        -INLINE:never=zfftp -o fft fft_main.f fft.o

--- fft.I ---
//usr/lib32/cmplrs/be translates fft.I into fft.w2f.f, based on source fft.I

--- fft_main.I ---
//usr/lib32/cmplrs/be translates fft_main.I into fft_main.w2f.f, based on source fft_main.I

Now, fft.w2f.f is similar to what was obtained above using the standalone inliner, but due to the constant propogation performed by IPA, it is somewhat streamlined:

        n = 1024
        DO j = 0, 256 + -1, 1
          w1i = w((j * 6) + 2)
          w2r = w((j * 6) + 3)
          w2i = w((j * 6) + 4)
          w3r = w((j * 6) + 5)
          w3i = w((j * 6) + 6)
          d0r = (z((j * 2) + 1) + z((j * 2) + 1025))
          d0i = (z((j * 2) + 2) + z((j * 2) + 1026))
          d1r = (z((j * 2) + 1) - z((j * 2) + 1025))
          d1i = (z((j * 2) + 2) - z((j * 2) + 1026))
          d2r = (z((j * 2) + 513) + z((j * 2) + 1537))
          d2i = (z((j * 2) + 514) + z((j * 2) + 1538))
          d3r = (z((j * 2) + 1538) - z((j * 2) + 514))
          d3i = (z((j * 2) + 513) - z((j * 2) + 1537))
          t1r = (d1r + d3r)
          t1i = (d1i + d3i)
          t2r = (d0r - d2r)
          t2i = (d0i - d2i)
          t3r = (d1r - d3r)
          t3i = (d1i - d3i)
          s((j * 8) + 1) = (d0r + d2r)
          s((j * 8) + 2) = (d0i + d2i)
          s((j * 8) + 3) = ((w((j * 6) + 1) * t1r) -(w1i * t1i))
          s((j * 8) + 4) = ((w((j * 6) + 1) * t1i) +(w1i * t1r))
          s((j * 8) + 5) = ((w2r * t2r) -(w2i * t2i))
          s((j * 8) + 6) = ((w2r * t2i) +(w2i * t2r))
          s((j * 8) + 7) = ((w3r * t3r) -(w3i * t3i))
          s((j * 8) + 8) = ((w3r * t3i) +(w3i * t3r))
        END DO
        RETURN
        END ! zfftp

3.4.4.2 IPA Programming Hints

The -IPA option, whether listed explicitly on compile and link lines, or provided implicitly via -Ofast, is designed to work fairly automatically. Nevertheless, there are a few things you can do to enhance inter-procedural analysis.

Since IPA needs to know which routines are being called in order to do its analysis, programming constructs which obscure them will limit IPA. Function pointers, virtual functions and C++ exception handling are all examples of things which will limit inter-procedural analysis.

In general, the non-inlining optimizations that IPA performs should always benfit runtime performance. On the other hand, while inlining can often significantly improve performance, it can also hurt it, so it may pay to experiment with the flags available for controlling automatic inlining.

3.5 Step 5: The Cache Hierarchy and Loop Nest Optimizations

For a cache-based system such as the Origin2000, the optimizations which have the greatest potential for significant performance gains are those which improve the program's utilization of the cache hierarchy. In the MIPSpro 7.x compilers, this class of optimzations is known as loop nest optimizations, and they are performed by the loop nest optimizer, or LNO. This part of the compiler is capable of code transformations which can provide startling improvements in performance. Typically, the transformations the LNO does are confined to relatively localized sections of code --- loop nests --- but when combined with IPA, these optimizations can cross subroutine boundaries.

Naturally, some performance problems are symptomatic of poor algorithm design and can only be remedied by a programmer with a knowledge of the way caches work. So the LNO is not capable of solving all cache problems. Nevertheless, it can solve many, and a programmer who is aware of what it is capable of doing can use it as a tool to help in implementing more sophistocated cache optimzations.

In the sections which follow, we will describe the capabilities of the loop nest optimizer. In order to provide a basic understanding of the optimizations the LNO carries out, we will start with a description of caches and how they work, followed by some general programming practices which ensure their efficient use. We will then describe how you can use the profiling tools to identify cache performance problems, and what programming techiniques can be used to solve these problems. With these fundamentals established, we will then walk through the features of the LNO, showing how each optimzation may be enabled or disabled to suit particular circumstances. Finally, we will present a detailed tuning example in which the LNO can be used to effect some sophisticated cache optimzations which cannot be done automatically.

But before we jump off into the details, we should point out that if you are using the flags recommended above, you are already using the LNO since it is enabled whenever the highest level of optimzation, -O3 or -Ofast, is used. For the majority of programs and users, this is a great benefit since the LNO is capable of automatically solving many cache use problems. Thus, even if you read no further, you have already gained from many of the optimzations we are about to describe. That said, we would still like to encourage you to continue reading and learn the material in this section. While in general they are beneficial, not all loop nest optimizations improve program performance all the time, so there will be instances in which you can improve program performance by turning off some or all of the LNO options. But note: even though we have stressed the effect of the loop nest optimizations on cache efficiency, the LNO is not independent of other parts of the compiler, and many of the optimizations it performs are not simply for improving cache behavior. For example, software pipelining benefits from several of the LNO transformations, so you'll need to consider some tradeoffs when fine-tuning these options.

3.5.1 Cache Basics

In order to understand why cache utilization problems occur, it is important to understand how the data caches in the R10000 function. As we saw earlier, the R10000 utilizes a 2-level cache hierarchy: there is a 32 KB primary cache and a 1 or 4 MB secondary cache. Both caches are 2-way set associative employing a least-recently-used (LRU) replacement strategy. Thus, any data element may reside in only two possible locations in either cache as determined by the low-order bits of its address. If two data are accessed which have the same low-order bits, copies of both can be maintained in the cache simultaneously. When a third datum with the same low-order bits in its address is accessed, it will take the place of one of the two previous data, the one which was least recently accessed.

In order to maintain a high bandwidth between the caches and memory, each location in the cache refers to a contiguous block, or line, of addresses. For the primary cache, the line size is 32 bytes; it is 128 bytes for the secondary cache. Thus, when a particular datum is copied into a cache, several nearby data are also copied into the cache.

A datum is copied into the cache when a reference is made to it and it is not already in the cache; this is called a cache miss. If the datum already is in the cache, this is a cache hit, and the value does not need to be copied in again. Programs sustain high performance by reusing data already resident in the caches since the latency --- i.e., the time to access data --- is far less for cache-resident data than for data which only reside in main memory. As we saw earlier, the latencies to the different parts of the memory hierarchy are:

registers: 0 cycles, where a cycle is either 1/195 MHz or 1/180 MHz, about 5 nanosecnds
first level cache: 2 or 3 clock cycles
second level cache: 8--10 clock cycles
main memory: ranges from 300--1100 nanseconds, depending on the number of router hops to the memory in which the data are stored

Thus, main memory accesses can take 60 to 200+ cycles, an order of magnitude longer than cache hits.

The processor can only make use of data in registers, and it can only load data into registers from the the primary cache. So all data must be brought into the primary cache before they can be used in calculations. The primary cache can only obtain data from the secondary cache, and all data in the primary cache must simultaneously be resident in the secondary cache. The secondary cache obtains its data from main memory, of which they are a proper subset.

Since there is a big difference in latency to different parts of the memory hierarchy, the system has hardware support to allow the latency to be overlapped with other work. Part of this support comes from the out-of-order execution capability of the R10000. This is sufficient to hide the latency from the first level cache, and usually from the second level cache as well. But to overlap the main memory latency, something more is required: data prefetch.

Prefetch instructions are part of the MIPS IV instruction set architecture (mips(4)). They allow data to be moved into cache prior to their use. If the data are requested far enough in advance, the latency can be completely hidden by other operations. Prefetch instructions operate similarly to load and store instructions: an address is specified and the datum at that address is accessed. The difference is that no register is given for the datum to be loaded into; it ends up in the cache. Since there is no syntax in standard languages such as C and Fortran to request a prefetch, one generally relies on the compiler to automatically insert prefetch instructions; indeed, this is one of the optimizations the LNO performs. However, compiler directives are also provided to allow a user to specify prefetches manually.

The functioning of the caches immediately leads to some straight-forward principles for utilizing the cache efficiently:

3.5.1.1 Stride-1 Accesses

Whenever possible, use stride-1 accesses to maximize cache line reuse. Since accessing a particular datum brings not only it into the cache, but also nearby values in the same line, those other values should be used as soon as possible to get maximal use out of the cache line. Consider the following loop nest:

      do i = 1, n
         do j = 1, n
            a(i,j) = b(i,j)
         enddo
      enddo

Here, the array b is copied to the array a one row at a time. Since Fortran stores matrices in column-major order, a new cache line must be brought in for each element of row i; this applies to both a and b. If n is large, the cache lines holding elements at the beginning of the row may be displaced from the cache before the next row is begun. This will cause cache lines to be loaded multiple times, once for each row they span.

Another way to perform this copy operation is to reverse the order of the loops:

      do j = 1, n
         do i = 1, n
            a(i,j) = b(i,j)
         enddo
      enddo

Using this ordering, the data are copied one column at a time. Thus, all data in the same cache line are accessed before the line can be displaced from the cache, so each line only needs to be loaded once.

3.5.1.2 Group Together Data Used at the Same Time

Consider the following code segement:

      d = 0.0
      do i = 1, n
         j = ind(i)
         d = d + sqrt(x(j)*x(j) + y(j)*y(j) + z(j)*z(j))
      enddo

This performs a calculation involving three vectors, x, y and z. Access to the elements of the vectors is through an index array, ind, so it is not likely that the accesses are stride-1. Thus, each iteration of the loop will cause a new line to be brought into the cache for each of the three vectors.

Now consider the following modification in which x, y and z have been grouped together into one two-dimensional array, r:

      d = 0.0
      do i = 1, n
         j = ind(i)
         d = d + sqrt(r(1,j)*r(1,j) + r(2,j)*r(2,j) + r(3,j)*r(3,j))
      enddo

Since r(1,j), r(2,j) and r(3,j) are contiguous in memory, it is likely that all three will fall in the same cache line. Thus, one cache line, rather than three, needs to be brought in to carry out each iteration's calculation; this reduces traffic to the cache by up to a factor of three.

If the loop had instead been

      d = 0.0
      do i = 1, n
         d = d + sqrt(x(i)*x(i) + y(i)*y(i) + z(i)*z(i))
      enddo

then the accesses would have been stride-1 and there would be no advantage to grouping together the vectors into one two-dimensional array unless they were aligned in memory so that x(i), y(i) and z(i) all mapped to the same location in the cache. The next section shows how such unfortunate alignments can happen.

3.5.1.3 Cache Thrashing and Array Padding

Consider the following code sample:

      parameter (max = 1024*1024)
      dimension a(max), b(max), c(max), d(max)

      do i = 1, max
         a(i) = b(i) + c(i)*d(i)
      enddo

Since all four vectors are declared one after the other, they will be allocated contiguously in memory. Each vector is 4 MB in size, so the low 22 bits of the addresses of a(i), b(i), c(i) and d(i) will be the same, and the vectors will map to the same cache location.

In order to perform the calculation in iteration i of this loop, a(i), b(i), c(i) and d(i) need to be resident in the primary cache. c(i) and d(i) are needed first to carry out the multiplication. They map to the same location in the cache, but both can be resident simultaneously since the cache is 2-way set associative. But in order to carry out the addition, b(i) needs to be in the cache. It maps to the same cache location, so the line containing c(i) is displaced to make room for it (here we assume that c(i) was accessed least recently). In order to store the result in a(i), the cache line holding d(i) must be displaced since a(i) maps to the same cache location as the three previous values.

Now the loop proceeds to iteration i+1. As before, c(i+1) and d(i+1) must be cache-resident, but there will be no cache line reuse. The line containing c(i+1) must be loaded anew into the cache since either

c(i+1) is in a different line than c(i) and so its line must be loaded for the first time, or
c(i+1) is in the same line as c(i), and this line was displaced from the cache during the previous iteration.

Similarly, the cache line holding d(i+1) must also be reloaded. In fact, each reference to a vector element results in a cache miss since only 2 of the four values needed during each iteration can reside in the cache simultaneously. So even though the accesses are stride-1, there is no cache line reuse.

This behavior is known as cache thrashing, and it results in very poor performance. The cause of the thrashing is the unfortunate alignment of the vectors: they all map to the same cache location. In this example the alignment is especially bad since max is large enough to cause thrashing in both primary and secondary caches. There are two ways to fix this problem:

Re-dimension the vectors so that their size is not a power of two. A new size which spaces the vectors out in memory so that a(1), b(1), c(1) and d(1) all map to different locations in the cache is ideal. For example, max = 1024*1024 + 32 would offset the beginning of each vector 32 elements, or 128 bytes, from each other. This is the size of a full (secondary) cache line, so each vector begins at a different cache address. All four values may now reside in the cache simultaneously, and complete cache line reuse is possible.
Introduce padding variables between the vectors in order to space out their beginning addresses. Ideally, each padding variable should be at least the size of a full cache line. Thus, the declaration
```
      dimension a(max), pad1(32), b(max), pad2(32),
     &          c(max), pad3(32), d(max)
```
eliminates cache thrashing.

We will see more examples of how to fix these and other cache problems later. But first we address how to determine when cache problems exist.

3.5.2 Use perfex and S`peedShop` to Identify Cache Problems

The different types of profiling supported by SpeedShop and the R10000 counters are the tools one can use to determine if there is a cache problem. To see how this works we return to the adi2 example used earlier in the discussion of profiling. The code for the main part of the calculations in this example (cf. adi2.f) is

      dimension a(128,128,128)

      do k = 1, nz
         do j = 1, ny
            call xsweep(data(1,j,k),1,nx)
         enddo
      enddo
      do k = 1, nz
         do i = 1, nx
            call ysweep(data(i,1,k),ldx,ny)
         enddo
      enddo
      do j = 1, ny
         do i = 1, nx
            call zsweep(data(i,j,1),ldx*ldy,nz)
         enddo
      enddo

df k = 1, nz do j = 1, ny call Xsweep(a(1,j,k),nx,ny,nz) enddo enddo do k = 1, nz do i = 1, nx call Ysweep(a(i,1,k),nx,ny,nz) enddo enddo do j = 1, ny do i = 1, nx call Zsweep(a(i,j,1),nx,ny,nz) enddo enddo

Operations are performed along pencils in each of the three dimensions:

Previously, profiles using PC sampling, ideal time profiling and the R10000 counters were presented for this code. The pc sampling output is

samples   time(%)      cum time(%)      procedure (dso:file)

   6688   6.7s( 78.0)  6.7s( 78.0)         zsweep (adi2:adi2.f)
    671  0.67s(  7.8)  7.4s( 85.8)         xsweep (adi2:adi2.f)
    662  0.66s(  7.7)    8s( 93.6)         ysweep (adi2:adi2.f)
    208  0.21s(  2.4)  8.2s( 96.0)       fake_adi (adi2:adi2.f)
    178  0.18s(  2.1)  8.4s( 98.1)         irand_ (/usr/lib32/libftn.so:../../libF77/rand_.c)
    166  0.17s(  1.9)  8.6s(100.0)          rand_ (/usr/lib32/libftn.so:../../libF77/rand_.c)
      1 0.001s(  0.0)  8.6s(100.0)      __oserror (/usr/lib32/libc.so.1:oserror.c)

   8574   8.6s(100.0)  8.6s(100.0)          TOTAL

This shows that the vast majority of time is spent in the routine zsweep. However, the ideal time profile output presents quite a different picture:

        cycles(%)  cum %     secs    instrns    calls procedure(dso:file)

  68026368(23.79)  23.79     0.35   68026368 32768 xsweep(adi2.pixie:adi2.f)
  68026368(23.79)  47.59     0.35   68026368 32768 ysweep(adi2.pixie:adi2.f)
  68026368(23.79)  71.38     0.35   68026368 32768 zsweep(adi2.pixie:adi2.f)
  35651584(12.47)  83.85     0.18   35651584 2097152 rand_(./libftn.so.pixn32:../../libF77/rand_.c)
  27262976( 9.54)  93.39     0.14   27262976 2097152 irand_(./libftn.so.pixn32:../../libF77/rand_.c)
  18874113( 6.60)  99.99     0.10   18874113     1 fake_adi(adi2.pixie:adi2.f)
     11508( 0.00)  99.99     0.00      11508     5 memset(./libc.so.1.pixn32:/slayer_xlv0/ficussg-nov05/work/irix/lib/libc/libc_n32_M4/strings/bzero.s)
      3101( 0.00)  99.99     0.00       3101    55 __flsbuf(./libc.so.1.pixn32:_flsbuf.c)
      2446( 0.00) 100.00     0.00       2446    42 x_putc(./libftn.so.pixn32:../../libI77/wsfe.c)
      1234( 0.00) 100.00     0.00       1234     2 x_wEND(./libftn.so.pixn32:../../libI77/wsfe.c)
      1047( 0.00) 100.00     0.00       1047     1 f_exit(./libftn.so.pixn32:../../libI77/close.c)
      1005( 0.00) 100.00     0.00       1005     5 fflush(./libc.so.1.pixn32:flush.c)

This profile indicates that xsweep, ysweep and zsweep should all take the same amount of time. The difference between pc sampling and ideal time is that by taking samples of the program counter position as the program runs, pc sampling obtains an accurate measure of the time associated with memory accesses. Ideal time simply counts the the instructions which access memory and thus must tally the same amount of time for a cache hit as a cache miss.

This means that the differences we see between the pcsamp and ideal profiles indicate which routines have cache problems. Thus, zsweep is suffering from some serious cache problems, but we have no indication which cache --- primary, secondary or TLB --- is the principal source of the troubles. The only way to determine this is by using the R10000 event counters.

Earlier, the perfex output for this program was presented. Below are the counter values and "typical" times for the total time and the events relating to cache misses:

 0 Cycles.........................................  1639802080   8.366337     8.366337     8.366337
26 Secondary data cache misses....................     7736432   2.920580     1.909429     3.248837
23 TLB misses.....................................     5693808   1.978017     1.978017     1.978017
 7 Quadwords written back from scache.............    61712384   1.973562     1.305834     1.973562
25 Primary data cache misses......................    16368384   0.752445     0.235504     0.752445
22 Quadwords written back from primary data cache.    32385280   0.636139     0.518825     0.735278

These results make it clear that it is secondary data cache and TLB misses which are hindering the performance of zsweep. In order to understand why, we need to look in more detail at what happens when this routine is called.

zsweep performs calculations along a pencil in the z-dimension. This is the index of the a array which varies most slowly in memory. The declaration for a shows that there are 128*128 = 16384 array elements, or 64 KB, between successive elements in the z-dimension. Thus, each element along the pencil maps to the same location in the primary cache. In addition, every thiry-second element maps to the same place in a 4 MB secondary cache (every eighth element for a 1 MB L2 cache). Since the pencil is 128 elements long, the data in the pencil map to 4 different secondary cache lines (16 for a 1 MB L2 cache). Since the caches are only two-way set associative, there will be cache thrashing just as in the cache thrashing example above.

As we saw before, cache thrashing is fixed by introducing padding. Since all the data are in a single array, we can only introduce padding by increasing the array bounds. Changing the array declaration to

      dimension a(129,129,128)

will provide sufficient padding. With these dimensions, array elements along a z-pencil all map to different secondary cache lines. [For a system with a 4MB cache, two elements, a(i,j,k1) and a(i,j,k2), map to the same secondary cache line if ((k2-k1)*129*129*(4B/word) mod (4MB/(2 sets)))/128 = 0.] Running this version of the program (adi5.f) produces the following counts:

 0 Cycles.........................................   759641440   3.875722     3.875722     3.875722
23 TLB misses.....................................     5738640   1.993592     1.993592     1.993592
26 Secondary data cache misses....................     1187216   0.448186     0.293017     0.498559
 7 Quadwords written back from scache.............     7054080   0.225589     0.149264     0.225589
25 Primary data cache misses......................     3872768   0.178029     0.055720     0.178029
22 Quadwords written back from primary data cache.     7468288   0.146699     0.119645     0.169561

The secondary cache misses have been reduced by more than a factor of 6. The primary cache misses have also been greatly reduced. Now, TLB misses represent the dominant cache cost.

The reason for TLB misses to consume such a large percentage of the time is thrashing of that cache. Each TLB entry stores the mapping between an even-odd pair of virtual and physical memory pages. Each page is 16 KB in size, and adjacent elements in a z-pencil are separated by 129*129*4 = 65564 bytes, so a separate TLB entry is needed for each element. However, there are 128 elements per pencil and only 64 TLB entries, so the TLB must be completely reloaded (twice) for each sweep in the z-direction: there is no TLB cache reuse.

While not quite as easy as array padding, there are a couple of relatively simple ways to fix this problem: cache blocking and copying. Cache blocking is an operation which the LNO can perform automatically; it will be described soon. Copying is a trick which the LNO does not perform, but which is generally quite easy to do manually. Cache blocking can be used in cases where the work along z-pencils can be carried out independently on subsets of the pencil. But if the calculations along a pencil require a more complicated intermixing of data, or are performed in a library routine, copying will be the only general way to solve this problem.

The TLB thrashing results from having to work on data which are spread out over too many memory pages. This is not a problem for the x- and y-dimensions since these data are close enough together to allow many sweeps to reuse the same TLB entries. What is needed is a way to achieve the same kind of reuse for the z-dimension. This can be accomplished by copying z-pencils to a scratch array which is small enough to avoid TLB thrashing, carrying out the z-sweeps on the scratch array, and then copying the results back to the a array.

How big should this scratch array be? In order to avoid thrashing, it needs to be smaller than the number of pages which the TLB can map simultaneously; this is approximately 58*2 pages (IRIX reserves a few TLB entries for its own use), or 1856 KB. But it shouldn't be too small since the goal is to get a lot of reuse from the page mappings while they are resident in the TLB. The calculations in both the x- and y-dimensions achieve good performance. This indicates that an array the size of an xz-plane should work well. An xz-plane is used rather than a yz-plane in order to maximize cache reuse in the x-dimension. If an entire plane requires more scratch space than is practical, a smaller array can be used, but in order to get good secondary cache reuse, a subset of an xz-plane whose x-size is a multiple of a cache line should be used.

Running a version of the program modified to copy entire xz-planes to a scratch array before z sweeps are performed (adi53.f) produced the following counts:

 0 Cycles.........................................   505106704   2.577075     2.577075     2.577075
25 Primary data cache misses......................    14498336   0.666480     0.208599     0.666480
22 Quadwords written back from primary data cache.    24018944   0.471801     0.384793     0.545328
26 Secondary data cache misses....................      866976   0.327292     0.213978     0.364078
 7 Quadwords written back from scache.............     6643584   0.212462     0.140578     0.212462
23 TLB misses.....................................       75504   0.026230     0.026230     0.026230

While the copying has increased the primary and secondary cache misses somewhat, the TLB misses have all but disappeared and the overall performance is significantly improved. In cases where it is possible to use cache blocking instead of copying, the TLB miss problem can be solved without increasing the primary and secondary cache misses, and without expending time to do the copying, so it will provide even better performance. Use copying when cache blocking is not an option.

3.5.3 Standard Techniques for Improving Cache Performance

Through the use of perfex, PC sampling, ideal time profiling, cache problems can easily be identified. Padding and copying are two techniques which can be used to remedy them. We now discuss some other useful methods for improving cache performance.

3.5.3.1 Loop Fusion

Consider the following example:

      dimension a(N), b(N)

      do i = 1, N
         a(i) = a(i) + alpha*b(i)
      enddo
      dot = 0.0
      do i = 1, N
         dot = dot + a(i)*a(i)
      enddo

Here, the two loops carry out two vector operations. The first is the DAXPY operation we have seen before; the second is a DOT product. If N is large enough that the vectors don't fit in cache, this code will stream the vector b from memory through the cache once, and the vector a twice: once to perform the DAXPY operation, and a second time to calculate the DOT product. Since the memory latency is large (60 cycles for local memory if prefetch is not used) compared to the time it takes to do the arithmetic (one cycle per element per loop), this code will take about 3*60*N cycles to run.

Now consider the following modification:

      dimension a(N), b(N)

      dot = 0.0
      do i = 1, N
         a(i) = a(i) + alpha*b(i)
         dot  = dot + a(i)*a(i)
      enddo

In this version, the two vector operations have been combined into one loop. Now, the vector a only needs to be streamed through the cache once since its contribution to the dot product can be calculated while the cache line holding a(i) is still cache-resident from its use in the DAXPY operation. This version will take approximately 2*60*N cyles to complete, 1/3 less time than the original version.

Combining two loops into one is called loop fusion. There is one potential drawback to fusing loops, however: it makes the body of the fused loop larger. If the loop body gets too large, it can impede software pipelining. The LNO tries to balance the conflicting goals of better cache behavior with efficient software pipelining.

3.5.3.2 Cache Blocking

Another standard technique used to improve cache performance is cache blocking, or tiling. Here, data structures which are too big to fit in cache are broken up into smaller pieces which will fit in the cache.

As an example, consider matrix multiplication:

      do j = 1, n
         do i = 1, m
            do k = 1, l
               c(i,j) = c(i,j) + a(i,k)*b(k,j)
            enddo
         enddo
      enddo

If A, B and C all fit in cache, performance is great. But if they are too big, performance drops substantially as can be seen from the following table of results:

  m     n     p    lda   ldb   ldc    Seconds   MFLOPS
======================================================
   30    30    30    30    30    30    0.000162  333.9
  200   200   200   200   200   200    0.056613  282.6
 1000  1000  1000  1000  1000  1000   25.431182   78.6

(These results were obtained with LNO turned off, -LNO:opt=0, since the LNO is designed to fix these problems!)

The reason for the drop in performance is clear if one uses perfex on the 1000x1000 case:

 0 Cycles.........................................  5460189286  27.858109
23 TLB misses.....................................    35017210  12.164907
25 Primary data cache misses......................   200511309   9.217382
26 Secondary data cache misses....................    24215965   9.141769

This problem with cache misses can be fixed by blocking the matrices as shown below:

Here, one block of C is updated by multiplying together a block of A and a block of B, so the blocks must be small enough so that one from each matrix can fit into the cache at the same time.

Why does this improve performance? To answer this question, we need to count the number of times each matrix is touched. From looking at the unblocked code, we can see that each element of C is touched just once: c(i,j) is read in from memory, the dot product of the row vector a(i,:) with the column vector b(:,j) is added to c(i,j), and the result is stored back to memory. But calculating an entire column of C will require reading all the rows of A, so A is touched n times, once for each column of C, or equivalently each column of B. Similarly, every column vector b(:,j) must be reread for each dot product with a row of A, so B is touched m times.

If A and B don't fit in the cache, there is likely to be very little reuse between touches, so these accesses will require streaming data in from main memory.

Now consider what happens if the matrices are blocked. A block of C is calculated by taking the dot product of a block-row of A with a block-column of B. The dot product consists of a series of sub-matrix multiplies. If three blocks, one from each matrix, all fit in cache simultaneously, the elements of those blocks only need to be read in from memory once for each sub-matrix multiply. Thus, A will now only need to be touched once for each block-column of C, and B will only need to be touched once for each block-row of A. This reduces the memory traffic by the size of the blocks.

Of course, the bigger the blocks are, the greater the reduction in memory traffic, but only up to a point. Three blocks must be able to fit into cache at the same time, so there is an upper bound on how big the blocks can be. For example, three 200x200 double precision blocks take up almost 1 MB, so this is about as big as the blocks can be on a 1 MB cache system. Furthermore, the blocks cannot conflict with themsleves or each other; that is, different elements of the blocks must not map to the same cache lines or it will not be true that they only need to be read in from memory once for each sub-matrix multiply. Conflicts can be reduced or eliminated by proper padding of the leading dimension of the matrices, careful alignment of the matrices in memory with respect to each other, and, if necessary, further limiting of the block size.

One other factor affects what size blocks can be used: the TLB cache. The larger we make the blocks, the greater the number TLB entries which will be required to map their data. If the blocks are too big, the TLB cache will thrash. For small matricies, in which a pair of pages spans multiple columns of a sub-matrix, this is not much of a restriction. But for very large matricies, this means that the width of a block will be limited by the number of available TLB entries. (The sub-matrix multiplies can be carried out so that data accesses go down the coumns of two of the three sub-matricies and across the rows of only one sub-matrix. Thus, the number of TLB entries only limits the width of one of the sub-matricies.) This is just one more detail which must be factored into how the block sizes are chosen.

Let's return now to the 1000x1000 matrix multiply example. Each column is 8000 bytes in size. Thus, (1048576/2)/8000 = 65 columns will fit in one set of a 1 MB secondary cache, and 130 will fill both sets. In addition, each TLB entry will map 4 columns, so the block size is limited to about 225 before TLB thrashing will start to become a problem. Here are the performance results for 3 different block sizes:

                                    block                                     
  m     n     p    lda   ldb   ldc  order  mbs   nbs   pbs    Seconds   MFLOPS
==============================================================================
 1000  1000  1000  1000  1000  1000  ijk     65    65    65    6.837563  292.5
 1000  1000  1000  1000  1000  1000  ijk    130   130   130    7.144015  280.0
 1000  1000  1000  1000  1000  1000  ijk    195   195   195    7.323006  273.1

For block sizes which limit cache conflicts, the performance of the blocked algorithm on this larger problem matches the performance of the unblocked code on a size which fits entirely in the secondary cache (the 200x200 case above). The perfex counts for the block size of 65 are

 0 Cycles......................................... 1406325962   7.175132
21 Graduated floating point instructions.......... 1017594402   5.191808
25 Primary data cache misses......................   82420215   3.788806
26 Secondary data cache misses....................    1082906   0.408808
23 TLB misses.....................................     208588   0.072463

The secondary cache misses have been greatly reduced and the TLB misses are now almost non-existent. From the times perfex has estimated, the primary cache misses look to be a potential problem. But perfex's estimate assumes no overlap of primary cache misses with other work. The R10000 generally manages to achieve a lot of overlap on these misses, so this estimated time is an overestimate. This can be confirmed by looking at the times of the other events and comparing them to the total time. The inner loop of matrix multiply generates one madd instruction per cycle and perfex estimates the time based on this same ratio, so the floating point time should be pretty accurate. Based on this alone, the primary cache misses can at most account for about 2 seconds. (This points out that when estimating primary cache misses, you have to make some attempt to account for overlap, either by making an educated guess or by letting the other counts lead you to reasonable bounds.)

For convenience, we have used square blocks in the example above. But since there is some overhead to starting up and winding down software pipelines, rectangular blocks, in which the the longest dimension corresponds to the inner loop, will generally perform the best.

Cache blocking is one of the optimizations the LNO performs. It chooses block sizes which are as large as it can fit into the cache without conflicts. In addition, it uses rectangular blocks to reduce software pipeline overhead and maximize data use from each cache line. If we simply compile the matrix multiply kernel with the recommended flags (which enable LNO),

f77 -n32 -mips4 -Ofast -OPT:IEEE_arithmetic=3 -c matmul.f

here's the performance we get:

  m     n     p    lda   ldb   ldc    Seconds   MFLOPS
======================================================
 1000  1000  1000  1000  1000  1000    7.199939  277.8

Not bad for doing absolutely no work at all!

But lest you think that the LNO's automatic cache blocking will solve all possible cache problems, the cache conflicts that we'll see in the next section cannot be fixed automatically. For them, manual intervention is required.

3.5.3.3 Transposes

Now let's consider another example: fast fourier transforms (FFTs). These algorithms present challenges on a cache-based machine due to their data access patterns and since they usually involve data which are a power-of-two in size.

In a radix-2 implementation, there are log(n) computational stages. In each stage, pairs of values are combined. In the first stage, these data are n/2 elements apart; in the second stage, data which are n/4 elements apart are combined, and so on. Below is a diagram showing the three stages for an 8-point transform:

Since both the size of the cache and the separation between combined elements is a power of two, these data access patterns cause cache conflicts when n is large enough that the data do not fit in the primary cache. Since the R10000 caches are two-way set associative, the 2-way conflicts of the radix-2 algorithm will be resolved without thrashing. However, a radix-2 implementation is not particularly efficient. A radix-4 implementation, in which 4 elements are combined at each stage, is much better: it uses half the number of stages --- which means it makes half the memory accesses --- and it requires fewer arithmetic operations. But the 4-way conflicts cause thrashing. So how can we deal with this?

The trick is to view the data as a two-dimensional, rather than a one-dimensional, array. Viewed this way, half the stages become FFTs across the rows of the array, while the remainder are FFTs down the columns. The column FFTs are cache-friendly (assuming Fortran storage order), so nothing special needs to be done for them. But the row FFTs will suffer from thrashing. The easiest way to eliminate the trashing is through the use of copying just as was done above to cure TLB performance problems in the adi2 code. An alternative, which results in slightly better performance, is to transpose the two-dimensional array so that cache-thrashing row FFTs become cache-friendly column FFTs.

To carry out the transpose, the two-dimensional array is partitioned into square blocks. If the array itself cannot be chosen to be a square, it can be chosen to have twice as many rows as columns (or vice versa). In this case, it can be divided horizontally (vertically) into two squares; these two squares can be transposed independently.

The number of columns in a block is chosen to be that number which will fit in the cache without conflict (2-way conflicts are alright due to the 2-way set associativity). The square array is then transposed as follows: off-diagonal blocks in the first block-column and block-row are transposed and exchanged with each other. The exchanges should be done working from the blocks farthest from the diagonal towards those closest to it; in case there is not complete conflict between the blocks in each block row, this will leave some of the block adjacent to the diagonal in cache. Then the diagonal block is transposed. At this point, the entire first column will be in-cache. Before proceeding with the rest of the transpose, the FFTs down the first block-column are carried out on the in-cache data. Once these have been completed, the next block-column and block-row may be exchanged. These are one block shorter than in the previous exchange; as a result, once the exchange is completed, not all of the second block-column is guaranteed to be in-cache. But most of it is, so FFTs are carried on this second block-column. Exchanges and FFTs continue until all the "row" FFTs have been completed.

Now we are ready for the column FFTs, but there is a problem. The columns are now rows, and so the FFTs will cause conflicts. This is cured just as above: the array is transposed, but in order to take greatest advantage of the data currently in-cache, this transpose should proceed from right-to-left rather than from left-to-right. Once this transpose and the "column" FFTs have been completed, we are almost done. A side-effect of treating the one-dimensional FFT as a two-dimensional FFT is that the data end up transposed. Once again this is no problem since we can simply transpose them a third time.

In spite of the three transposes, for data which don't fit in L1, this algorithm is significantly faster than simply allowing the row FFTs to thrash the cache as can be seen from the following table of performance figures for a range of double precision FFT sizes:

                 Without Blocking/     With Blocking/ 
                         Transposes         Transposes
            n      Seconds   MFLOPS   Seconds   MFLOPS
       ===============================================
            2048   0.001035   108.8   0.000782   144.1
            4096   0.003011    81.6   0.001525   161.2
            8192   0.006785    78.5   0.003300   161.4
           16384   0.015205    75.4   0.006816   168.3
           32768   0.034830    70.6   0.015395   159.6
           65536   0.077111    68.0   0.030922   169.6

But what about data which do not even fit in the secondary cache? For these larger arrays, one blocks just as above. but if the FFTs down the columns of the transposed array don't fit in the primary cache, another level of blocking and transposing needs to be used for them. As a result, a well-tuned library routine for calculating an FFT will require three separate algorithms: a textbook radix-4 algorithm for the small data sets which fit in L1; a double-blocked algorithm for large data sets; and a single-blocked algorithm for the intermediate sizes. The FFTs in complib.sgimath utilize different algorithms in just this way.

3.5.3.4 Using Larger Page Sizes to Reduce TLB Misses

The adi2 program above provided an example of the performance problems that can arise when calculations incur a great number of TLB misses. The problem was solved by copying z-dimension data into a small scratch array to provide better reuse of the page table entries cached in the TLB. This is a general solution which will work in any situation. However, it does require modifying the code in order to achieve the mance improvement.

There is an alternate technique which can be used without code modification, namely, ask the operating system to use larger page sizes. Since each TLB entry will then represent more data, the number of TLB misses can be reduced. Of course, it one then attempts to work on larger problems, the TLB misses can return, so this technique is more of an expedient than a general solution. Nevertheless, if you can get something for free, it's worth looking into.

The page size can be changed in two ways: The utility dplace(1) can reset the page size for any program. Its use is very simple:

% dplace -data_pagesize 64k -stack_pagesize 64k -text_pagesize 64k \
        program

With this command, the data, stack and text pagesizes have all been set to 64 KB. Since TLB misses will typically come as a result of calculations on the data in the program, setting the pagesize for the program text (i.e., instructions) is probably not needed, but it often is convenient to simply reset all three page sizes rather than worrying about which one doesn't need to be made larger.

For parallel programs which use the MP library (-lmp), a second method exists. All one needs do is set some environment variables. For example, the following three lines

% setenv PAGESIZE_DATA  64
% setenv PAGESIZE_STACK 64
% setenv PAGESIZE_TEXT  64

set the pagesize for data, text and stack to 64 KB.

There are some restrictions, however, on what pagesizes can be used. First of all, the only valid page sizes are 16 KB, 64 KB, 256 KB, 1 MB, 4 MB and 16 MB. Secondly, IRIX 6.4 only allows two different pagesizes to be used system-wide at any particular time, and the two available sizes are chosen by the system administrator at boot time.

3.5.4 Loop Nest Optimizations

The loop nest optimizer, or LNO, attempts to improve cache and instruction scheduling by performing high level transformations on loops. Above, we have seen some of the transformations the loop nest optimizer can carry out (loop interchange, padding, loop fusion, cache blocking, prefetching). The LNO has been designed to make things simple for most users, yet flexible for power users: much of the time you can just accept the default actions of the LNO and get good results; but sophistocated users will sometimes want to fine tune what the LNO does. In the sections which follow, we will describe the options and directives which control the behavior of the loop nest optimizer.

3.5.4.1 Running the LNO

The LNO is turned on automatically when the highest level of optimzation, -O3 or -Ofast, is used. In order to be of general benefit to most programs, when the compiler cannot determine the size of data structures used in a program, it assumes they will not fit in cache and carries out the cache optimzations it is capable of. As a result, unnecessary optimzations may be carried out on programs which are already cache-friendly.

In order to allow you the ability to prevent such occurrences, LNO can be disabled completely by using the flag -LNO:opt=0. In addition, nearly all of the LNO's optimzations may be individually controlled via command line options and/or compiler directives, which we will describe shortly. The directives take precedence over the command line options, but they can be disabled with the flag -LNO:ignore_pragmas. Finally, to provide you with detailed information from which to decide whether or not the transformations are beneficial, the compiler can emit a transformation file showing how the LNO (and other parts of the compiler) have optimized the program.

3.5.4.2 Visualizing Transformations

Transformations performed by the LNO, IPA, and other components of the compiler may be visualized via a transformation file. This is a C or Fortran file emitted by the compiler which shows what the program looks like after the optimzations have been applied. It may be used in fine tuning which optimzations should be enabled or disabled for best performance.

One may cause a Fortran transformation file to be emitted by using the -flist or -FLIST:=ON flags, and a C file will be emitted if the -clist or -CLIST:=ON flags are used:

% f77 -n32 -mips4 -Ofast -OPT:IEEE_arithmetic=3 -FLIST:=on ...
% cc  -n32 -mips4 -Ofast -OPT:IEEE_arithmetic=3 -CLIST:=on ...

The transformation file is given the same name as the source file being compiled, but .w2f (.w2c) is inserted before the .f (.c). The compiler will issue a diagnostic message indicating that the transformation file has been written and what its name is:

% f77 -n32 -mips4 -r10000 -O3 -OPT:roundoff=3:IEEE_arithmetic=3 -flist -c matmul.f
//usr/lib32/cmplrs/be translates /tmp/ctmB.BAAa003Hz into matmul.w2f.f, based on source matmul.f

If IPA is enabled (e.g., when using -Ofast), the transformation file is not written until the link step.

In order to be as readable as possible, the transformation file uses as many of the variable names from the original source as it can. However, Fortran files will be written using a free format in which lines begin with a tab character and do not have to adhere to the 72-character limit. The MIPSpro compilers will compile this form correctly, but if you wish the file to be written using the standard 72-character Fortran layout, the -FLIST:ansi_format flag should be used:

% f77 -n32 -mips4 -r10000 -O3 -OPT:roundoff=3:IEEE_arithmetic=3 -FLIST:ansi_format -c matmul.f
//usr/lib32/cmplrs/be translates /tmp/ctmB.BAAa003WE into matmul.w2f.f, based on source matmul.f

As an example, you can follow these links to peruse filter1.w2f.f.default, the transformed source of a small kernel, filter1.f., as well as filter1.w2f.f.ansi_format, the ansi_format version.

In addition to the transformation file, recall that the compiler writes software pipelining report cards into the assembler file it emits when the -S option is used. The assembler file also contains a listing of the compilation flags.

3.5.4.3 Outer Loop Unrolling

One of the non-cache optimizations that the LNO performs is outer loop unrolling (sometimes called register blocking). Consider the following subroutine which implements a matrix multiplication:

      subroutine mm(a,lda,b,ldb,c,ldc,m,l,n)
      integer lda, ldb, ldc, m, l, n
      real*8 a(lda,l), b(ldb,n), c(ldc,n)
c
      do j = 1, n
         do k = 1, l
            do i = 1, n
               c(i,j) = c(i,j) + a(i,k)*b(k,j)
            enddo
         enddo
      enddo
c
      return
      end

The report card for this loop (assuming we compile with loop nest optimizations turned off) is

 #<swps>
 #<swps> Pipelined loop line 7 steady state
 #<swps>
 #<swps>     2 unrollings before pipelining
 #<swps>     6 cycles per 2 iterations
 #<swps>     4 flops        ( 33% of peak) (madds count as 2)
 #<swps>     2 flops        ( 16% of peak) (madds count as 1)
 #<swps>     2 madds        ( 33% of peak)
 #<swps>     6 mem refs     (100% of peak)
 #<swps>     2 integer ops  ( 16% of peak)
 #<swps>    10 instructions ( 41% of peak)
 #<swps>     1 short trip threshold
 #<swps>     6 ireg registers used.
 #<swps>     8 fgr registers used.
 #<swps>

This is just like the one we saw above for the DAXPY operation. It says that in the steady-state, this loop will achieve 33% of the peak floating point rate. Furthermore, the reason for this less than peak performance is that the loop is load/store-bound. But is it really?

Software pipelining is applied only to a single loop, i.e.,

      do i = 1, n
              .
              .
              .
      enddo

In the case of a loop nest, as above, this innermost loop is the one which is software pipelined. Thus, the inner loop is operated on as if it were independent of the loops surrounding it. But this is not the case in general, and it is certainly not the case for matrix multiply.

Consider the following loop nest:

      subroutine mm1(a,lda,b,ldb,c,ldc,m,l,n)
      integer lda, ldb, ldc, m, l, n
      real*8 a(lda,l), b(ldb,n), c(ldc,n)
c
      do j = 1, n, nb
         do k = 1, l
            do i = 1, n
               c(i,j+0) = c(i,j+0) + a(i,k)*b(k,j+0)
               c(i,j+1) = c(i,j+1) + a(i,k)*b(k,j+1) 
                                   .
                                   .
                                   .
               c(i,j+nb-1) = c(i,j+nb-1) + a(i,k)*b(k,j+nb-1)
            enddo
         enddo
      enddo
c
      return
      end

This is the same as the code above except that the outermost loop has been unrolled nb times. (In order be truly equivalent to that above, the upper limit of the j-loop should be changed to n-nb+1 and an additional loop nest needs to be added to take care of iterations leftover if nb does not divide n evenly. These details have been omitted for ease of presentation.)

The thing to note about this loop nest in comparison to the previous one, however, is that nb j-values are worked on in the inner loop and a(i,k) only needs to be loaded into a register once for all these calculations. Due to the order of operations in the previous loop nest, the same a(i,k) value needed to be loaded once for each of the c(i,j+0), ... c(i,j+nb-1) calculations, that is, a total of nb times. Thus, the outer loop unrolling has reduced by a factor of nb the number of times that elements of the a matrix need to be loaded.

Now consider another loop nest:

      subroutine mm2(a,lda,b,ldb,c,ldc,m,l,n)
      integer lda, ldb, ldc, m, l, n
      real*8 a(lda,l), b(ldb,n), c(ldc,n)
c
      do j = 1, n
         do k = 1, l, lb
            do i = 1, n 
               c(i,j) = c(i,j) + a(i,k+0)*b(k+0,j)
               c(i,j) = c(i,j) + a(i,k+1)*b(k+1,j)
                               .
                               .
                               .
               c(i,j) = c(i,j) + a(i,k+lb-1)*b(k+lb-1,j)
            enddo
         enddo
      enddo
c
      return
      end

Here, the middle loop has been unrolled lb times. Due to the unrolling, c(i,j) only needs to be loaded and stored once for all lb calculations in the inner loop. Thus, unrolling the middle loop reduces the loads and stores of the c matrix by a factor of lb.

If both outer and middle loops are unrolled simultaneously, then both of the load/store reductions will occur:

      

do j = 1, n, nb
         do k = 1, l, lb
            do i = 1, n
               c(i,j) = c(i,j+0) + a(i,k+0)*b(k+0,j+0)
               c(i,j) = c(i,j+0) + a(i,k+1)*b(k+1,j+0)
                                 .
                                 .
                                 .
               c(i,j) = c(i,j+0) + a(i,k+lb-1)*b(k+lb-1,j+0)

               c(i,j) = c(i,j+1) + a(i,k+0)*b(k+0,j+1)
               c(i,j) = c(i,j+1) + a(i,k+1)*b(k+1,j+1)
                                 .
                                 .
                                 .
               c(i,j) = c(i,j+1) + a(i,k+lb-1)*b(k+lb-1,j+1)

                                 .
                                 .
                                 .

               c(i,j) = c(i,j+nb-1) + a(i,k+0)*b(k+0,j+0)
               c(i,j) = c(i,j+nb-1) + a(i,k+1)*b(k+1,j+0)
                                 .
                                 .
                                 .
               c(i,j) = c(i,j+nb-1) + a(i,k+lb-1)*b(k+lb-1,j+nb-1)
            enddo
         enddo
      enddo

So how does this help us? The total operations in the inner loop of the above loop nest is

        madds:          lb * nb
        c loads:        nb
        c stores:       nb
        a loads:        lb
        b loads:        0 (if sufficient registers)

With no unrolling (nb = lb = 1), the inner loop is load/store bound. However, the number of madds grows quadratically with the unrolling, while the number of memory operations only grows linearly. By increasing the unrolling, it may be possible to convert the loop from load/store-bound to floating point-bound.

To accomplish this requires that the number of madds be at least as large as the number of memory operations. While a branch and a pointer increment must also be executed by the processor, the ALUs are responsible for these operations. For the R10000 there are plenty of superscalar slots for them to fit in, so they don't figure in the selction of the unrolling factors. (Note, though, this is not the case for R8000 since memory operations and madds together can fill up all the superscalar slots.)

Below is a table showing the number of madds and memory operations per replication for various values of lb and nb.

        lb      nb      madds   memops  comment
        1       1       1       3       M
        1       2       2       5       M
        1       3       3       7       M
        1       4       4       9       M
        1       5       5       11      M
        1       6       6       13      M
        2       1       2       4       M
        2       2       4       6       M
        2       3       6       8       M
        2       4       8       10      M
        2       5       10      12      M
        2       6       12      14      M
        3       1       3       5       M
        3       2       6       7       M
        3       3       9       9       *
        3       4       12      11      *
        3       5       15      11      *
        3       6       18      12      *
        4       1       4       6       M
        4       2       8       8       *
        4       3       12      10      *
        4       4       16      12      *
        4       5       20      14      -
        4       6       24      16      -
        5       1       5       7       M
        5       2       10      9       *
        5       3       15      11      *
        5       4       20      13      *
        5       5       25      15      -
        5       6       30      17      -
        6       1       6       8       M
        6       2       12      10      *
        6       3       18      12      *
        6       4       24      14      -
        6       5       30      16      -
        6       6       36      18      -

where

        M = memory-bound schedule
        * = optimal software pipeline
        - = suboptimal schedule due to
            register allocation problems

These results are summarized in the following figure:

So there are several blocking choices which will produce schedules achieving 100% of floating point peak. Keep in mind, though, that this is the steady-state performance of the inner loop for in cache data. There will be some overhead incurred in filling and draining the software pipeline, and this overhead may vary depending on which unrolling values are used. And if the arrays don't fit in the cache, the performance will be substantially less unless blocking is used

Outer loop unrolling is one of the optimizations which the LNO performs: it chooses the proper amount of unrolling for loop nests such as this matrix multiply kernel. For this particular case, if we compile with blocking turned off, the compiler chooses to unroll the j-loop by 2 and the k-loop by 4, achieving a perfect schedule. (With blocking turned on, the compiler also blocks the matrix multiply producing a more complicated transformation, so it is harder to rate the effectiveness of just the outer loop unrolling.)

While the LNO will generally choose the optimal amount of unrolling, several flags and a directive are provided for fine-tuning. The three principal flags which control outer loop unrolling are:

-LNO:outer_unroll=n This flag instructs the compiler to unroll every outer loop (for which unrolling is possible) by exactly n. The compiler will either unroll by this amount or not at all. If this flag is used, the next two flags cannot be used.
-LNO:outer_unroll_max=n This flag tells the compiler that it may unroll any outer loop by as many as n, but no more.
-LNO:outer_unroll_prod_max=n This flag indicates that the product of unrolling of the various outer loops in a given loop nest shall not exceed n. The default is 16.

These flags apply to all the loop nests in the file being compiled. To control the unrolling amount for individual loops, a directive may be used. For Fortran, it is

c*$* unroll (n)

and for C,

#pragma unroll (n)

If the loop that this directive immediately preceeds is not innermost, then outer loop unrolling is performed. The value of n must be at least one. If n=1, no unrolling is performed. If n ==0, the default unrolling is applied. If this directive immediately precedes an innermost loop, then standard loop unrolling will be done. This use is not recommended since the software pipeliner unrolls inner loops if it finds it beneficial; generally, the software pipeliner is the best judge of how much inner loop unrolling should be done.

3.5.4.4 Loop Interchange

We saw earlier that stride-1 accesses can improve cache reuse. Thus, the loop

for (i=0; i<n; i++) {
   for (j=0; j<m; j++) {
      a[j][i] = 0.0;
   }
}

is more efficient if the loops are interchanged:

for (j=0; j<m; j++) {
   for (i=0; i<n; i++) {
      a[j][i] = 0.0;
   }
}

This is a transformation which is automatically carried out by the loop nest optimizer. However, in optimizing such loop nests, the LNO has to consider more than just the cache behavior. What if n=2 and m=100? Then the array fits in cache and, with the original loop order, the code will achieve full cache reuse. However, interchanging the loops will put the shorter i loop inside. Software pipelining that loop will incur substantial overhead. So the compiler will have made the wrong choice in this case.

Let's consider another example:

for (i=0; i<n; i++) {
   for (j=0; j<m; j++) {
      a[j][i] = a[j][i] + a[j][i-1];
   }
}

Reversing the order of his loop will produce nice cache behavior, but the recursion in i will limit the performance software pipelining can achieve for the inner loop. The LNO needs to consider the cache effects, the instruction scheduling, and loop overhead when deciding whether it should reorder such a loop nest. (In this case, the LNO assumes that m and n are large enough that optimizing for cache behavior is the best choice.)

If you know that in a particular situation the LNO has made the wrong loop interchange choice, you can instruct it to make a correction. Just like the use of the -OPT:alias flag, you provide the compiler with the information it is lacking to make the right optimization. There are two ways to supply this extra information:

You can turn off loop interchange, or
You can tell the compiler which order you want the loops in.

Loop interchange is disabled with the following flag and directives:

-LNO:interchange=off
#pragma no interchange
c*$* no interchange

The flag applies to the entire file being compiled, and the directives apply to just the immediately following loop:

#pragma no interchange
for (i=0; i<n; i++) {
   for (j=0; j<m; j++) {
      a[j][i] = 0.0;
   }
}

To specify a particular order for a loop nest, the following directive is available:

#pragma interchange (i, j [,k ...])
c*$* interchange (i, j [,k ...])

This directive instructs the compiler to try to order the loops so that i is outermost, j is inside it, and so on (i, j, k, etc. are the loop index variables). The loops should be perfectly nested (that is, there should be no code between the do or for statements). The compiler will try to order the loops as requested, but it is not guaranteed that it will be able to do so.

The LNO's loop interchange choices are correct for most situations. But if you have additional information which you can supply to help out the compiler, you should take the initiative to provide it.

3.5.4.5 Loop Interchange and Outer Loop Unrolling

Note that loop interchange and outer loop unrolling can be combined to solve some performance problems neither technique can on its own. For example, in the following loop nest from the previous seciton,

for (i=0; i<n; i++) {
   for (j=0; j<m; j++) {
      a[j][i] = a[j][i] + a[j][i-1];
   }
}

interchange improves cache performance, but the recurrence on i will limit software pipeline performance. But if the j-loop is unrolled after interchange, the recurrence will be mitigated since there will be several independent streams of work which can be used to fill up the processor's functional units:

for (j=0; j<m; j+=4) {
   for (i=0; i<n; i++) {
      a[j+0][i] = a[j+0][i] + a[j+0][i-1];
      a[j+1][i] = a[j+1][i] + a[j+1][i-1];
      a[j+2][i] = a[j+2][i] + a[j+2][i-1];
      a[j+3][i] = a[j+3][i] + a[j+3][i-1];
   }
}

The LNO considers interchange, unrolling and blocking together to achieve the overall best performance.

3.5.4.6 Controlling Cache Blocking

The LNO performs cache blocking automatically. So a matrix multiply loop such as

   for (j=0; j<n; j++) {
      for (i=0; i<m; i++) {
         for (k=0; k<l; k++) {
            c[i][j] += a[i][k]*b[k][j];
         }
      }
   }

will be transformed into a blocked loop nest something like the following:

   for (ii=0; ii<m; ii+=b1) {
      for (kk=0; kk<l; kk+=b2) {
         for (j=0; j<n; j++) {
            for (i=ii; i<MIN(ii+b1-1,m); i++) {
               for (k=kk; i<MIN(kk+b2-1,l); k++) {
                  c[i][j] += a[i][k]*b[k][j];
               }
            }
         }
      }
   }

The transformed code will actually be a bit more complicated than this: register blocking will be used on the i- and j-loops, separate cleanup loops will be generated, and the compiler will take multiple levels of cache into consideration.

For the majority of programs, these transformations will be the proper optimizations to make. But in some cases, they can hurt performance. For example, if the data already fit in cache, the blocking introduces overhead and should be avoided. Several flags and directives are available to fine-tune the blocking that the LNO performs.

If a loop operates on in-cache data or if you have already done your own blocking, the LNO's blocking may be turned off completely via the following flag and directive:

-LNO:blocking=off
#pragma no blocking
c*$* no blocking

If you would like to use a different block size than the default calculated by the LNO, there are two ways to do this. You can specify a specific block size with the following flag and directives:

-LNO:blocking_size=[n1][,n2]
#pragma blocking size (n1,n2)
#pragma blocking size (n1)
#pragma blocking size (,n2)
C*$* BLOCKING SIZE (n1,n2)
C*$* BLOCKING SIZE (n1)
C*$* BLOCKING SIZE (,n2)

These specify the size of the blocking to be used for the primary (n1) and secondary (n2) caches. If the compiler ends up blocking the loop, it will use the size specified, but due to interactions with other optimzations, there is no guarantee that the compiler will be able to block all loops. Since primary cache misses can generally be overlapped with other work, you rarely need to block for the primary cache. To block for just the secondary cache, set n1 to 0.

The directives apply to the loop immediately following them. If a blocking size is set to zero, then that loop will be included inside the blocking the compiler generates. Here is an example:

      subroutine amat(x,y,z,n,m,mm)
      real*8 x(1000,1000), y(1000,1000), z(1000,1000)

      do k = 1, n
C*$* BLOCKING SIZE (0,200)
         do j = 1, m
C*$* BLOCKING SIZE (0,200)
            do i = 1, mm
               z(i,k) = z(i,k) + x(i,j)*y(j,k)
            enddo
         enddo
     enddo
     end

For this example, the compiler will try to make 200x200 blocks for the j- and i-loops. But there is no requirement that it interchange loops so that the k-loop is inside the other two. You can instruct the compiler that you definitely want this to be done by inserting a directive with a block size of 0 above the k-loop:

      subroutine amat(x,y,z,n,m,mm)
      real*8 x(1000,1000), y(1000,1000), z(1000,1000)

C*$* BLOCKING SIZE (0,0)
      do k = 1, n
C*$* BLOCKING SIZE (0,200)
         do j = 1, m
C*$* BLOCKING SIZE (0,200)
            do i = 1, mm
               z(i,k) = z(i,k) + x(i,j)*y(j,k)
            enddo
         enddo
     enddo
     end

This produces a blocking of the form

      subroutine amat(x,y,z,n,m,mm)
      real*8 x(1000,1000), y(1000,1000), z(1000,1000)
      do jj = 1, m, 200
         do ii = 1, mm, 200
            do k = 1, n
               do j = jj, MIN(m, jj+199)
                  do i = ii, MIN(mm, ii+199)
                     z(i,k) = z(i,k) + x(i,j)*y(j,k)
                  enddo
               enddo
            enddo
         enddo
      enddo
      end

The second way to affect block sizes is to modify the compiler's cache model. The LNO makes blocking decisions at compile-time using a fixed description of the processor's cache system. While the -Ofast=ip27 flag tells the compiler to optimize for Origin systems, it can't know what size the secondary cache will be in the system on which the program runs since both 1 MB and 4 MB secondary caches are available for Origin2000. As a result, it chooses to optimize for the lowest common denominator, the 1 MB cache.

There are several flags available with which you can tell the compiler to optimize for a specific cache size and structure. Those for the primary cache are

-LNO:cache_size1=n
-LNO:line_size1=n
-LNO:associativity1=n
-LNO:clean_miss_penalty1=n
-LNO:is_memory_level1=on|off

One may also specify these parameters for the level-2 cache, and there are other parameters for describing the TLB cache, too. In general, the defaults work well, so you'll probably never need to use these. But the cc(1) and f77(1) reference pages describe them all in case you ever feel the need to try to tune them.

3.5.4.7 Loop Fusion and Fission

Earlier we saw that loop fusion is a standard technique which can improve cache performance. This is an optimization which the LNO performs automatically. It will even perform loop peeling, if necessary, to be able to fuse loops. For example,

      do i = 1, n
         a(i) = 0.0
      enddo

      do i = 2, n-1
         c(i) = 0.5 * (b(i+1) + b(i-1))
      enddo

gets fused into something similar to

      a(1) = 0.0

      do i = 2, n-1
         a(i) = 0.0
         c(i) = 0.5 * (b(i+1) + b(i-1))
      enddo

      a(n) = 0.0

The first and last iterations of the first loop are peeled off so that we end up with two loops which run over the same iterations; these two loops may then trivially be fused.

While loop fusion generally improves cache performance, it does have one potential drawback: it makes the body of the loop larger. Large loop bodies place greater demands on the compiler: More registers are required to efficiently schedule the loop. The compilation takes longer since the algorithms for register allocation and software pipelining grow faster than linearly with the size of the loop body. To balance these negative aspects the LNO also performs loop fission, whereby large loop are broken into smaller, more manageable ones. For example, register use in a loop such as the following

for (j=0; j<n; j++) {
   ...
   s = ...
   ...
   ... = s
   ...
}

can be reduced by breaking the loop into two pieces:

for (j=0; j<n; j++) {
   ...
   se[j] = ...
   ...
}

for (j=0; j<n; j++) {
   ...
   ... = se[j]
   ...
}

The use of the register s is eliminated by storing the intermediate value in the array se[].

But loop fission has other benefits. In the loop nest below,

for (j=0; j<n; j++) {
   for (i=0; i<n; i++) {
      b[i][j] = a[i][j];
   }
   for (i=0; i<n; i++) {
      c[i][j] = b[i+m][j];
   }
}

the cache behavior is likely to be poor due to the non-unit stride accesses. If the loops could be reversed, this problem would be fixed. Loop fission, by separating the two inner loops, allows the LNO to interchange both and improve the cache behavior:

for (i=0; i<n; i++) {
   for (j=0; j<n; j++) {
      b[i][j] = a[i][j];
   }
}

for (i=0; i<n; i++) {
   for (j=0; j<n; j++) {
      c[i][j] = b[i+m][j];
   }
}

Like most optimizations, loop fission presents some tradeoffs. Some common subexpression elimination which occurs when all the code is in one loop can be lost when the loops are split. In addition, more loops means more code, so instruction cache performance can suffer. Loop fission also needs to be balanced with loop fusion, which has its pwn benefits and liabilities.

The compiler attempts to mix both optimizations, as well as the others it performs, to produce the best code possible. Since different mixtures of optimizations are required in different situations, it is impossible for the compiler to always achieve the best performance. As a result, several flags and directives are provided to control loop fusion and fission:

Flags: -LNO:fission=0|1|2; -LNO:fusion=0|1|2; -LNO:fusion_peeling_limit=n
C pragmas: #pragma aggressive inner loop fission; #pragma fission [ (n) ]; #pragma fissionable; #pragma fuse [(n [,level] )]; #pragma fusable; #pragma no fission; #pragma no fusion
Fortran directives: C*$* AGGRESSIVE INNER LOOP FISSION; C*$* FISSION [(n)]; C*$* FISSIONABLE; C*$* FUSE [(n [,level] )]; C*$* FUSABLE; C*$* NO FISSION; C*$* NO FUSION

The default flags are -LNO:fusion=1 and -LNO:fission=1. This default behavior means that outer loops are fused and fission is performed as necessary, with fusion preferred over fission. One can specify that fission is preferred over fusion by setting -LNO:fission=2. If one also sets -LNO:fusion=2, then fusion will once again be preferred to fission, and more aggression fusion will be performed. These flags apply to all loops in the file being compiled.

Fission may be disabled for all loops in a file using the -LNO:fission=0 flag. It is disabled on a loop-by-loop basis with the directive #pragma no fission for C, and C*$* NO FISSION for Fortran. Similarly, fusion is disabled using -LNO:fusion=0, #pragma no fusion, or C*$* NO FUSION.

The remaining fusion and fission directives are described in the cc(1) and f77(1) reference pages.

3.5.4.8 Prefetching

As mentioned above, the MIPS IV ISA contains prefetch instructions which can move data from main memory into cache in advance of their use. This allows some or all of the latency required to access the data to be hidden behind other work. One of the optimizations the LNO performs is to insert prefetch instructions into your program. We can see how this works by considering a few simple examples:

for (i=0; i<n; i++) {
   a += b[i];
}

If the above loop were to be executed exactly as written, every secondary cache miss of the vector b[] would stall the processor since there is no other work to do. Thus, the full cost of the latency required to read the cache lines in from memory would have to be paid.

Now consider the following modification:

for (i=0; i<n; i++) {
   prefetch b[i+16];
   a += b[i];
}

Here, prefetch <addr> is not a C statement; it is meant to indicate that a prefetch instruction for the data at the specified address is issued at that point in the program. If b[] is a double array, the address b[i+16] is one full cache line (128 bytes) ahead of the value which is being read in the current iteration, b[i]. With this prefetch instruction inserted into the loop, each cache line is prefetched 16 iterations prior to when it needs to be used. Now, each iteration of the loop will take two cycles since it is load-bound, and the prefetch instruction and the load of b[i] will each take one cycle. Thus, cache lines will be prefetched 32 cycles in advance of their use. A cache miss to local memory takes approximately 60 cycles, so half of this latency will be overlapped with work on the previous cache line.

As written, the above loop issues a prefetch instruction for every load. For out-of-cache data, this is not so bad since we still have to wait for half of the latency of the cache miss. But if this same loop were run on in-cache data, the performance would be half of what it could be. Since we only need one prefetch instruction per cache line, we might instead think of inserting the prefetches as follows:

for (i=0; i<n; i++) {
   if ((i % 16) == 0) prefetch b[i+16];
   a += b[i];
}

This implementation issues just one prefetch instruction per cache line, but the if costs more than issuing redundant prefetches, so it is no improvement. A better way to prefetch is to unroll the loop:

for (i=0; i<n; i+=2) {
   prefetch b[i+16];
   a += b[i+0];
   a += b[i+1];
}

Now, half as many prefetches are issued, so the in-cache case pays for 8 fewer redundant prefetch instructions. By further unrolling, the redundant prefetches can be reduced even more; in fact, unrolling by 16 reduces them completely.

The LNO automatically inserts prefetch instructions, and it uses unrolling to limit the number of redundant prefetch instructions it generates. Generally, it will not unroll the loop enough to completely eliminate the redundancies since the extra unrolling, like too much loop fusion, can have a negative impact on software pipelining. So, in prefetching, as in all its other optimzations, the LNO tries to achieve a delicate balance.

There is one further thing the LNO does to improve prefetch performance. Instead of prefetching just one cache line ahead, it prefetches two or more cache lines ahead so that most, if not all, of the memory latency is overlapped. For the loop above, the compiler inserts prefetches as follows:

for (i=0; i<n; i+=8) {
   prefetch b[i+32];
   a += b[i+0];
   a += b[i+1];
   a += b[i+2];
   a += b[i+3];
   a += b[i+4];
   a += b[i+5];
   a += b[i+6];
   a += b[i+7];
}

Thus, data are prefetched two cache lines ahead, and only one redundant prefetch per cache line is made.

3.5.4.9 Pseudo Prefetching

Prefetch instructions can be used to move data into the primary cache from either main memory or the secondary cache. But since the latency to the secondary cache is small, 8--10 cycles, it can usually be hidden without resorting to prefetch instructions. Consider the following unrolled loop:

for (i=0; i<n; i+=4) {
   a += b[i+0];
   a += b[i+1];
   a += b[i+2];
   a += b[i+3];
}

By introducing a temporary register and reordering the instructions, this becomes

for (i=0; i<n; i+=4) {
   t  = b[i+3];
   a += b[i+0];
   a += b[i+1];
   a += b[i+2];
   a += t;
}

Written this way, data in the next cache line, b[i+4](primary cache lines are 32 bytes), will be referenced several cycles before they are needed, and so will be overlapped with operations on the current cache line.

This technique is called pseudo prefetching, and it is the default way that the LNO hides the latency to the secondary cache. Since no prefetch instructions are used, it introduces no overhead for in-cache data. It does require the use of additional registers, however.

In case you ever want to see how performance is affected by using real prefetches rather than pseudo prefetches from the secondary cache, the flag -CG:pf_l1=true may be used to convert the pseudo prefetches into genuine prefetch instructions.

3.5.4.10 Controlling Prefetching

In order to provide the best performance for the majority of programs, the LNO generally assumes that data whose sizes it cannot determine are not cache-resident. It generates prefetch instructions in loops involving these data. Sometimes --- for example, as a result of cache blocking ---it can determine that certain cache misses will be satisfied from the secondary cache rather than main memory. For these misses pseudo prefetching is used. While this default behavior works well most of the time, it may impose overhead costs in cases where the data are small enough to fit in cache, or where you have already put cache blocking into the program. To fine tune such situations, several flags and directives are provided to control prefetch.

The flag -LNO:prefetch controls whether prefetching is enabled or disabled. The default is -LNO:prefetch=1, which means that prefetches are enabled and used for known, large data structures. They are also used conservatively for data structures whose size the compiler cannot determine. This is the level of prefetching that should be used most of the time. In cases where the data are known to fit in the primary cache, overhead can be reduced by disabling prefetching altogether. This is done with the flag -LNO:prefetch=0. Finally, the level -LNO:prefetch=2 is provided to tell the compiler to prefetch aggressively for data structures whose sizes are unknown. If you run into a case in which the default -LNO:prefetch=1 has not automatically generated a prefetch for an array that needs one, this more aggressive prefetching level may help.

Since the primary cache is small, situations which warrant completely disabling prefetching are rare. More often, small data sets are too big for the primary cache, but fit in the secondary cache. In these cases, it may be useful to disable prefetch from main memory, but still have the compiler perform pseudo prefetching from the secondary to primary caches. This can be done with the flag -LNO:prefetch_level2=off (or, equivalently, -LNO:pf2=off). This is a specific instance of the general option -LNO:prefetch_leveln=on|off (or, equivalently, -LNO:pfn=on|off) which enables/disables prefetch for cache level n. Since the R10000 has two levels of cache, -LNO:pf1=off:pf2=off is equivalent to -LNO:prefetch=0.

By default, the LNO prefetches data two cache lines ahead, as demonstrated above. If this is not sufficient to allow maximal overlap of data movement with other work, it can be increased with the flag -LNO:prefetch_ahead=n. For simple vector loops such as a copy or DAXPY, prefetching 2 cache lines ahead is definitely better than 1; however, only a slight improvement, if any, may be observed by increasing the prefetch ahead value from 2 to 3 or 4. Prefetching more than 4 cache lines ahead does not help, though, since the R10000 allows up to 4 outstanding cache misses at a given time. Generating an additional miss once 4 are already active causes the processor to stall (i.e., wait until one of the misses has completed moving its data into the cache).

These prefetch flags, of course, apply to all code in the file being compiled. In addition to the flags, there are also directives to allow finer control of prefetching within a file. While some of the directives just provide an alternate means of applying what the flags do, others allow you to insert prefetch directives exactly where you want them. This can be a very useful capability, but it is definitely intended for sophistocated users. Before proceeding, we mention one additional flag: -LNO:prefetch_manual=on|off. This flag may be used to enable/disable the directives en masse. This will allow you to test whether manually inserted directives are of benefit. Now on to the directives.

#pragma prefetch (n1 [ ,n2 ])
#pragma prefetch_manual (n)
C*$* PREFETCH (n1[,n2])
C*$* PREFETCH_MANUAL (n)

do exactly what their counterpart flags do, but they apply to just the function or subroutine containing the directive. The n1 and n2 values for the prefetch directive specify the behavior for the two cache levels.

The directive

#pragma prefetch_ref_disable=A
C*$* PREFETCH_REF_DISABLE=A

explicitly disables prefetching all references to the array A in the current function or subroutine. Automatic prefetching, unless it is also disabled (C*S$* PREFETCH (0), -LNO:prefetch=0), is done for the other arrays in the function or subroutine.

Manual Prefetching

Automatic prefetching generally does a good job. For standard vector code, you won't be able to do better, even writing in assembler. But it can't do a perfect job for all code. One of the limitations automatic prefetching has is that it only applies to dense arrays. If sparse data structures are being used, it is not always possible for the compiler to analyze where prefetch instructions should be placed. If your code does not access data in a regular fashion using loops, prefetches will also not be used. Furthermore, data accessed in outer loops may not be prefetched. So to allow you to insert prefetch instructions exactly where they need to be, the following directive is provided:

#pragma prefetch_ref=array-ref,
        [stride=[str][,str]], [level=[lev][,lev]], [kind=[rd/wr]],
        [size=[sz]]

C*$* PREFETCH_REF=array-ref,[stride=[str][,str]],
C*$*& [level=[lev][,lev]], [kind=[rd/wr]], [size=[sz]]

This directive generates a single prefetch instruction to the memory location specified by array-ref. This memory location does not have to be a compile-time constant expression: it can involve any program variables, for example, a(i+16,j). Automatic prefetching, if enabled, will then ignore all references to this array in the loop nest containing the directive; this allows manual prefetching to be used in conjunction with automatic prefetching on other arrays.

Ideally, exactly one prefetch instruction per cache line should be used. The stride option is provided to allow you to specify how often the prefetch instruction is to be issued. For example, in the following code

c*$* prefetch_ref=a(1)
c*$* prefetch_ref=a(1+16)
      do i = 1, n
c*$* prefetch_ref=a(i+32), stride=16, kind=rd
         sum = sum + a(i)
      enddo

the stride=16 clause in the prefetch_ref directive tells the compiler it should only insert a prefetch instruction once every 16 iterations of the loop. As with automatic prefetching, the compiler will unroll the loop and place just one prefetch instruction in the loop body. Since there is a limit to the amount of unrolling which is benficial, the compiler may choose to unroll less than that specified in the stride value.

The MIPS IV prefetch instructions allow one to specify whether the prefetched line is intended to be read from or written into. For the R10000 this information can be used to stream some data through one set of the cache while leaving other data in the second set, and this can result in fewer cache conflicts. Thus, it is a good idea to specify whether the prefetched data are to read or written. If they will be both read and written, specify that they are to be written.

One may specify for which level in the memory hierarchy to prefetch with the level=[lev][,lev] clause. Most often one needs only to prefetch from main memory into L1, so the default level of 2 is usually exactly what is needed

One may use the size=sz clause to tell the compiler how much space in the cache the array being manually prefetched is expected to use. This value is then used to help decide which other arrays should be automatically prefetched. If the compiler believes there is a lot of available cache space, other arrays are less likely to be prefetched since there is room for them --- or, at least, blocks of them --- to reside in the cache. On the other hand, if the compiler believes the other arrays won't fit in the cache, it will generate code which will stream them though the cache, and it will use prefetch instructions to improve the out-of-cache performance. So, when using the prefetch_ref directive, if you find that other arrays are not automatically being prefetched, you can use this option with a large sz value to encourage the compiler to generate prefetch instructions for them. Conversely, if other arrays are being prefetched and they should not be, a small sz value may encourage the compiler to not prefetch the other arrays.

The best way to understand when the prefetch_ref directive can provide improvement over automatic prefetching is through an example. Below, a detailed case study has been provided which demonstrates how this directive, as well as some other cache techniques and compiler flags, are used.

3.5.4.11 Padding

As we saw earlier, unlucky alignment of arrays can cause significant performance penalties due to cache conflicts, and it is particularly easy to get unlucky if array dimensions are a power-of-two in size. Fortunately, the LNO automatically pads arrays to eliminate or reduce such cache conflicts. Local variables, such as those in the following

      dimension a(max), b(max), c(max), d(max)

      do i = 1, max
         a(i) = b(i) + c(i)*d(i)
      enddo

will be spaced out so that cache thrashing does not occur. Note, though, that if you compile this with -flist, you won't see padding explicitly in the listing file.

The MIPSpro 7.1 compiler also pads C global variables and common blocks such as

      subroutine sub
      common a(512,512), b(512*512)

when the highest optimization level, -O3 or -Ofast, is used. It does this by splitting off each variable into its own common block. This optimzation, however, only occurs if

all routines which use the common block are compiled at -O3,
the compiler detects no inconsistent uses of the common block across routines.

If any routine which uses the common block is compiled at an optimization level lower than -O3, or if there are inconsistent uses, the splitting, and hence the padding, will not be performed. Instead, the linker will paste the common blocks back together and the original unsplit version will be used. Thus, this implementation of common block padding is perfectly safe, even if a mistake in usage is made.

There is one thing a programmer can do, though, that can cause problems. If out-of-bounds array accesses are made, wrong answers can result. For example, one not uncommon usage is to operate on all data in a common block as if they were contiguous:

      subroutine sub
      common a(512,512), b(512*512)

      do i = 1, 2*512*512
         a(i) = 0.0
      end

Here, both a and b are zeroed out by indexing only though a. This code will not work as expected if common block padding is used. Note that this is not legal Fortran.

There are three ways to deal with problems of this sort:

Compile those routines involving this common block at -O2 instead of -O3. This is not suggested since it is lkely to severely hurt performance.
Compile using the flag -OPT:reorg_common=off. This will disable the common block padding optimization. While it will work around the coding mistake, it will mean other common blocks will also not be padded and may result in less than optimal performance.
Fix the code so that it is legal Fortran!

Naturally, we recommend the last alternative. To detect such programming mistakes, the -C (or, equivalently, -check_bounds) flag should be used. Be sure to test your Fortran programs with this flag if you should ever encounter incorrect answers as a result of compiling with -O3.

If inter-procedural analysis is used, the compiler can also perform padding inside common block arrays. For example, in the common block below,

      common a(1024,1024)

the compiler will be able to pad the first dimension of a if the -IPA (or -Ofast) flag is used.

Note: the MIPSpro 7.0 compiler uses a different mechanism to pad common blocks. This is controlled by the flag -OPT:pad_common=on|off. This implementation does not do consistency checks, and so is potentially unsafe. If you still have makefiles which enable this optimzation, they should be changed to no longer use this option.

3.5.4.12 Gather-Scatter and Vector Intrinsics

The LNO performs two other optimizations on loops: gather-scatter and converting scalar math intrinsics to vector calls.

The gather-scatter optimization is best explained via an example. In the following code fragment,

      subroutine fred(a,b,c,n)
      real*8 a(n), b(n), c(n)

      do i = 1, n
         if (c(i) .gt. 0.0) then
            a(i) = c(i)/b(i)
            c(i) = c(i)*b(i)
            b(i) = 2*b(i)
         endif
      enddo

      end

the loop can be software pipelined even though there is a branch inside it; the resulting code, however, will not be particularly fast: the compiler needs to generate instructions for the code protected by the if, and these will be carried out in every iteration with the results being updated only if the conditional test is true. Usually, a faster implementation is the following:

      do i = 1, n
         deref_gs(inc_0 + 1) = i
         if (c(i) .GT. 0.0) then
            inc_0 = (inc_0 + 1)
         endif
      enddo

      do ind_0 = 0, inc_0 + -1
         i_gs = deref_gs(ind_0 + 1)
         a(i_gs) = (c(i_gs)/b(i_gs))
         c(i_gs) = (b(i_gs)*c(i_gs))
         b(i_gs) = (b(i_gs)*2.0)
         endif
      enddo

Here, deref_gs is a scratch array allocated off the stack by the compiler.

In this transformed code, we first scan through the data gathering together the iterations for which the conditional test is true. Then the second loop does the work on the gathered data. Since the conditional is true for all these iterations, the if statement may be removed from the second loop allowing it to be very efficiently sofware pipelined. In addition, if a significant number of iterations did not pass the test, a lot of speculative work has been eliminated.

By default, the compiler will perform this optimization for loops which have non-nested if statements. For loops which have nested if statements, the optimization can be tried by using the flag -LNO_gather_scatter=2. For these loops, the optimization is less likely to be advantageous, so this is not the default. In some instances, it could turn out that this optimization slows the code down (e.g., if n is not very large). In such cases, one can disable this optimization completely with the flag -LNO:gather_scatter=0.

Vector Intrinsics

In addition to the gather_scatter optimzation, the LNO will also convert scalar math intrinsic calls into vector calls so they can take advantage of the vector math routines in the math library. Thus, loops such as

      subroutine vfred(a)
      real*8 a(200)
      do i = 1, 200
         a(i) = a(i) + cos(a(i))
      enddo
      end

will be converted into the following:

      CALL vcos$(a(1), deref_se1_F8(1), %val((201 - 1)), %val(1), %val(1))
      DO i = 1, 200, 1
        a(i) = (a(i) + deref_se1_F8(i))
      END DO

Here, deref_se1_F8 is a scratch array allocated off the stack by the compiler.

The vector version is almost always faster than the scalar version, but the results using the vector routines may not agree bit-for-bit with the scalar routines. If for some reason it is critical to have bit-for-bit agreement, then one can disable this optimization with the flag -LNO:vintr=off.

3.5.4.13 What Can You Do?

So what can you do to help the loop nest optimizer generate the fastest code?

Be sure to use the most aggressive aliasing option you can with C code. If you can use -OPT:alias=restrict or -OPT:alias=disjoint, by all means do.
Use the ivdep directive with those loops for which it is valid. If need be, modify its meaning via the -OPT:cray_ivdep and -OPT:liberal_ivdep flags.
Avoid equivalences in Fortran code since they, too introduce aliasing, and will prevent padding and other optimizations.
Avoid goto statements. These obscure control flow and prevent the compiler from analyzing your code as effectively as possible.
Keep common blocks consistent so that the LNO can properly pad them.
Don't unroll loops by hand (unless you don't want the compiler to unroll them).
Don't violate the Fortran standard. Out-of-bounds array refernces can cause the common block padding optimizations to generate wrong results.

3.5.4.14 Example: Using Prefetch Directives on a 3D-Stencil Code

The loop nest optimizer will automatically insert prefetch instructions into programs. This works well for for one-dimensional vector operations, but when multi-dimensional arrays are involved, it may not be possible to determine the proper prefetches for data indexed by non-innermost loops. This sidebar shows how LNO's prefetch directives may be used to improve performance in such situations.