Performance Tuning for the Origin2000
- Origin2000 Architecture
- Introduction: Scalable Shared Memory
- Shared Memory Without a Bus
- Node Board
- Cache Coherence
- I/O System
- Configurations
- Latencies and Bandwidths
- R10000 Architecture
- 4-way Superscalar Architecture
- MIPS IV ISA
- Cache Architecture
- Out-of-Order Execution
- L2 Resource Conflicts
- Origin2000 Programming
- It's Just Shared Memory
- Cellular IRIX Memory Locality Management
- Memory Locality
- Memory Locality Domains
- Policy Modules
- Recommendations for Achieving Good Performance
- Single Processor Tuning
- Step 1: Get the Right Answers
- Porting Issues
- Symbolic Debuggers
- Step 2: Use Existing Tuned Code
- libfastm
- complib.sgimath
- Documentation
- Step 3: Find Out Where to Tune
- Profiling Tools
- Hardware Counter Registers
- How to Do Performance Analysis Using perfex
- Using SpeedShop
- PC Sampling Profiling
- Using prof
- Ideal Time Profiling
- Gprof
- Usertime
- Hardware Counter Profiling
- Exception Profiling
- Address Space Profiling
Step 4:
Let the Compiler Do the Work
-
Recommended
Flags
-
Don't
Rely on Defaults
-
-On
-
-r10000,
-TARG:proc=r10000 & -TARG:platform=ipxx
-
-OPT:IEEE_arithmetic
-
-OPT:roundoff
-
Aliasing
Models
-
Software
Pipelining
-
Example:
DAXPY
-
The
Software Pipelined Schedule for DAXPY
-
Software
Piplining Messages
-
Use
-O3 to Enable Software Pipelining
-
Software
Pipelining Failures
-
Other
Compiler Options and Directives
-
-OPT:alias=restrict
and -OPT:alias=disjoint
-
The
ivdep
Directive
-
C Loops
-
-TENV:X=n
-
Inter-Procedural
Analysis
-
Inlining
-
IPA
Programming Hints
-
Step
5: The Cache Hierarchy and Loop Nest Optimizations
-
Cache Basics
-
Stride-1
Accesses
-
Group
Together Data Used at the Same Time
-
Cache
Thrashing and Array Padding
-
Use perfex
and speedshop to Identify Cache Problems
-
Standard
Techniques for Improving Cache Performance
-
Loop Fusion
-
Cache locking
-
Transposes
-
Using
Larger Page Sizes to Reduce TLB Misses
-
Loop
Nest Optimizations
-
Running the
LNO
-
Visualizing
Transformations
-
Outer
Loop Unrolling
-
Loop
Interchange
-
Loop
Interchange and Outer Loop Unrolling
-
Controlling
Cache Blocking
-
Loop
Fusion and Fission
-
Prefetching
-
Pseudo
Prefetching
-
Controlling
Prefetching
-
Padding
-
Gather-Scatter
and Vector Intrinsics
-
What Can
You Do?
-
Example:
Using Prefetch Directives on a 3D Stencil Code
- Multi-Processor Programming
- Introduction
- Parallel Speedup and Amdahl's Law
- Adding CPUs to Shorten Execution Time
- Parallel Speedup S(P)
- Amdahl's Law
- Calculating the Value of F
- Predicting Execution Time with P CPUs
- Explicit Models of Parallel Computation
- Fortran Source with Directives
- C and C++ Source with Pragmas
- Message-Passing Models MPI and PVM
- C Source Using POSIX Threads
- C and C++ Source Using UNIX Processes
- Compiling Serial Code for Parallel Execution
- Tuning Parallel Code for Origin2000
- Message Passing Programs
- Page Placement Issues
- Using dplace to Distribute Processes and Data
- Using Default and Round-Robin Allocation
- Using Dynamic Page Migration
- Using Compiler Directives to Distribute Data
- Cache Contention
- Correcting Cache Contention
- Reference Material