Performance Tuning for the Origin2000


Contents

  1. Origin2000 Architecture
    1. Introduction: Scalable Shared Memory
    2. Shared Memory Without a Bus
      1. Node Board
      2. Cache Coherence
      3. I/O System
      4. Configurations
      5. Latencies and Bandwidths
    3. R10000 Architecture
      1. 4-way Superscalar Architecture
      2. MIPS IV ISA
      3. Cache Architecture
      4. Out-of-Order Execution
      5. L2 Resource Conflicts
  2. Origin2000 Programming
    1. It's Just Shared Memory
    2. Cellular IRIX Memory Locality Management
      1. Memory Locality
      2. Memory Locality Domains
      3. Policy Modules
    3. Recommendations for Achieving Good Performance
  3. Single Processor Tuning
    1. Step 1: Get the Right Answers
      1. Porting Issues
      2. Symbolic Debuggers
    2. Step 2: Use Existing Tuned Code
      1. libfastm
      2. complib.sgimath
      3. Documentation
    3. Step 3: Find Out Where to Tune
      1. Profiling Tools
      2. Hardware Counter Registers
      3. How to Do Performance Analysis Using perfex
      4. Using SpeedShop
        1. PC Sampling Profiling
        2. Using prof
        3. Ideal Time Profiling
        4. Gprof
        5. Usertime
        6. Hardware Counter Profiling
        7. Exception Profiling
      5. Address Space Profiling
    4. Step 4: Let the Compiler Do the Work
      1. Recommended Flags
        1. Don't Rely on Defaults
        2. -On
        3. -r10000, -TARG:proc=r10000 & -TARG:platform=ipxx
        4. -OPT:IEEE_arithmetic
        5. -OPT:roundoff
        6. Aliasing Models
      2. Software Pipelining
        1. Example: DAXPY
        2. The Software Pipelined Schedule for DAXPY
        3. Software Pipelining Messages
        4. Use -O3 to Enable Software Pipelining
        5. Software Pipelining Failures
      3. Other Compiler Options and Directives
        1. -OPT:alias=restrict and -OPT:alias=disjoint
        2. The ivdep Directive
        3. C Loops
        4. -TENV:X=n
      4. Inter-Procedural Analysis
        1. Inlining
        2. IPA Programming Hints
    5. Step 5: The Cache Hierarchy and Loop Nest Optimizations
      1. Cache Basics
        1. Stride-1 Accesses
        2. Group Together Data Used at the Same Time
        3. Cache Thrashing and Array Padding
      2. Use perfex and SpeedShop to Identify Cache Problems
      3. Standard Techniques for Improving Cache Performance
        1. Loop Fusion
        2. Cache Locking
        3. Transposes
        4. Using Larger Page Sizes to Reduce TLB Misses
      4. Loop Nest Optimizations
        1. Running the LNO
        2. Visualizing Transformations
        3. Outer Loop Unrolling
        4. Loop Interchange
        5. Loop Interchange and Outer Loop Unrolling
        6. Controlling Cache Blocking
        7. Loop Fusion and Fission
        8. Prefetching
        9. Pseudo Prefetching
        10. Controlling Prefetching
        11. Padding
        12. Gather-Scatter and Vector Intrinsics
        13. What Can You Do?
        14. Example: Using Prefetch Directives on a 3D Stencil Code
  4. Multi-Processor Programming
    1. Introduction
    2. Parallel Speedup and Amdahl's Law
      1. Adding CPUs to Shorten Execution Time
      2. Parallel Speedup S(P)
      3. Amdahl's Law
      4. Calculating the Value of F
      5. Predicting Execution Time with P CPUs
    3. Explicit Models of Parallel Computation
      1. Fortran Source with Directives
      2. C and C++ Source with Pragmas
      3. Message-Passing Models MPI and PVM
      4. C Source Using POSIX Threads
      5. C and C++ Source Using UNIX Processes
    4. Compiling Serial Code for Parallel Execution
    5. Tuning Parallel Code for Origin2000
      1. Message Passing Programs
      2. Page Placement Issues
      3. Using dplace to Distribute Processes and Data
      4. Using Default and Round-Robin Allocation
      5. Using Dynamic Page Migration
      6. Using Compiler Directives to Distribute Data
      7. Cache Contention
      8. Correcting Cache Contention

Reference Material