Pipeline Optimization

Overview
Locating the bottleneck
Performance measurements
Optimizations
Balancing the pipeline
Other optimizations: multi-processing, parallel processing

“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil”
– Donald Knuth
Make it run first, then optimize
But only optimize where it makes any difference
Pipeline Optimization: Process to maximize the rendering speed, then allow stages that are not bottlenecks to consume as much time as the bottleneck.
ITCS 4010/5010:Game Engine Design 1 Pipeline Optimization ITCS 4010/5010:Game Engine Design 2 Pipeline Optimization
Pipeline Optimization
Stages execute in parallel
The slowest stage is always the bottleneck of the pipeline
The bottleneck determines throughput (i.e., maximum speed)
The bottleneck is the average bottleneck over a frame
Cannot measure intra-frame bottlenecks easily
Bottlenecks can change over a frame
Most important: find the bottleneck, then optimize that stage!

Locating the Bottleneck
Two bottleneck location techniques:
Technique 1:
◦ Make a certain stage work less
◦ If performance is better, then that stage is the bottleneck
Technique 2:
◦ Make the other two stages work less or (better) not at all
◦ If performance is the same, then the stage not included above is the bottleneck
Complication: the bus between CPU and graphics card may be the bottleneck (not a typical stage)
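Technique 1 can be illustrated with a toy model of the pipeline. The sketch below is not a measurement tool: it assumes made-up per-stage costs in milliseconds and a frame time equal to the slowest stage, which follows from the stages executing in parallel. The names `StageTimes`, `frame_time_ms`, and `stage_is_bottleneck` are hypothetical.

```c
#include <assert.h>

/* Hypothetical per-frame cost of each stage, in milliseconds. */
typedef struct {
    double app_ms;
    double geom_ms;
    double rast_ms;
} StageTimes;

typedef enum { STAGE_APP, STAGE_GEOM, STAGE_RAST } Stage;

/* The stages run in parallel, so the slowest one sets the frame time. */
double frame_time_ms(StageTimes t)
{
    double slowest = t.app_ms;
    if (t.geom_ms > slowest) slowest = t.geom_ms;
    if (t.rast_ms > slowest) slowest = t.rast_ms;
    return slowest;
}

/* Technique 1: make one stage work less (here: half as much) and
   check whether the frame gets faster. Returns 1 if it does. */
int stage_is_bottleneck(StageTimes t, Stage s)
{
    StageTimes reduced = t;
    switch (s) {
    case STAGE_APP:  reduced.app_ms  *= 0.5; break;
    case STAGE_GEOM: reduced.geom_ms *= 0.5; break;
    case STAGE_RAST: reduced.rast_ms *= 0.5; break;
    default: assert(!"unknown stage");
    }
    return frame_time_ms(reduced) < frame_time_ms(t);
}
```

With costs of 4, 9, and 6 ms for the application, geometry, and rasterizer stages, only reducing the geometry stage shortens the frame in this model. In a real program the frame times would be measured rather than computed.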
Application (CPU) Stage the Bottleneck?
Use the top or osview command on Unix, or Task Manager on Windows.
If the app uses (near) 100% of CPU time, then very likely the application is the bottleneck
Using a code profiler is safer.
Make the CPU do less work (e.g., turn off collision detection)
Replace glVertex and glNormal with glColor
◦ Makes the geometry and rasterizer stages do almost nothing
◦ No vertices to transform, no normals to compute lighting for, no triangles to rasterize
◦ If performance does not change, the program is CPU-bound, or CPU-limited

Geometry Stage the Bottleneck?
Trickiest stage to test
Why? A change in geometry workload usually changes the application and rasterizer workload.
Number of light sources only affects the geometry stage:
◦ Disable light sources (vertex shaders can make this simple).
◦ If performance goes up, then geometry is the bottleneck, and the program is transform-limited
Alternately, enable all light sources; if performance stays the same, the geometry stage is NOT the bottleneck
Alternately, test CPU and rasterizer instead
Rasterizer Stage the Bottleneck?
The easiest and fastest stage to test
Simply decrease the size of the window you render to
◦ Does not change app. or geometry workload
◦ But the rasterizer needs to fill fewer pixels
◦ If the performance goes up, then the program is “fill-limited” or “fill-bound”
Make the rasterizer work less: turn off texturing, fog, blending, depth buffering, etc. (if your architecture has performance penalties for these)

Optimization
Optimize the bottleneck stage
Only put in enough effort so that the bottleneck stage moves
Did you get enough performance?
◦ Yes! Quit optimizing
◦ No! Continue optimizing the (possibly new) bottleneck
If close to the maximum speed of the system, might need to turn to acceleration techniques (spatial data structures, occlusion culling, etc.)
Illustrating Optimization
Height of bar: time it takes for that stage for one frame
Highest bar is the bottleneck
After optimization: bottleneck has moved to APP
No use in optimizing GEOM, turn to optimizing APP instead

Application Stage Optimization
Initial Steps:
◦ Turn on optimization flags in the compiler
◦ Use code profilers, which show the places where the majority of time is spent
◦ This is time-consuming stuff
Strategy 1: Efficient code
◦ Use fewer instructions
◦ Use more efficient instructions
◦ Recode algorithmically
Strategy 2: Efficient memory access
Application: Code Optimization Tricks
SIMD instruction sets are perfect for vector ops
◦ 2-4 operations in parallel
◦ SSE, SSE2, 3DNow! are examples
Division is an expensive operation
◦ Between 4-39 times slower than most other instructions
◦ Good usage example: vector normalization
◦ Instead of v = (vx/d, vy/d, vz/d), with d = √(v · v)
◦ Do f = 1/d, v = v ∗ f
On some CPUs there are low-precision versions of the reciprocal (1/x) and the reciprocal square root (1/√x)

Code Optimization Tricks (contd)
Conditional branches are generally expensive
◦ Avoid if-then-else if possible
◦ Sometimes branch prediction on CPUs works remarkably well
Math functions (sin, cos, tan, sqrt, exp, etc.) are expensive
◦ A rough approximation might be sufficient
◦ Can use the first few terms of the Taylor series
Inline code is good (avoids function calls)
float (32 bits) is faster than double (64 bits); less data is sent down the pipeline
Code Optimization Tricks (contd)
Compiler optimization is hard to predict: --counter vs. counter--
Use const in C and C++ to help the compiler with optimization
The following often incur overhead:
◦ Dynamic casting (C++)
◦ Virtual methods
◦ Inherited constructors
◦ Passing structs by value

Memory Optimization
Memory hierarchies (caches) in modern computers: primary and secondary caches.
A bad memory access pattern can ruin performance
Not really about using less memory, though that can help
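As a small illustration of what a good and a bad access pattern look like, the sketch below sums the same flat array in two loop orders. The function names are hypothetical, and the sketch makes no timing claim: both functions return the same value, and any speed difference depends on the array size and the machine.

```c
#include <assert.h>
#include <stddef.h>

/* Sums a rows x cols grid stored row by row in one flat array.
   The inner loop walks adjacent addresses, which suits the cache. */
long sum_row_order(const int *data, size_t rows, size_t cols)
{
    assert(data != NULL);
    long sum = 0;
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            sum += data[r * cols + c];
    return sum;
}

/* Same result, but the inner loop jumps `cols` elements per step,
   so on a large grid consecutive accesses may hit different cache lines. */
long sum_column_order(const int *data, size_t rows, size_t cols)
{
    assert(data != NULL);
    long sum = 0;
    for (size_t c = 0; c < cols; ++c)
        for (size_t r = 0; r < rows; ++r)
            sum += data[r * cols + c];
    return sum;
}
```

Timing both functions on a grid much larger than the cache is a simple way to see the effect on a particular machine.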
Memory Optimization Tricks
Sequential access: store data in order in memory:
◦ Tex Coords #0, Position #0, Tex Coords #1, Position #1, Tex Coords #2, Position #2, etc.
Cache prefetching is good, but hard to control
malloc() and free() may be slow: consider using a custom storage allocator that allocates memory to a pool at startup

Memory Optimization Tricks (contd)
Align data with the size of the cache line
◦ Example: on most Pentiums, the cache line size is 32 bytes
◦ Now, assume that it takes 30 bytes to store a vertex
◦ Padding with another 2 bytes to 32 bytes will likely perform better.
Following pointers (linked list) is expensive (if memory is allocated arbitrarily)
◦ Does not make good use of the coherence that caches usually exploit
◦ That is, the address after the one we just used is likely to be used soon
◦ Paper by Smits on ray tracing shows this.
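The padding and pool-allocator ideas above might look like this in C11. Everything here is an assumption for illustration: the 30-byte field layout of `Vertex` is invented to match the slide's numbers, the `pool_*` functions are hypothetical, and `aligned_alloc` is the C11 call (not available on every compiler).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical vertex: 30 bytes of data plus 2 bytes of explicit padding. */
typedef struct {
    float    position[3];  /* 12 bytes */
    float    normal[3];    /* 12 bytes */
    uint16_t texcoord[2];  /*  4 bytes */
    uint16_t color;        /*  2 bytes */
    uint16_t pad;          /*  2 bytes, brings the total to 32 */
} Vertex;

_Static_assert(sizeof(Vertex) == 32, "Vertex should fill one 32-byte line");

/* Fixed-capacity pool: one contiguous, 32-byte-aligned block allocated
   at startup, plus a stack of pointers to the unused slots. */
typedef struct {
    Vertex  *storage;
    Vertex **free_slots;
    size_t   free_count;
    size_t   capacity;
} VertexPool;

/* Returns 1 on success, 0 if memory could not be allocated. */
int pool_init(VertexPool *p, size_t capacity)
{
    assert(capacity > 0);
    p->storage = aligned_alloc(32, capacity * sizeof(Vertex));
    p->free_slots = malloc(capacity * sizeof *p->free_slots);
    if (!p->storage || !p->free_slots) {
        free(p->storage);
        free(p->free_slots);
        return 0;
    }
    /* Push slots in reverse so the first allocation returns slot 0. */
    for (size_t i = 0; i < capacity; ++i)
        p->free_slots[i] = &p->storage[capacity - 1 - i];
    p->free_count = capacity;
    p->capacity = capacity;
    return 1;
}

/* Pops a free slot, or returns NULL when the pool is exhausted. */
Vertex *pool_alloc(VertexPool *p)
{
    return p->free_count ? p->free_slots[--p->free_count] : NULL;
}

/* Returns a slot to the pool; no call to free() is involved. */
void pool_free(VertexPool *p, Vertex *v)
{
    assert(p->free_count < p->capacity);
    p->free_slots[p->free_count++] = v;
}

void pool_destroy(VertexPool *p)
{
    free(p->storage);
    free(p->free_slots);
}
```

Allocation and release are a pointer pop and push, and consecutive allocations from a fresh pool are adjacent in memory, which also helps sequential access.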
Geometry Stage: Optimization
Geometry stage does per-vertex ops
◦ Best way to optimize: use triangle strips!
Lighting optimization:
◦ Spot lights are expensive, point lights cheaper, directional lights cheapest
◦ Disable lighting if possible
◦ Use as few light sources as possible
◦ If you use 1/d² falloff, then if d > 10 (example), disable the light

Geometry Stage: Optimization (contd)
Normals must be normalized to get correct lighting
◦ Normalize them as a preprocess, and disable normalizing if possible
Lighting can be computed for both sides of a triangle; disable if not needed.
If light sources are static with respect to geometry, and the material is only diffuse
◦ Precompute lighting on the CPU
◦ Send only precomputed colors (not normals)
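Two of the lighting ideas above, the distance cutoff and precomputed diffuse lighting, can be sketched on the CPU side. This is a minimal sketch under stated assumptions: a single directional light, unit-length normals and light direction, and hypothetical names (`Vec3`, `bake_diffuse`, `light_is_active`).

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y, z; } Vec3;

static float dot3(Vec3 a, Vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* Precompute diffuse intensity per vertex for a static directional light.
   to_light is the unit vector pointing from the surface toward the light.
   The results can be sent as colors instead of sending normals. */
void bake_diffuse(const Vec3 *normals, size_t count,
                  Vec3 to_light, float *intensity_out)
{
    assert(normals != NULL && intensity_out != NULL);
    for (size_t i = 0; i < count; ++i) {
        float n_dot_l = dot3(normals[i], to_light);
        intensity_out[i] = n_dot_l > 0.0f ? n_dot_l : 0.0f;
    }
}

/* Distance cutoff for a light with 1/d^2 falloff: beyond `cutoff`
   the light is treated as disabled. Squared distances are compared
   so no square root is needed. */
int light_is_active(Vec3 light_pos, Vec3 object_pos, float cutoff)
{
    Vec3 d = { object_pos.x - light_pos.x,
               object_pos.y - light_pos.y,
               object_pos.z - light_pos.z };
    return dot3(d, d) <= cutoff * cutoff;
}
```

The baked intensities stay valid only while the light and geometry do not move relative to each other, which is the condition the slide states.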
Raster Stage: Optimization
Rasterizer stage does per-pixel ops
Simple optimization: turn on backface culling if possible
Turn off Z-buffering if possible:
◦ Example: after screen clear, draw large background polygon
◦ Using polygon-aligned BSP trees
Draw in front-to-back order
Try disabling features: texture filtering mode, fog, blending, multisampling

Raster Stage: Optimization (contd)
To make rasterization faster, need to rasterize fewer (or cheaper) pixels:
◦ Make the window smaller
◦ Render to a smaller texture, and then enlarge the texture onto the screen
Depth complexity is the number of times a pixel has been written to
◦ Good for understanding the behaviour of the application
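Drawing in front-to-back order usually starts with sorting the objects by their distance from the viewer. The sketch below assumes each object already carries a view-space depth; `DrawItem` and `sort_front_to_back` are hypothetical names, and the actual draw calls are left out.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical draw item: an object id and its distance from the viewer. */
typedef struct {
    int   id;
    float view_depth;
} DrawItem;

/* qsort comparator: smaller depth (nearer to the viewer) comes first. */
static int compare_depth(const void *a, const void *b)
{
    float da = ((const DrawItem *)a)->view_depth;
    float db = ((const DrawItem *)b)->view_depth;
    return (da > db) - (da < db);
}

/* Orders the items nearest first, so that with depth testing enabled
   later, farther fragments have a chance to be rejected early. */
void sort_front_to_back(DrawItem *items, size_t count)
{
    if (count < 2)
        return;
    assert(items != NULL);
    qsort(items, count, sizeof *items, compare_depth);
}
```

A coarse sort per object is normally enough here; the depth buffer still resolves visibility exactly, and the ordering only aims to reduce wasted pixel work.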
Depth Complexity
[Figure: depth complexity]

Overall Optimization: General Techniques
Reduce number of primitives, e.g. using polygon simplification algorithms
Preprocess geometry and data for the particular architecture
Turn off features not in use such as:
◦ Depth buffering, Blending, Fog, Texturing
Overall Optimization (contd)
Minimize state changes by grouping objects
◦ Example: objects with the same texture should be rendered together
If all pixels are always drawn, avoid color buffer clear
Frame buffer reads are expensive
Display lists may work faster
◦ Precompile a list of primitives for faster rendering
◦ OpenGL API supports this

Balancing the Pipeline
The bottleneck stage sets the frame rate
The other two stages will be idle for some time
Also, to sync with monitor, there might be idle time for all stages
Exploit this time to make quality of images better if possible
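Grouping objects by texture, as suggested under Overall Optimization above, can be sketched as a sort over the draw list. The `DrawCall` struct and both functions are hypothetical; a real renderer would bind the texture and issue the draw inside the loop that this sketch only counts.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical draw call: which object to draw and which texture it uses. */
typedef struct {
    int      object_id;
    unsigned texture_id;
} DrawCall;

/* Counts how many texture binds are needed when drawing in list order:
   one for the first call, plus one each time the texture changes. */
size_t count_texture_binds(const DrawCall *calls, size_t count)
{
    assert(calls != NULL || count == 0);
    size_t binds = 0;
    for (size_t i = 0; i < count; ++i)
        if (i == 0 || calls[i].texture_id != calls[i - 1].texture_id)
            ++binds;
    return binds;
}

static int compare_texture(const void *a, const void *b)
{
    unsigned ta = ((const DrawCall *)a)->texture_id;
    unsigned tb = ((const DrawCall *)b)->texture_id;
    return (ta > tb) - (ta < tb);
}

/* Reorders the list so calls sharing a texture are adjacent. */
void group_by_texture(DrawCall *calls, size_t count)
{
    if (count > 1)
        qsort(calls, count, sizeof *calls, compare_texture);
}
```

For a list that alternates between two textures over five objects, the count drops from five binds to two after grouping. Sorting by texture can conflict with front-to-back ordering, so which key wins depends on where the bottleneck is.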
Balancing the Pipeline (contd)
Increase number of triangles (affects all stages)
More lights, more expensive (geometry)
More realistic animation, more accurate collision detection (application)
More expensive texture filtering, blending, etc. (rasterizer)
If not fill-limited, increase window size
Note: there are FIFOs between stages (and at many other places too) to smooth out idleness of stages
More techniques in text.

Multiprocessing
Use this if the application is the bottleneck, and it is affordable
Two major ways: (1) Multiprocessor pipelining, (2) Parallel processing
Summary
Pipeline optimization is no substitute for good algorithms!
Do optimization as a last step.
Primarily for products that should be shipped
Most often good to use triangle strips!