This document discusses various optimizations for the z-buffer algorithm used in 3D graphics rendering. It covers hardware optimizations like early-z testing and double-speed z-only rendering. It also discusses software techniques like front-to-back sorting, early-z rendering passes, and deferred shading. Other topics include z-buffer compression, fast clears, z-culling, and potential future optimizations like programmable culling units. A variety of resources are provided for further reading.
Z-Buffer Review
Also called Depth Buffer
Fragment vs Pixel
Alternatives: Painter’s, Ray Casting, etc.
4.
Z-Buffer History
“Brute-force approach”
“Ridiculously expensive”
Sutherland, Sproull, and Schumacker, “A Characterization of Ten Hidden-Surface Algorithms,” 1974
5.
Z-Buffer Quiz
10 triangles cover a pixel. Rendering these in random order with a Z-buffer, what is the average number of times the pixel’s z-value is written?
See Subtle Tools Slides: erich.realtimerendering.com
6.
Z-Buffer Quiz
1st triangle writes depth
2nd triangle has a 1/2 chance of writing depth
3rd triangle has a 1/3 chance of writing depth
kth triangle writes only if it is nearer than all k-1 before it: a 1/k chance
1 + 1/2 + 1/3 + … + 1/10 = 2.9289…
See Subtle Tools Slides: erich.realtimerendering.com
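The harmonic-sum answer can be checked numerically. This is a minimal Python sketch, assuming a less-than depth test, distinct depths, and a buffer cleared to "far"; `expected_writes` and `simulate_writes` are illustrative names, not from the slides:

```python
import random
from fractions import Fraction


def expected_writes(n: int) -> Fraction:
    """Exact expected depth writes for n triangles drawn in random order.

    The k-th triangle writes only if it is nearer than the k-1 before it,
    which happens with probability 1/k, so the total is the harmonic number H_n.
    """
    return sum((Fraction(1, k) for k in range(1, n + 1)), Fraction(0))


def simulate_writes(n: int, trials: int = 50_000, seed: int = 1) -> float:
    """Monte Carlo check: draw n random depths per trial and count z-buffer writes."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        zbuf = float("inf")  # cleared to the far value
        for _ in range(n):
            z = rng.random()  # i.i.d. depths give a uniformly random draw order
            if z < zbuf:  # less-than depth test
                zbuf = z
                total += 1
    return total / trials


print(expected_writes(10))  # 7381/2520
print(simulate_writes(10))  # close to 2.93
```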
Z-Test in the Pipeline
When is the Z-Test?
[Figure: Z-Test after the Fragment Shader, or Z-Test before the Fragment Shader]
9.
Early-Z
Avoid expensive fragment shaders
Reduce bandwidth to frame buffer
Writes not reads
[Figure: Z-Test moved ahead of the Fragment Shader]
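A toy model of a single pixel makes the Early-Z saving concrete. This Python sketch assumes a less-than depth test and counts fragment-shader runs with the Z-Test placed after the shader (late) or before it (early); `shade_pixel` is a hypothetical helper:

```python
def shade_pixel(fragment_depths, early_z):
    """Return (shader_runs, depth_writes) for one pixel, using a less-than test."""
    zbuf = float("inf")  # depth buffer cleared to "far"
    shader_runs = 0
    depth_writes = 0
    for z in fragment_depths:
        passes = z < zbuf
        if early_z and not passes:
            continue  # rejected before the expensive shader runs
        shader_runs += 1  # late-Z: the shader runs for every fragment
        if passes:
            zbuf = z
            depth_writes += 1
    return shader_runs, depth_writes


front_to_back = [0.1, 0.5, 0.9]
print(shade_pixel(front_to_back, early_z=False))  # (3, 1)
print(shade_pixel(front_to_back, early_z=True))   # (1, 1)
```

Drawn back to front (`[0.9, 0.5, 0.1]`), every fragment passes when it arrives, so Early-Z saves nothing; that is why draw order matters.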
10.
Early-Z
Automatically enabled on GeForce (8?) unless¹
Fragment shader discards or writes depth
Depth writes and alpha-test² are enabled
Fine-grained as opposed to Z-Cull
ATI: “Top of the Pipe Z Reject”
[Figure: Z-Test ahead of the Fragment Shader]
¹ See NVIDIA GPU Programming Guide for exact details
² Alpha-test is deprecated in GL 3
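The enable conditions above can be read as a simple predicate. This Python sketch only paraphrases the slide's rules (real drivers check more, per the NVIDIA guide), and the `FragmentState` fields are assumptions made for illustration:

```python
from dataclasses import dataclass


@dataclass
class FragmentState:
    """Hypothetical summary of the pipeline state relevant to Early-Z."""
    shader_discards: bool = False
    shader_writes_depth: bool = False
    depth_writes: bool = True
    alpha_test: bool = False


def early_z_allowed(s: FragmentState) -> bool:
    """Paraphrase of the slide's rules, not actual driver logic."""
    # If the shader writes depth, the value to test is unknown until it runs.
    # The slide lists discard as disabling Early-Z as well.
    if s.shader_discards or s.shader_writes_depth:
        return False
    # With alpha-test and depth writes, the depth write would have to happen
    # before the alpha test decides whether the fragment survives.
    if s.depth_writes and s.alpha_test:
        return False
    return True


print(early_z_allowed(FragmentState()))                                     # True
print(early_z_allowed(FragmentState(alpha_test=True)))                      # False
print(early_z_allowed(FragmentState(alpha_test=True, depth_writes=False)))  # True
```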
11.
Front-to-Back Sorting
Utilize Early-Z for opaque objects
Old hardware still benefits from fewer z-buffer writes
CPU overhead: needs efficient sorting
Bucket Sort
Octree
Conflicts with state sorting
[Figure: objects binned by depth into buckets 0-0.25, 0.25-0.5, 0.5-0.75, 0.75-1]
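A coarse bucket sort is one cheap way to get an approximately front-to-back order. This Python sketch assumes each object carries one depth normalized to [0, 1] and uses four buckets as in the figure; grouping by render state inside each bucket is one possible compromise with state sorting, not something the slides prescribe:

```python
def coarse_front_to_back(objects, num_buckets=4):
    """objects: list of (name, depth, state) with depth normalized to [0, 1].

    Returns names in bucket order (near to far). Within a bucket, objects are
    grouped by render state instead of being sorted exactly by depth.
    """
    buckets = [[] for _ in range(num_buckets)]
    for obj in objects:
        depth = obj[1]
        i = min(int(depth * num_buckets), num_buckets - 1)  # depth 1.0 -> last bucket
        buckets[i].append(obj)
    order = []
    for bucket in buckets:
        bucket.sort(key=lambda o: o[2])  # stable sort by state key
        order.extend(o[0] for o in bucket)
    return order


scene = [("a", 0.9, "s1"), ("b", 0.1, "s2"), ("c", 0.3, "s1"),
         ("d", 0.15, "s1"), ("e", 1.0, "s2")]
print(coarse_front_to_back(scene))  # ['d', 'b', 'c', 'a', 'e']
```

Binning is linear in the number of objects; only the small per-bucket state sorts cost more.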
12.
Double Speed Z-Only
GeForce FX and later render at double speed when writing only depth or stencil
Enabled when
Color writes are disabled
Fragment shader neither discards nor writes depth
Alpha-test is disabled
See NVIDIA GPU Programming Guide for exact details
13.
Early-Z Pass
Software technique to utilize Early-Z and Double Speed Z-Only
Two passes
Render depth only. “Lay down depth”
– Double Speed Z-Only
Render with full shaders and depth writes off (depth test still on)
– Early-Z (and Z-Cull)
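The two passes can be simulated on a handful of pixels. This Python sketch assumes both passes produce bit-identical depths (required for an EQUAL test) and compares shader runs against a single pass that relies on Early-Z alone; the function names are hypothetical:

```python
def single_pass(prims, zfar=1.0):
    """Early-Z only: count shader runs when drawing prims in the given order."""
    zbuf, runs = {}, 0
    for _, frags in prims:
        for px, z in frags.items():
            if z < zbuf.get(px, zfar):  # less-than test before shading
                zbuf[px] = z
                runs += 1
    return runs


def early_z_pass(prims, zfar=1.0):
    """Depth-only pass, then a shading pass with an EQUAL test and no depth writes."""
    zbuf = {}
    # Pass 1: lay down depth (color writes off, so eligible for double-speed z-only).
    for _, frags in prims:
        for px, z in frags.items():
            if z < zbuf.get(px, zfar):
                zbuf[px] = z
    # Pass 2: full shading; only the visible fragment at each pixel passes.
    runs = 0
    for _, frags in prims:
        for px, z in frags.items():
            if z == zbuf.get(px, zfar):
                runs += 1
    return runs


# Drawn back to front: the worst case for plain Early-Z.
scene = [("far", {0: 0.9, 1: 0.9, 2: 0.9}), ("mid", {0: 0.5, 1: 0.5}), ("near", {0: 0.1})]
print(single_pass(scene), early_z_pass(scene))  # 6 3
```

In OpenGL the shading pass is commonly set up with `glDepthMask(GL_FALSE)` and `glDepthFunc(GL_EQUAL)` or `GL_LEQUAL`.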
14.
Early-Z Pass
Optimizations
Depth pass
• Coarse sort front-to-back
• Only render major occluders
Shade pass
• Sort by state
• Render non-occluders with depth writes on (their depth was not laid down)
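These optimizations fit into the same toy model. In this Python sketch (hypothetical names; using an object's nearest depth as the coarse sort key is an assumption), only occluders are drawn in the depth pass, the shade pass is ordered by state with a LEQUAL test, and only non-occluders write depth:

```python
def optimized_early_z_pass(prims, zfar=1.0):
    """prims: list of (name, state, is_occluder, {pixel: depth}).

    Returns (visible name per pixel, shader runs).
    """
    zbuf = {}
    # Depth pass: only major occluders, coarsely front-to-back by nearest depth.
    occluders = sorted((p for p in prims if p[2]), key=lambda p: min(p[3].values()))
    for _, _, _, frags in occluders:
        for px, z in frags.items():
            if z < zbuf.get(px, zfar):
                zbuf[px] = z
    # Shade pass: sorted by state, LEQUAL test; non-occluders also write depth.
    visible, runs = {}, 0
    for name, _, is_occluder, frags in sorted(prims, key=lambda p: p[1]):
        for px, z in frags.items():
            if z <= zbuf.get(px, zfar):
                runs += 1
                visible[px] = name
                if not is_occluder:
                    zbuf[px] = z  # its depth was never laid down
    return visible, runs


scene = [("wall", "s2", True, {0: 0.5, 1: 0.5}),
         ("plant", "s1", False, {0: 0.2}),
         ("crate", "s1", True, {1: 0.8})]
print(optimized_early_z_pass(scene))  # ({0: 'plant', 1: 'wall'}, 2)
```

Overlapping non-occluders may still be shaded more than once, depending on state order, but the final image stays correct.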
15.
Deferred Shading
Similar to Early-Z Pass
1st Pass: Visibility tests
2nd Pass: Shading
Different from Early-Z Pass
Geometry is only transformed once
16.
Deferred Shading
1st Pass
Render geometry into G-Buffers:
[Figure: G-Buffer images: fragment colors, normals, depth, edge weight. Images from Tabula Rasa. See Resources.]
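The two deferred passes can be sketched in a few lines. This Python example assumes a G-buffer holding only depth, albedo, and normal, and uses Lambert lighting purely as a stand-in for real shading; all names are illustrative:

```python
from dataclasses import dataclass


@dataclass
class GSample:
    depth: float
    albedo: tuple
    normal: tuple


def geometry_pass(fragments):
    """fragments: iterable of (pixel, depth, albedo, normal). Keep the nearest per pixel."""
    gbuffer = {}
    for px, depth, albedo, normal in fragments:
        cur = gbuffer.get(px)
        if cur is None or depth < cur.depth:  # ordinary z-test, no lighting yet
            gbuffer[px] = GSample(depth, albedo, normal)
    return gbuffer


def lighting_pass(gbuffer, light_dir):
    """Shade each covered pixel exactly once from G-buffer data."""
    image = {}
    for px, s in gbuffer.items():
        n_dot_l = max(0.0, sum(n * l for n, l in zip(s.normal, light_dir)))
        image[px] = tuple(c * n_dot_l for c in s.albedo)
    return image


frags = [(0, 0.7, (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)),  # hidden red fragment
         (0, 0.2, (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)),  # visible green fragment
         (1, 0.5, (1.0, 1.0, 1.0), (0.0, 1.0, 0.0))]
print(lighting_pass(geometry_pass(frags), (0.0, 0.0, 1.0)))
```

The lighting cost is one evaluation per covered pixel, no matter how many fragments were rasterized.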
Deferred Shading
Eliminates shading of fragments that fail the Z-Test
Increases video memory requirement
How does it affect bandwidth?
20.
Buffer Compression
Reduce depth buffer bandwidth
Generally does not reduce memory usage of the actual depth buffer (space for the uncompressed worst case is still allocated)
Same architecture applies to other buffers, e.g. color and stencil
21.
Buffer Compression
Tile Table: status for each n×n tile of depths, e.g. n = 8
[state, zmin, zmax]
state is either compressed, uncompressed, or cleared
[Figure: example tile with depths ranging from 0.1 to 0.8; its entry is [uncompressed, 0.1, 0.8]]
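A tile-table entry is just a state plus the tile's depth bounds. This Python sketch assumes a tile is stored as a flat list of n×n depths; `TileEntry` and `make_entry` are hypothetical names:

```python
from dataclasses import dataclass
from enum import Enum


class TileState(Enum):
    COMPRESSED = "compressed"
    UNCOMPRESSED = "uncompressed"
    CLEARED = "cleared"


@dataclass
class TileEntry:
    state: TileState
    zmin: float
    zmax: float


def make_entry(tile_depths, compressed):
    """Build the tile-table entry for one n x n tile stored as a flat list."""
    state = TileState.COMPRESSED if compressed else TileState.UNCOMPRESSED
    return TileEntry(state, min(tile_depths), max(tile_depths))


tile = [0.1, 0.5, 0.5, 0.1, 0.5, 0.8, 0.8, 0.5]  # a few sample depths
print(make_entry(tile, compressed=False))  # state UNCOMPRESSED, zmin 0.1, zmax 0.8
```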
Buffer Compression
Depth Buffer Write
Rasterizer modifies a copy of the uncompressed tile
Tile is losslessly compressed (if possible) and sent to the actual depth buffer
Update Tile Table
• zmin and zmax
• status: compressed or uncompressed
24.
Buffer Compression
Depth Buffer Read
Tile Status
• Uncompressed: Send tile
• Compressed: Decompress and send tile
• Cleared: See Fast Clear
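The write and read paths can be modeled together. This Python sketch uses a deliberately simple, hypothetical lossless scheme (a tile compresses only if it holds at most two distinct depths, stored as a value list plus per-pixel indices); real hardware schemes differ and are not described here:

```python
from enum import Enum


class TileState(Enum):
    COMPRESSED = "compressed"
    UNCOMPRESSED = "uncompressed"
    CLEARED = "cleared"


def compress(tile):
    """Hypothetical lossless scheme: succeeds only for tiles with <= 2 distinct depths."""
    values = sorted(set(tile))
    if len(values) > 2:
        return None
    return values, [values.index(z) for z in tile]


def decompress(blob):
    values, indices = blob
    return [values[i] for i in indices]


class CompressedDepthBuffer:
    def __init__(self, num_tiles, n=8, clear_value=1.0):
        self.size = n * n
        self.clear_value = clear_value
        self.memory = [None] * num_tiles  # the "actual" depth buffer
        self.table = [(TileState.CLEARED, clear_value, clear_value)] * num_tiles

    def read_tile(self, t):
        state = self.table[t][0]
        if state is TileState.CLEARED:
            return [self.clear_value] * self.size  # see Fast Clear
        if state is TileState.COMPRESSED:
            return decompress(self.memory[t])
        return list(self.memory[t])

    def write_pixels(self, t, updates):
        """updates: {index within tile: new depth}, already depth-tested."""
        tile = self.read_tile(t)  # rasterizer edits an uncompressed copy
        for i, z in updates.items():
            tile[i] = z
        blob = compress(tile)
        if blob is not None:
            self.memory[t] = blob
            state = TileState.COMPRESSED
        else:
            self.memory[t] = tile
            state = TileState.UNCOMPRESSED
        self.table[t] = (state, min(tile), max(tile))  # update the tile table


buf = CompressedDepthBuffer(num_tiles=1, n=2)
buf.write_pixels(0, {0: 0.5})  # two distinct depths -> compressed
buf.write_pixels(0, {1: 0.3})  # three distinct depths -> uncompressed
print(buf.table[0], buf.read_tile(0))
```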
25.
Buffer Compression
ATI: Fragment shaders that write depth interfere with compression
Render those objects last
Minimize far/near ratio
Improves zmin, zmax precision
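The far/near advice follows from how perspective depth is distributed. Assuming a D3D-style mapping d(z) = f/(f - n) * (1 - n/z) into [0, 1] and a fixed-point buffer with 2^b levels, the smallest resolvable eye-space step at distance z is roughly 2^-b (f - n) z^2 / (f n). A Python sketch of that first-order approximation:

```python
def depth_resolution(z, near, far, bits=24):
    """Approximate eye-space depth step at distance z (first-order estimate).

    Assumes window depth d(z) = far / (far - near) * (1 - near / z), so
    dd/dz = far * near / ((far - near) * z**2), and 2**bits depth levels.
    """
    dd_dz = far * near / ((far - near) * z * z)
    return 1.0 / (2 ** bits * dd_dz)


for near, far in [(1.0, 1000.0), (1.0, 500.0), (10.0, 1000.0)]:
    print(near, far, depth_resolution(100.0, near, far))
```

Because f/(f - n) is close to 1 whenever f is much larger than n, the step is governed mostly by z^2 / n: pushing the near plane out helps far more than pulling the far plane in.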
26.
Fast Clear
Don’t touch the depth buffer
glClear sets the state of each tile to cleared
When the rasterizer reads a cleared tile
A tile filled with GL_DEPTH_CLEAR_VALUE is sent
Depth buffer is not accessed
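A Python sketch of the fast-clear idea, assuming a per-tile "cleared" flag in the tile table and counting depth-buffer memory accesses; `TiledDepthBuffer` is a hypothetical name, and `slow_clear` stands in for clearing with a full-screen quad:

```python
class TiledDepthBuffer:
    """Counts depth-buffer memory accesses; the tile table is assumed on-chip."""

    def __init__(self, num_tiles, n=8):
        self.size = n * n
        self.memory = [[1.0] * self.size for _ in range(num_tiles)]
        self.cleared = [False] * num_tiles  # per-tile "cleared" state
        self.clear_value = 1.0
        self.memory_accesses = 0

    def fast_clear(self, value):
        """glClear-style: only the tile table changes."""
        self.clear_value = value
        self.cleared = [True] * len(self.cleared)

    def slow_clear(self, value):
        """Full-screen-quad style: every tile is written to memory."""
        for t in range(len(self.memory)):
            self.memory[t] = [value] * self.size
            self.cleared[t] = False
            self.memory_accesses += 1

    def read_tile(self, t):
        if self.cleared[t]:
            return [self.clear_value] * self.size  # synthesized, memory untouched
        self.memory_accesses += 1
        return list(self.memory[t])


fast = TiledDepthBuffer(num_tiles=100)
fast.fast_clear(1.0)
fast.read_tile(0)
slow = TiledDepthBuffer(num_tiles=100)
slow.slow_clear(1.0)
slow.read_tile(0)
print(fast.memory_accesses, slow.memory_accesses)  # 0 101
```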
27.
Fast Clear
Use glClear
Not full-screen quads
Not the skybox
No “one frame positive, one frame negative” trick
Clear stencil together with depth; they are stored in the same buffer
28.
Z-Cull
Cull blocks of fragments before shading
Coarse-grained as opposed to Early-Z
Also called Hierarchical Z
[Figure: Z-Cull ahead of the Fragment Shader; culled when the triangle’s zmin > tile’s zmax]
29.
Z-Cull
Zmax-Culling
Rasterizer fetches zmax for each tile it processes
Compute the triangle’s zmin
Culled if the triangle’s zmin > the tile’s zmax
[Figure: Z-Cull ahead of the Fragment Shader; triangle’s zmin > tile’s zmax]
30.
Z-Cull
Zmin-Culling
Support different depth tests
Avoid depth buffer reads
If the triangle is entirely in front of the tile, per-pixel depth tests are unnecessary
[Figure: Z-Cull ahead of the Fragment Shader; triangle’s zmax < tile’s zmin]
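Zmax- and zmin-culling combine into a three-way decision per tile. This Python sketch assumes a less-than depth test and uses the triangle's vertex depths as its bounds (window-space depth is linear across a triangle, so its extremes lie at vertices, which makes the bounds conservative for any tile); the names are hypothetical:

```python
from enum import Enum


class ZCull(Enum):
    CULL = "every fragment in the tile fails"
    ACCEPT = "every fragment passes; no depth reads needed"
    TEST = "needs per-pixel depth tests"


def classify(tri_vertex_depths, tile_zmin, tile_zmax):
    """Conservative per-tile decision for a less-than depth test."""
    tri_zmin = min(tri_vertex_depths)
    tri_zmax = max(tri_vertex_depths)
    if tri_zmin > tile_zmax:  # zmax-culling: entirely behind the tile
        return ZCull.CULL
    if tri_zmax < tile_zmin:  # zmin-culling: entirely in front of the tile
        return ZCull.ACCEPT
    return ZCull.TEST


print(classify([0.7, 0.8, 0.9], tile_zmin=0.2, tile_zmax=0.6))    # ZCull.CULL
print(classify([0.05, 0.1, 0.15], tile_zmin=0.2, tile_zmax=0.6))  # ZCull.ACCEPT
print(classify([0.3, 0.5, 0.9], tile_zmin=0.2, tile_zmax=0.6))    # ZCull.TEST
```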
31.
Z-Cull
Automatically enabled on GeForce (6?) cards unless
glClear isn’t used
Fragment shader writes depth (or discards?)
Direction of depth test is changed. Why?
ATI: avoid = and != depth compares on old cards
ATI: avoid stencil fail and stencil depth fail operations
Less efficient when depth varies a lot within a few pixels
See NVIDIA GPU Programming Guide for exact details
32.
ATI HyperZ
HyperZ =
Early Z +
Z Compression +
Fast Z clear +
Hierarchical Z
See ATI's Depth-in-depth
33.
Programmable Culling Unit
Cull before the fragment shader even if the shader writes depth or discards
Run part of the shader over an entire tile to determine a lower-bound z value
Hasselgren and Akenine-Möller, “PCU: The Programmable Culling Unit,” 2007
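One way to get a conservative lower bound for a whole tile is interval arithmetic. This Python sketch is an illustration under stated assumptions, not the PCU's actual instruction set: a hypothetical shader pushes depth back by `height * scale`, and evaluating that once with intervals over the tile shows whether every fragment would land behind the tile's zmax:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, k):
        """Multiply by a scalar of either sign."""
        a, b = self.lo * k, self.hi * k
        return Interval(min(a, b), max(a, b))


def displaced_depth(z, height, scale):
    """Hypothetical shader snippet: depth pushed back by height * scale."""
    return z + height * scale


def can_cull_tile(tile_z, height_range, scale, tile_zmax):
    """Run the snippet once over the whole tile with intervals (less-than test)."""
    bound = displaced_depth(tile_z, height_range, scale)
    return bound.lo > tile_zmax  # even the nearest possible depth is hidden


tile_z = Interval(0.4, 0.5)   # interpolated depth range over the tile
heights = Interval(0.0, 1.0)  # assumed known range of the height data
print(can_cull_tile(tile_z, heights, 0.1, tile_zmax=0.35))  # True
print(can_cull_tile(tile_z, heights, 0.1, tile_zmax=0.45))  # False
```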
34.
Summary
What was once “ridiculously expensive” is now the primary visible-surface algorithm for rasterization
#12 Other software techniques include:
Disable depth buffering when it is not needed, e.g. an alpha-blended HUD
If using multiple depth buffers, allocate the most render-intensive one first
#24 RADEON 9500/9700 can achieve up to a 24:1 compression rate in extreme cases
#31 ATI calls Z-Cull “Hierarchical Z” and NVIDIA calls it “Lightspeed Memory Architecture.”