Z-Buffer Optimizations
Patrick Cozzi
Analytical Graphics, Inc.
Overview
 Z-Buffer Review
 Hardware: Early-Z
 Software: Front-to-Back Sorting
 Hardware: Double-Speed Z-Only
 Software: Early-Z Pass
 Software: Deferred Shading
 Hardware: Buffer Compression
 Hardware: Fast Clear
 Hardware: Z-Cull
 Future: Programmable Culling Unit
Z-Buffer Review
 Also called Depth Buffer
 Fragment vs Pixel
 Alternatives: Painter’s, Ray Casting, etc
Z-Buffer History
 “Brute-force approach”
 “Ridiculously expensive”
 Sutherland, Sproull, and Schumacker, “A Characterization of Ten Hidden-Surface Algorithms,” 1974
Z-Buffer Quiz
 10 triangles cover a pixel. Rendering
these in random order with a Z-buffer,
what is the average number of times
the pixel’s z-value is written?
See Subtle Tools Slides: erich.realtimerendering.com
Z-Buffer Quiz
 1st triangle writes depth
 2nd triangle has 1/2 chance of writing depth
 3rd triangle has 1/3 chance of writing depth
 1 + 1/2 + 1/3 + … + 1/10 = 2.9289…
Z-Buffer Quiz
Harmonic Series
# Triangles    # Depth Writes
1              1
4              2.08
11             3.02
31             4.03
83             5
12,367         10
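The rows of this table are the smallest triangle counts whose harmonic number reaches each whole number of depth writes. A short sketch that recomputes them, with `triangles_needed` being a name chosen here for illustration:

```python
def triangles_needed(target_writes):
    """Smallest n whose harmonic number H_n reaches target_writes."""
    harmonic = 0.0
    n = 0
    while harmonic < target_writes:
        n += 1
        harmonic += 1.0 / n
    return n, harmonic


for target in (1, 2, 3, 4, 5, 10):
    n, harmonic = triangles_needed(target)
    print(f"{n:>6} triangles -> {harmonic:.2f} average depth writes")
```

The slow growth is the point: the average number of depth writes grows only logarithmically with depth complexity.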
Z-Test in the Pipeline
 When is the Z-Test?
[Diagram: Fragment Shader → Z-Test, or Z-Test → Fragment Shader]
Early-Z
 Avoid expensive fragment shaders
 Reduce bandwidth to frame buffer
Writes not reads
[Diagram: Z-Test before Fragment Shader]
Early-Z
 Automatically enabled on GeForce (8?) unless [1]
Fragment shader discards or writes depth
Depth writes and alpha-test [2] are enabled
 Fine-grained as opposed to Z-Cull
 ATI: “Top of the Pipe Z Reject”
[1] See NVIDIA GPU Programming Guide for exact details
[2] Alpha-test is deprecated in GL 3
Front-to-Back Sorting
 Utilize Early-Z for opaque objects
 Old hardware still has fewer z-buffer writes
 CPU overhead. Need efficient sorting
Bucket Sort
Octree
 Conflicts with state sorting
[Diagram: bucket sort by depth. Buckets: 0 – 0.25, 0.25 – 0.5, 0.5 – 0.75, 0.75 – 1; values shown beneath: 0, 1, 1, 2]
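A coarse bucket sort is often enough to get most of the Early-Z benefit without an exact sort. The sketch below is an illustration only: it assumes each object has a single normalized depth in [0, 1], models one pixel covered by every object, and uses hypothetical object names.

```python
def bucket_sort_front_to_back(objects, num_buckets=4):
    """Coarsely order (name, depth) pairs near to far; depth is in [0, 1]."""
    buckets = [[] for _ in range(num_buckets)]
    for name, depth in objects:
        index = min(int(depth * num_buckets), num_buckets - 1)
        buckets[index].append((name, depth))
    # Objects inside one bucket keep their submission order (no exact sort).
    return [obj for bucket in buckets for obj in bucket]


def shaded_fragments(draw_order):
    """Fragment shader runs at one pixel covered by every object, with Early-Z."""
    z_buffer = 1.0
    shaded = 0
    for _name, depth in draw_order:
        if depth < z_buffer:  # Early-Z: only passing fragments are shaded
            z_buffer = depth
            shaded += 1
    return shaded


scene = [("wall", 0.9), ("crate", 0.6), ("tree", 0.8), ("player", 0.3)]
print(shaded_fragments(scene))                             # submission order
print(shaded_fragments(bucket_sort_front_to_back(scene)))  # coarse front-to-back
```

Bucketing is linear in the number of objects, which keeps the CPU overhead low, at the cost of leaving objects within a bucket unsorted.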
Double Speed Z-Only
 GeForce FX and later render at double
speed when writing only depth or stencil
 Enabled when
Color writes are disabled
Fragment shader does not discard or write depth
Alpha-test is disabled
See NVIDIA GPU Programming Guide for exact details
Early-Z Pass
 Software technique to utilize Early-Z
and Double Speed Z-Only
 Two passes
Render depth only. “Lay down depth”
– Double Speed Z-Only
Render with full shaders and no depth writes
– Early-Z (and Z-Cull)
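The two passes can be modeled in software to see why the second pass shades each pixel only once. This is a simplified sketch, assuming distinct depths per pixel, a "less than" test in the depth pass, and a "less than or equal" test with depth writes off in the shade pass. Fragments are hypothetical `(pixel, depth)` pairs.

```python
def single_pass(fragments, width):
    """Baseline: one pass with Early-Z, shading every fragment that passes."""
    z_buffer = [1.0] * width
    shader_runs = 0
    for pixel, depth in fragments:
        if depth < z_buffer[pixel]:
            z_buffer[pixel] = depth
            shader_runs += 1
    return shader_runs


def early_z_pass(fragments, width):
    """Two passes: lay down depth first, then shade only visible fragments."""
    z_buffer = [1.0] * width

    # Pass 1: depth only. No shading, depth writes on, "less than" test.
    for pixel, depth in fragments:
        if depth < z_buffer[pixel]:
            z_buffer[pixel] = depth

    # Pass 2: full shaders, depth writes off, "less than or equal" test.
    shader_runs = 0
    for pixel, depth in fragments:
        if depth <= z_buffer[pixel]:
            shader_runs += 1  # only the nearest fragment reaches the shader
    return shader_runs


# Back-to-front submission order, the worst case for a single pass.
fragments = [(0, 0.9), (0, 0.5), (0, 0.2), (1, 0.7), (1, 0.4)]
print(single_pass(fragments, width=2))
print(early_z_pass(fragments, width=2))
```

The trade-off is that the geometry is submitted and transformed twice.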
Early-Z Pass
 Optimizations
Depth pass
• Coarse sort front-to-back
• Only render major occluders
Shade pass
• Sort by state
• Write depth for non-occluders
Deferred Shading
 Similar to Early-Z Pass
1st Pass: Visibility tests
2nd Pass: Shading
 Different than Early-Z Pass
Geometry is only transformed once
Deferred Shading
 1st Pass
Render geometry into G-Buffers: Fragment Colors, Normals, Depth, Edge Weight
Images from Tabula Rasa. See Resources.
Deferred Shading
 2nd Pass
Shading == post processing effects
Render full screen quads that read
from G-Buffers
Objects are no longer needed
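A toy software version of the two passes may make the data flow clearer. This is only a sketch under simplifying assumptions: the G-buffer stores depth, color, and normal per pixel (no edge weight), lighting is a single directional Lambert term, and all names and values are hypothetical.

```python
def geometry_pass(fragments, num_pixels):
    """1st pass: keep the nearest fragment's attributes per pixel."""
    g_buffer = [None] * num_pixels  # one (depth, color, normal) per pixel
    for pixel, depth, color, normal in fragments:
        if g_buffer[pixel] is None or depth < g_buffer[pixel][0]:
            g_buffer[pixel] = (depth, color, normal)
    return g_buffer


def shading_pass(g_buffer, light_dir):
    """2nd pass: shade each pixel once, reading only the G-buffer."""
    image = []
    for texel in g_buffer:
        if texel is None:
            image.append((0.0, 0.0, 0.0))  # background
            continue
        _depth, color, normal = texel
        n_dot_l = max(0.0, sum(n * l for n, l in zip(normal, light_dir)))
        image.append(tuple(c * n_dot_l for c in color))
    return image


fragments = [
    (0, 0.8, (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)),  # hidden red fragment
    (0, 0.3, (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)),  # visible green fragment
    (1, 0.5, (0.0, 0.0, 1.0), (0.0, 1.0, 0.0)),  # faces away from the light
]
g_buffer = geometry_pass(fragments, num_pixels=3)
image = shading_pass(g_buffer, light_dir=(0.0, 0.0, 1.0))
print(image)
```

Note that `shading_pass` never sees the original fragments, which mirrors the point that objects are no longer needed in the second pass.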
Deferred Shading
 Light Accumulation Result
Image from Tabula Rasa. See Resources.
Deferred Shading
 Eliminates shading fragments that fail
Z-Test
 Increases video memory requirement
 How does it affect bandwidth?
Buffer Compression
 Reduce depth buffer bandwidth
 Generally does not reduce memory
usage of actual depth buffer
 Same architecture applies to other
buffers, e.g. color and stencil
Buffer Compression
 Tile Table: Status for nxn tile of
depths, e.g. n=8
[state, zmin, zmax]
state is either compressed,
uncompressed, or cleared
[Figure: example tile of depths between 0.1 and 0.8, with tile table entry [uncompressed, 0.1, 0.8]]
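Computing a tile table entry is a min/max reduction over the tile. A minimal sketch follows; the 4x4 depths are made-up values chosen to give the same entry as the example, not a reconstruction of the figure.

```python
def tile_table_entry(tile, state="uncompressed"):
    """Return [state, zmin, zmax] for an n x n tile of depths."""
    depths = [z for row in tile for z in row]
    return [state, min(depths), max(depths)]


# Hypothetical 4x4 tile (n = 4) for illustration.
tile = [
    [0.1, 0.5, 0.5, 0.8],
    [0.1, 0.5, 0.8, 0.8],
    [0.5, 0.5, 0.8, 0.8],
    [0.5, 0.5, 0.5, 0.8],
]
print(tile_table_entry(tile))
```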
Buffer Compression
[Diagram: Rasterizer, Tile Table ([zmin, zmax]), Decompress and Compress units, and Compressed Z-Buffer. Labeled flows: nxn uncompressed z values, updated z-values, updated z-max]
Buffer Compression
 Depth Buffer Write
Rasterizer modifies copy of uncompressed
tile
Tile is losslessly compressed (if possible)
and sent to actual depth buffer
Update Tile Table
• zmin and zmax
• status: compressed or uncompressed
Buffer Compression
 Depth Buffer Read
Tile Status
• Uncompressed: Send tile
• Compressed: Decompress and send tile
• Cleared: See Fast Clear
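The write and read paths can be sketched as a small state machine per tile. This is a toy model, not how any particular GPU works: the "compression" here only handles tiles whose depths are all equal, and the class and attribute names are invented for the example.

```python
class DepthTile:
    """One tile plus its tile-table entry; a toy model, not real hardware."""

    def __init__(self, n, clear_value=1.0):
        self.n = n
        self.clear_value = clear_value
        self.state = "cleared"
        self.zmin = self.zmax = clear_value
        self.storage = None  # stands in for the tile's depth-buffer memory

    def write(self, depths):
        """Depth buffer write: compress if possible, then update the table."""
        flat = [z for row in depths for z in row]
        self.zmin, self.zmax = min(flat), max(flat)
        if self.zmin == self.zmax:
            # Toy lossless scheme: a constant tile is stored as one value.
            self.state, self.storage = "compressed", self.zmin
        else:
            self.state, self.storage = "uncompressed", [row[:] for row in depths]

    def read(self):
        """Depth buffer read: dispatch on the tile status."""
        if self.state == "uncompressed":
            return [row[:] for row in self.storage]
        if self.state == "compressed":
            return [[self.storage] * self.n for _ in range(self.n)]
        return [[self.clear_value] * self.n for _ in range(self.n)]  # cleared


tile = DepthTile(n=2)
tile.write([[0.5, 0.5], [0.5, 0.5]])
print(tile.state, tile.read())
tile.write([[0.1, 0.5], [0.5, 0.8]])
print(tile.state, tile.zmin, tile.zmax)
```

Because the uncompressed case still needs full storage, a scheme like this saves bandwidth but not memory.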
Buffer Compression
 ATI: Writing depth interferes with
compression
Render those objects last
 Minimize far/near ratio
Improves zmin, zmax precision
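To see why the far/near ratio matters, it helps to look at how eye-space distance maps to stored depth. The sketch below assumes a standard perspective projection with the default [0, 1] depth range; the plane distances are arbitrary example values.

```python
def window_depth(z_eye, near, far):
    """Stored depth in [0, 1] for a point at distance z_eye from the camera."""
    return (1.0 / near - 1.0 / z_eye) / (1.0 / near - 1.0 / far)


far = 1000.0
for near in (0.1, 1.0):
    depth = window_depth(10.0, near, far)
    print(f"near={near}: a point 10 units away has depth {depth:.4f}")
```

With the larger ratio, almost the whole depth range is spent on the first few units in front of the camera, so distant surfaces end up with nearly identical depths and per-tile zmin/zmax values separate them less well.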
Fast Clear
 Don’t touch depth buffer
 glClear sets state of each tile to
cleared
 When the rasterizer reads a cleared tile
A tile filled with
GL_DEPTH_CLEAR_VALUE is sent
Depth buffer is not accessed
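The saving can be illustrated by counting how many depth values each kind of clear touches. This is a toy model with invented names; `slow_clear` stands for any clear that writes every depth value, such as drawing geometry over the whole screen.

```python
class TiledDepthBuffer:
    """Toy depth buffer that counts how many depth values it touches."""

    def __init__(self, num_tiles, tile_size, clear_value=1.0):
        self.tile_size = tile_size
        self.clear_value = clear_value
        self.states = ["uncompressed"] * num_tiles
        self.memory = [[0.0] * (tile_size * tile_size) for _ in range(num_tiles)]
        self.memory_accesses = 0

    def slow_clear(self):
        """Clear by writing every depth value in memory."""
        for tile in self.memory:
            for i in range(len(tile)):
                tile[i] = self.clear_value
                self.memory_accesses += 1

    def fast_clear(self):
        """Clear by flipping each tile's state; depth memory is untouched."""
        self.states = ["cleared"] * len(self.states)

    def read_tile(self, index):
        if self.states[index] == "cleared":
            # Synthesize the tile from the clear value, no memory access.
            return [self.clear_value] * (self.tile_size * self.tile_size)
        self.memory_accesses += self.tile_size * self.tile_size
        return list(self.memory[index])


slow = TiledDepthBuffer(num_tiles=4, tile_size=8)
slow.slow_clear()
fast = TiledDepthBuffer(num_tiles=4, tile_size=8)
fast.fast_clear()
fast.read_tile(0)
print(slow.memory_accesses, fast.memory_accesses)
```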
Fast Clear
 Use glClear
Not full screen quads
Not the skybox
No “one frame positive, one frame negative” trick
 Clear stencil together with depth –
they are stored in the same buffer
Z-Cull
 Cull blocks of fragments before
shading
 Coarse-grained as opposed to Early-Z
 Also called Hierarchical Z
[Diagram: Z-Cull runs before the Fragment Shader; a triangle is culled when its zmin > tile’s zmax]
Z-Cull
 Zmax-Culling
Rasterizer fetches zmax for each tile it
processes
Compute the triangle’s zmin
Culled if triangle’s zmin > tile’s zmax
Z-Cull
 Zmin-Culling
Support different depth tests
Avoid depth buffer reads
If triangle is in front of tile, per-pixel depth tests are unnecessary
[Diagram: Z-Cull runs before the Fragment Shader; a triangle is in front of the tile when its zmax < tile’s zmin]
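Both tests amount to comparing two depth ranges. A minimal sketch, assuming a "less than" depth test and that the triangle's depth range over the tile is already known (the function name and return labels are illustrative):

```python
def z_cull(tri_zmin, tri_zmax, tile_zmin, tile_zmax):
    """Classify a triangle against one tile for a "less than" depth test."""
    if tri_zmin > tile_zmax:
        return "culled"     # zmax-culling: every fragment would fail
    if tri_zmax < tile_zmin:
        return "accepted"   # zmin-culling: every fragment would pass
    return "per-pixel"      # ranges overlap: fall back to per-pixel tests


print(z_cull(0.9, 0.95, tile_zmin=0.1, tile_zmax=0.8))
print(z_cull(0.02, 0.05, tile_zmin=0.1, tile_zmax=0.8))
print(z_cull(0.3, 0.6, tile_zmin=0.1, tile_zmax=0.8))
```

Only the third case needs to read the tile's actual depth values.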
Z-Cull
 Automatically enabled on GeForce (6?) cards unless
 glClear isn’t used
 Fragment shader writes depth (or discards?)
 Direction of depth test is changed. Why?
 ATI: avoid = and != depth compares on old cards
 ATI: avoid stencil fail and stencil depth fail
operations
 Less efficient when depth varies a lot within a few
pixels
See NVIDIA GPU Programming Guide for exact details
ATI HyperZ
 HyperZ =
Early Z +
Z Compression +
Fast Z clear +
Hierarchical Z
See ATI's Depth-in-depth
Programmable Culling Unit
 Cull before fragment shader even if
the shader writes depth or discards
 Run part of shader over an entire tile
to determine lower bound z value
 Hasselgren and Akenine-Möller,
“PCU: The Programmable Culling
Unit,” 2007
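One way to picture a lower bound computed over an entire tile is interval arithmetic: run the depth-affecting part of the shader on ranges instead of single values. The sketch below is an illustration under that assumption, with a hypothetical shader that offsets depth by a bump value; it is not taken from the paper.

```python
def interval_add(a, b):
    """Sum of two intervals given as (low, high) pairs."""
    return (a[0] + b[0], a[1] + b[1])


def interval_mul(a, b):
    """Product of two intervals given as (low, high) pairs."""
    products = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(products), max(products))


def shader_depth(base_z, bump):
    """Hypothetical fragment shader: writes depth = base_z + 0.1 * bump."""
    return base_z + 0.1 * bump


def cull_program(base_z_range, bump_range, tile_zmax):
    """Same arithmetic on tile-wide ranges; cull only if provably hidden."""
    depth_range = interval_add(base_z_range, interval_mul((0.1, 0.1), bump_range))
    lower_bound = depth_range[0]
    return lower_bound > tile_zmax


# Over this tile, base depth is in [0.7, 0.8] and bump is in [-1, 1].
print(cull_program((0.7, 0.8), (-1.0, 1.0), tile_zmax=0.5))
print(cull_program((0.4, 0.8), (-1.0, 1.0), tile_zmax=0.5))
```

The bound is conservative: a tile is culled only when even the nearest possible written depth is behind everything already in the tile.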
Summary
 What was once “ridiculously
expensive” is now the primary visible
surface algorithm for rasterization
Resources
www.realtimerendering.com
Sections 7.9.2 and 18.3
Resources
developer.nvidia.com/object/gpu_programming_guide.html
GeForce 8 Guide: sections 3.4.9, 3.6, and 4.8
GeForce 7 Guide: section 3.6
Resources
http://developer.amd.com/media/gpu_assets/Depth_in-depth.pdf
Depth In-depth
Resources
http://www.graphicshardware.org/previous/www_2000/presentations/ATIHot3D.pdf
ATI Radeon HyperZ Technology
Steve Morein
Resources
http://ati.amd.com/developer/dx9/ATI-DX9_Optimization.pdf
Performance Optimization Techniques for ATI
Graphics Hardware with DirectX® 9.0
Guennadi Riguer
Sections 6.5 and 8
Resources
developer.nvidia.com/object/gpu_gems_home.html
Chapter 28: Graphics Pipeline Performance
Resources
developer.nvidia.com/object/gpu-gems-3.html
Chapter 19: Deferred Shading in Tabula Rasa


Editor's Notes

  • #12 Other software techniques include: disable depth buffering when it is not needed, e.g. for an alpha-blended HUD; if using multiple depth buffers, allocate the most render-intensive one first.
  • #24 RADEON 9500/9700 can achieve up to 24:1 compression rate in extreme cases
  • #31 ATI calls Z-Cull “Hierarchical Z” and NVIDIA calls it “Lightspeed Memory Architecture.”