Jean-Philippe BEMPEL
WebScale
@jpbempel
Understanding
JVM GC
2 •
• GC basics
• G1
• Shenandoah
• Azul’s C4
• ZGC
• How to choose a GC algorithm?
Understanding JVM GC: Advanced!
GC Basics
4 •
Generations
5 •
• Traversing references to mark live objects
• Stopping when reaching old generation
• From GC roots (static fields, thread stack, JNI)
Marking for Minor GC
Young Old
6 •
Card table tracks old -> young references
Write barrier updates the card table on each reference assignment
X.f = Y
Card Table
Young 0 0 1
CARD_TABLE[&X >> 9] = 1
mov DWORD PTR [r10+0x6c],r8d
mov r11,r10
shr r11,0x9
mov r8d,0x2383000
mov BYTE PTR [r8+r11*1],r12b
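The card-marking barrier above can be sketched in Java. This is a simplified model, not HotSpot source: the class name, the heapBase parameter and the "1 means dirty" encoding are assumptions taken from the slide's pseudocode (the real card values and the biased table base used in the assembly may differ).

```java
// Simplified model of a card-marking write barrier (illustrative, not HotSpot code).
final class CardTableSketch {
    static final int CARD_SHIFT = 9;   // 2^9 = 512-byte cards, as in CARD_TABLE[&X >> 9]
    final long heapBase;               // hypothetical start address of the heap
    final byte[] cards;                // one byte per 512-byte card

    CardTableSketch(long heapBase, long heapSizeBytes) {
        this.heapBase = heapBase;
        this.cards = new byte[(int) (heapSizeBytes >>> CARD_SHIFT)];
    }

    // Index of the card that covers the given address.
    int cardIndex(long address) {
        return (int) ((address - heapBase) >>> CARD_SHIFT);
    }

    // Barrier executed after "X.f = Y": mark the card that contains X.
    void postWriteBarrier(long addressOfX) {
        cards[cardIndex(addressOfX)] = 1;
    }

    // A minor GC only scans old-generation cards for which this returns true.
    boolean isDirty(long address) {
        return cards[cardIndex(address)] == 1;
    }

    public static void main(String[] args) {
        CardTableSketch table = new CardTableSketch(0x10000L, 1 << 20);
        table.postWriteBarrier(0x10000L + 1300);
        System.out.println("card index: " + table.cardIndex(0x10000L + 1300));
        System.out.println("dirty: " + table.isDirty(0x10000L + 1024));
    }
}
```

The barrier is unconditional: it costs a shift and a byte store on every reference assignment, which is what the five assembly instructions show.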
G1
8 •
• Generational
• Region based
• Pause time target (soft real-time)
• -XX:MaxGCPauseMillis=n (default 200)
• Default GC since JDK9
Garbage First
9 •
Heap divided into fixed-size regions
Regions
10 •
Regions
Credit: Kirk Pepperdine
11 •
• Young collection (STW)
• Initial Mark (STW)
• Concurrent Marking
• Final Remark (STW)
• Cleanup (STW)
• Mixed collection (STW)
G1 phases
12 •
• Stop-The-World event
• Evacuates live objects to Survivor or Old regions
• Only objects in young generation are considered
Young GC
13 •
• Card table per region
• Avoid scanning the entire heap
Remembered Sets
14 •
• For each reference assignment (X.f = Y) we need to check:
• X and Y are NOT in the same region
• Y is not null
• => enqueue for Remembered Set processing
• Refinement threads process the queue
• Additional instructions are added after the assignment
Remembered Sets: Post Write Barrier
if (!isInSameRegion(X, Y)
&& Y != null)
RSEnqueue(X)
mov DWORD PTR [rbp+0x74],r10d
mov r11,rbp
mov r8,r10
shl r8,0x3
xor r8,r11
shr r8,0x14
test r8,r8
je cont
test r10d,r10d
je cont
shr r11,0x9
movabs rcx,0x2965ecc3000
add rcx,r11
cmp BYTE PTR [rcx],0x20
je cont
mov r10,QWORD PTR [r15+0x70]
mov r11,QWORD PTR [r15+0x80]
lock add DWORD PTR [rsp-0x40],0x0
cmp BYTE PTR [rcx],0x0
je cont
mov BYTE PTR [rcx],0x0
test r10,r10
jne 0x000002965edc62bc
mov rdx,r15
movabs r10,0x7ffac2febc30
call r10
jmp cont
mov QWORD PTR [r11+r10*1-0x8],rcx
add r10,0xfffffffffffffff8
mov QWORD PTR [r15+0x70],r10
15 •
• Triggered by heap occupancy: the Initiating Heap Occupancy Percent flag (IHOP, default 45%)
• Try to mark the whole object graph concurrently with the application running
• Based on Tri-color abstraction & Snapshot-At-The-Beginning algorithm
Concurrent Marking
16 •
Concurrent Marking: Tri-Color Abstraction
17 •
Concurrent Marking: Issues
• New allocations during the marking phase can be handled by:
• Automatically marking objects at allocation
• Not considering new allocations for the current cycle
• Under the tri-color abstraction, a live object is missed only if both conditions hold:
1. The mutator stores a reference to a white object into a black object.
2. All paths from any gray objects to that white object are destroyed.
http://www.memorymanagement.org/glossary/s.html#term-snapshot-at-the-beginning
18 •
Concurrent Marking: Issues
A
B
C
A.field1 = C;
B.field2 = null;
OOPS!
19 •
• 2 ways to make sure no live object is missed: SATB or incremental update
• For SATB, a pre-write barrier records the overwritten reference for marking
• The SATB barrier is only active while marking is in progress (global state)
Concurrent Marking: Resolving misses
if (SATB_WriteBarrier) {
if (X.f != null)
SATB_enqueue(X.f);
}
cmp BYTE PTR [r15+0x30],0x0
jne 0x000002965edc62e5
[...]
mov r11d,DWORD PTR [rbp+0x74]
test r11d,r11d
je 0x000002965edc6253
mov r10,QWORD PTR [r15+0x38]
mov rcx,r11
shl rcx,0x3
test r10,r10
je 0x000002965edc6318
mov r11,QWORD PTR [r15+0x48]
mov QWORD PTR [r11+r10*1-0x8],rcx
add r10,0xfffffffffffffff8
mov QWORD PTR [r15+0x38],r10
jmp 0x000002965edc6253
mov rdx,r15
movabs r10,0x7ffac2febc50
call r10
jmp 0x000002965edc6253
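The missed-object scenario (A.field1 = C; B.field2 = null) and its SATB fix can be replayed with a small simulation. This is a toy model, not G1 code: the Node class, the single gray queue and the satbEnabled switch are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy tri-color marker showing how an SATB pre-write barrier avoids a missed object.
final class SatbSketch {
    enum Color { WHITE, GRAY, BLACK }

    static final class Node {
        Node field1, field2;
        Color color = Color.WHITE;
    }

    final Deque<Node> grayQueue = new ArrayDeque<>();
    final boolean satbEnabled;   // models "marking is active and the barrier is on"

    SatbSketch(boolean satbEnabled) { this.satbEnabled = satbEnabled; }

    // White objects become gray and are queued for scanning.
    void shade(Node n) {
        if (n != null && n.color == Color.WHITE) {
            n.color = Color.GRAY;
            grayQueue.add(n);
        }
    }

    // Scan one gray object: shade its children, then blacken it.
    void markStep() {
        Node n = grayQueue.poll();
        if (n == null) return;
        shade(n.field1);
        shade(n.field2);
        n.color = Color.BLACK;
    }

    // Pre-write barrier: record the reference that is about to be overwritten.
    void storeField1(Node holder, Node value) {
        if (satbEnabled && holder.field1 != null) shade(holder.field1);
        holder.field1 = value;
    }

    void storeField2(Node holder, Node value) {
        if (satbEnabled && holder.field2 != null) shade(holder.field2);
        holder.field2 = value;
    }

    static Color colorOfCAfterMarking(boolean satbEnabled) {
        SatbSketch gc = new SatbSketch(satbEnabled);
        Node a = new Node(), b = new Node(), c = new Node();
        b.field2 = c;               // initial graph: B -> C
        gc.shade(a);                // A and B are roots
        gc.shade(b);
        gc.markStep();              // A is scanned and turns black, B is still gray
        gc.storeField1(a, c);       // white object stored into a black object
        gc.storeField2(b, null);    // last path from a gray object destroyed
        while (!gc.grayQueue.isEmpty()) gc.markStep();
        return c.color;
    }

    public static void main(String[] args) {
        System.out.println(colorOfCAfterMarking(false)); // prints WHITE: C is live but missed
        System.out.println(colorOfCAfterMarking(true));  // prints BLACK: C was recorded
    }
}
```

The barrier keeps alive everything that was reachable when marking started, which is why SATB may retain some garbage until the next cycle.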
20 •
• At the end of Marking, we have per region liveness information
• Regions are sorted by liveness (ascending)
• Regions full of garbage are collected during cleanup STW phase
• CollectionSet is built based on
• Liveness, up until thresholds (G1HeapWastePercent, G1MixedGCLiveThresholdPercent)
• Maximum number of regions (G1OldCSetRegionThresholdPercent)
CollectionSet
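The selection described above can be sketched as a sort followed by a bounded scan. This is an illustration only: the Region class and the two parameters are hypothetical stand-ins for what G1 derives from G1MixedGCLiveThresholdPercent and G1OldCSetRegionThresholdPercent, and the G1HeapWastePercent cut-off is not modelled.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative "garbage first" selection of old regions (not G1 source code).
final class CollectionSetSketch {
    static final class Region {
        final int id;
        final int livePercent;   // share of the region occupied by live objects
        Region(int id, int livePercent) {
            this.id = id;
            this.livePercent = livePercent;
        }
    }

    // Returns the ids of the chosen regions, least live first.
    static List<Integer> choose(List<Region> oldRegions, int liveThresholdPercent, int maxRegions) {
        List<Region> sorted = new ArrayList<>(oldRegions);
        sorted.sort(Comparator.comparingInt((Region r) -> r.livePercent));
        List<Integer> collectionSet = new ArrayList<>();
        for (Region r : sorted) {
            // Stop at the region cap or when regions become too expensive to evacuate.
            if (collectionSet.size() == maxRegions || r.livePercent > liveThresholdPercent) {
                break;
            }
            collectionSet.add(r.id);
        }
        return collectionSet;
    }

    public static void main(String[] args) {
        List<Region> regions = new ArrayList<>();
        regions.add(new Region(0, 90));
        regions.add(new Region(1, 10));
        regions.add(new Region(2, 50));
        regions.add(new Region(3, 30));
        System.out.println(choose(regions, 85, 2));
    }
}
```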
21 •
• Based on the CollectionSet, G1 schedules the collection of part of the old regions
• When a Young GC is triggered, the old regions to collect are piggybacked on it
• Not all old regions are collected at once, to avoid wasting time and to meet the pause goal
• Several Young GCs can be used to collect old regions (mixed event)
Mixed GC
22 •
Mixed GC
23 •
• Still falls back to FullGC (serial before JDK 10)
• Fragmentation can still happen (regions with a lot of live objects)
• Still unpredictable
FullGC
Shenandoah
25 •
• Non-generational (partial collection is still an option)
• Region based
• Use Read Barrier: Brooks pointer
• Self-Healing
• Cooperation between mutator threads & GC threads
• Only for concurrent compaction
• Mostly based on G1 but with concurrent compaction
Shenandoah GC
26 •
• Initial Marking (STW)
• Concurrent Marking
• Final Remark (STW)
• Concurrent Cleanup
• Concurrent Evacuation
• Init Update References (STW)
• Concurrent Update References
• Final Update References (STW)
• Concurrent Cleanup
Shenandoah Phases
27 •
• SATB-style (like G1)
• 2 STW pauses for Initial Mark & Final Remark
• Conditional Write Barrier
• To deal with concurrent modification of object graph
Concurrent Marking
28 •
• Same principle as G1:
• Build CollectionSet with Garbage First!
• Evacuate to new regions to release the region for reuse
• Concurrent Evacuation done with the help of:
• 1 Read Barrier: Brooks pointer
• 4 Write Barriers
• Barriers help to keep the to-space invariant:
• All Writes are made into an object in to-space
Concurrent Evacuation
29 •
• All objects have an additional forwarding pointer
• Placed before the regular object
• Dereference the forwarding pointer for each access
• Memory footprint overhead
• Throughput overhead
Brooks pointers
Header
Brooks pointer
mov r13,QWORD PTR [r12+r14*8-0x8]
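The indirection through the forwarding pointer can be sketched in Java. This is a conceptual model only: in the slide the pointer lives in a word placed before the object (the -0x8 offset in the assembly), whereas here it is an ordinary field named forwardee, a name chosen for illustration.

```java
// Conceptual model of a Brooks forwarding pointer (not Shenandoah source code).
final class BrooksPointerSketch {
    static final class Obj {
        Obj forwardee = this;   // points to the object itself until it is copied
        int value;
        Obj(int value) { this.value = value; }
    }

    // Read barrier: every access goes through the forwarding pointer.
    static int readValue(Obj ref) {
        return ref.forwardee.value;
    }

    // GC side: copy the object to to-space and publish the new location.
    static Obj evacuate(Obj from) {
        Obj to = new Obj(from.value);
        from.forwardee = to;
        return to;
    }

    public static void main(String[] args) {
        Obj o = new Obj(7);
        Obj copy = evacuate(o);
        copy.value = 8;
        // A stale from-space reference still observes the to-space value.
        System.out.println(readValue(o));
    }
}
```

The extra dereference on each access is the throughput overhead listed above, and the extra word per object is the footprint overhead.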
30 •
Concurrent Copy: GC thread
Header
Brooks pointer
Header
Brooks pointer
From-Space To-Space
GC thread
31 •
Concurrent Copy: Reader threads
Header
Brooks pointer
From-Space To-Space
Reader
thread
Reader
thread
32 •
Concurrent Copy: Writer threads
Header
Brooks pointer
Header
Brooks pointer
From-Space To-Space
Writer
thread
Writer
thread
Header
Brooks pointer
33 •
• Any write (even of a primitive) to a from-space object needs to be protected
• Exotic barriers:
• acmp (pointer comparison)
• CAS
• clone
Write Barriers
if (evacInProgress
&& inCollectionSet(obj)
&& notCopyYet(obj)) {
evacuateObject(obj)
}
test BYTE PTR [r15+0x3c0],0x2
jne 0x000000000281bcbc
[...]
mov r10d,DWORD PTR [r13+0xc]
test r10d,r10d
je 0x000000000281bc2b
mov r11,QWORD PTR [r15+0x360]
mov rcx,r10
shl rcx,0x3
test r11,r11
je 0x000000000281bd0d
[...]
mov rdx,r15
movabs r10,0x62d1f660
call r10
jmp 0x000000000281bc2b
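The write barrier pseudocode can be expanded into a small Java model in which the writer itself evacuates the object before writing. This is a sketch under assumptions: the Obj layout, the static evacInProgress flag and the use of AtomicReference for the forwarding pointer are illustrative choices, not Shenandoah's implementation.

```java
import java.util.concurrent.atomic.AtomicReference;

// Illustrative evacuating write barrier: writes always land in the to-space copy.
final class EvacuatingWriteBarrierSketch {
    static final class Obj {
        final AtomicReference<Obj> forwardee = new AtomicReference<>();
        final boolean inCollectionSet;
        volatile int value;
        Obj(int value, boolean inCollectionSet) {
            this.value = value;
            this.inCollectionSet = inCollectionSet;
            forwardee.set(this);          // not copied yet
        }
    }

    static volatile boolean evacInProgress;   // models the per-thread flag tested in the assembly

    // Returns the object that may safely be written to.
    static Obj writeBarrier(Obj obj) {
        Obj current = obj.forwardee.get();
        if (evacInProgress && obj.inCollectionSet && current == obj) {
            Obj copy = new Obj(obj.value, false);
            // Only one thread installs its copy; losers use the winner's copy.
            if (obj.forwardee.compareAndSet(obj, copy)) {
                return copy;
            }
            return obj.forwardee.get();
        }
        return current;
    }

    static void writeValue(Obj ref, int newValue) {
        writeBarrier(ref).value = newValue;
    }

    public static void main(String[] args) {
        evacInProgress = true;
        Obj o = new Obj(1, true);
        writeValue(o, 5);
        System.out.println("from-space: " + o.value + ", to-space: " + o.forwardee.get().value);
    }
}
```

The compare-and-set is what lets mutator threads and GC threads race on the same object while still agreeing on a single to-space copy.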
34 •
• Late memory release
• Only happens when all refs updated (Concurrent Cleanup phase)
• Allocations can overrun the GC
• Failure modes:
• Pacing
• Degenerated GC
• FullGC
Extreme cases
Azul’s C4
36 •
• Generational (young & old)
• Region based (pages)
• Use Read Barrier: Loaded Value Barrier
• Self-Healing
• Cooperation between mutator threads & GC threads
• Pauseless algorithm but implementation requires safepoints
• Pauses are most of the time < 1ms
Continuously Concurrent Compacting Collector
37 •
• Baker-style Barrier
• Moves objects using forwarding addresses stored on the side
• Applied at load time, not when dereferencing
• Ensure C4 invariants:
• Marked Through the current cycle
• Not relocated
• If not => Self-healing process to correct it
• Mark object
• Relocate & correct reference
• Checked on each reference load
• Benefits from JIT optimization for caching loaded value (registers)
LVB
38 •
• States of objects stored inside reference address => Colored pointers
• NMT bit
• Generation
• Checked against a global expected value during the GC cycle
• Thread local, almost always L1 cache hits
• Register
• Relocated: the x86 implementation uses a trap from VM guest/host memory translation
• Intel EPT
• AMD NPT
LVB
test r9, rax
jne 0x3001443b
mov r10d, dword ptr [rax + 8]
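The NMT check and its self-healing slow path can be modelled roughly as follows. Everything specific here is an assumption for illustration: the bit position of the NMT bit, the Holder class standing in for a reference field, and the counter. The relocation check, which the slide says is done through memory protection, is not modelled.

```java
// Rough model of a loaded value barrier on the NMT bit (not Azul's implementation).
final class LoadedValueBarrierSketch {
    static final long NMT_BIT = 1L << 62;   // hypothetical position of the NMT bit

    long expectedNmt = NMT_BIT;             // expected value for the current GC cycle
    int slowPathCount = 0;                  // how often the slow path ran

    static final class Holder {             // stands in for a field holding a reference
        long ref;
    }

    // Applied when the reference is loaded, before any dereference.
    long loadReference(Holder holder) {
        long ref = holder.ref;
        if (ref != 0 && (ref & NMT_BIT) != expectedNmt) {
            // Slow path: the real collector would mark the object here.
            ref = (ref & ~NMT_BIT) | expectedNmt;
            holder.ref = ref;               // self-healing: fix the source of the reference
            slowPathCount++;
        }
        return ref;
    }

    public static void main(String[] args) {
        LoadedValueBarrierSketch lvb = new LoadedValueBarrierSketch();
        Holder h = new Holder();
        h.ref = 0x1000L;
        lvb.loadReference(h);
        lvb.loadReference(h);
        System.out.println("slow path executions: " + lvb.slowPathCount);
    }
}
```

Because the corrected value is written back to the field, a given stale reference triggers the slow path at most once, which keeps the barrier cheap in steady state.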
39 •
Virtual Memory vs Physical Memory
Virtual Memory
Physical Memory
0 2^64
0 2^37
40 •
• All phases are fully parallel & concurrent
• No "rush" to finish phases
• No constraint on keeping STW pauses short
• Physical memory released quickly in relocation phase
• Can be reused for new allocations
• Plenty of virtual space vs physical memory
C4 Phases
41 •
• Mark
• Marking all objects in graph
• Relocation
• Moving objects to release pages
• Remap
• Fixup references in object graph
• Folded with next mark cycle
C4 Phases
42 •
• Incremental Update Marking (vs SATB)
• Single pass
• No final mark/remark
• Self-Healing: marks objects that are not yet marked for the current cycle
Mark Phase
43 •
Mark Phase: Concurrent Modification
A
B
C
A.field1 = C;
B.field2 = null;
LVB
44 •
• Scanning roots (static variables, thread stacks, registers, JNI handles)
• GC threads scan the stacks of stalled threads
• Running threads scan their own stacks, stopping individually at a safepoint
• Scanning the object graph like a parallel collector
• Newly allocated objects go into new pages, not considered for reclaim (relocation)
• For each page, live data bytes are summed and used to select pages to reclaim
Mark Phase
45 •
• Select pages with the greatest number of dead objects (garbage first!)
• Protect the selected pages from access by mutator threads
• Move objects to newly allocated pages
• Build side arrays (off-heap tables) for forwarding information
• Self-Healing: as the page is protected, the LVB triggers a trap to:
• Copy the object to the new location if not already done
• Use the forwarding pointer to fix the reference
Relocation Phase
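A toy model of relocation with a side forwarding table and self-healing follows. It is only a sketch: the page size, the fixed object size, the set of protected pages (standing in for memory protection and the resulting trap) and the long[] used as a group of reference fields are all assumptions.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of relocation through an off-heap forwarding table (not Azul's code).
final class RelocationSketch {
    static final int PAGE_SHIFT = 21;        // hypothetical 2 MB pages
    static final long OBJECT_SIZE = 16;      // hypothetical fixed object size

    final Map<Long, Long> forwarding = new HashMap<>();  // old address -> new address
    final Set<Long> protectedPages = new HashSet<>();    // pages selected for relocation
    long nextFree;                                       // bump pointer in a new page

    RelocationSketch(long firstFreeAddress) {
        this.nextFree = firstFreeAddress;
    }

    void protectPage(long pageNumber) {
        protectedPages.add(pageNumber);
    }

    // Copy the object if nobody did it yet, otherwise reuse the recorded new address.
    long relocate(long oldAddress) {
        return forwarding.computeIfAbsent(oldAddress, a -> {
            long newAddress = nextFree;
            nextFree += OBJECT_SIZE;
            return newAddress;
        });
    }

    // Barrier on load: a hit on a protected page models the trap.
    long loadReference(long[] fields, int index) {
        long ref = fields[index];
        if (protectedPages.contains(ref >>> PAGE_SHIFT)) {
            ref = relocate(ref);
            fields[index] = ref;             // self-healing: fix the reference in place
        }
        return ref;
    }

    public static void main(String[] args) {
        RelocationSketch gc = new RelocationSketch(10L << PAGE_SHIFT);
        gc.protectPage(1);
        long[] fields = { (1L << PAGE_SHIFT) + 64 };
        System.out.println(Long.toHexString(gc.loadReference(fields, 0)));
    }
}
```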
46 •
Virtual
Physical
Relocation Phase
Forwarding table
47 •
• Mutators are unlikely to stall when accessing a reference, as the pages being processed are mostly dead
• Once the object copy is done, physical memory is released (Quick Release)
• It can be immediately reused (remapped) to satisfy new allocations
• Evacuated pages stay mapped & protected to help the remap phase
• Their virtual addresses cannot be released until all references are remapped
• Not a problem as we have a huge virtual address space
Relocation Phase
48 •
• Traverse Object Graph and fixup references
• Execute LVB barrier for each object
• Self-Healing: fixup references using forward information
• As we traverse again, mark for the next phase
• Mark & Remap phases are folded!
Remap Phase
49 •
• The algorithm requires a sustained rate of remapping operations
• Linux limitations:
• TLB invalidation
• Only 4KB pages can be remapped
• Single-threaded remapping (write lock)
• A kernel module implements an API for the Zing JVM that significantly increases the remapping rate
• It also implements virtual address aliasing for addressing objects with metadata
Remap – Kernel module
50 •
• Young & Old collections are done by the same algorithm and can run concurrently
• Sizes of the generations are dynamically adjusted
• Card Marking with a write barrier (Stored Value Barrier)
• Old collection is based on young-to-old roots generated by the previous young cycle
• Young collection performs card scanning per page
• It holds back any concurrent Old collection for each page scanned
Generational
51 •
• Used by Hadoop Name Node
• 580GB Heap
• Very hard to tune with G1
• No issue so far regarding GC since production roll out (Oct 2017)
C4 @ Criteo
Z GC
53 •
• Non generational
• Region based (zPages, dynamically sized)
• Concurrent Marking, Compaction, Ref processing
• Use Colored Pointers & Read/Load Barrier
• Self-Healing
• Cooperation between mutator threads & GC threads
• Experimental in JDK 11 (-XX:+UnlockExperimentalVMOptions -XX:+UseZGC)
Z GC
mov r10,QWORD PTR [r11+0xb0]
test QWORD PTR [r15+0x20],r10
jne 0x00007f9594cc54b5
54 •
Z GC
55 •
• Initial Mark (STW)
• Concurrent Mark/Remap
• Final Mark (STW)
• Concurrent Prepare for Relocation
• Start Relocate (STW)
• Concurrent Relocate
Z GC phases:
56 •
• Store metadata in unused bits of reference address
• 42 bits for addressing (4TB)
• 4 bits for metadata
• Marked0
• Marked1
• Remapped
• Finalizable
Colored Pointers
57 •
• Colored pointers need to be unmasked for dereferencing
• Some HW supports masking (SPARC, AArch64)
• On Linux/Windows, there is an overhead if done with classical instructions
• Only one view is active at any point
• Plenty of Virtual Space
Multi-Mapping
58 •
Multi-Mapping
Virtual Memory
Physical Memory
0 2^64
0 2^37
(marked0)
001<address>
(marked1)
010<address>
(remapped)
100<address>
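The pointer layout in this diagram can be written down as bit masks. The sketch below follows the slides (42 address bits, then Marked0, Marked1, Remapped), with the position of the Finalizable bit and all method names being assumptions. With multi-mapping the three colored views of an address resolve to the same physical memory, so the address() unmasking step is what the hardware mapping makes unnecessary.

```java
// Bit-level sketch of colored pointers as laid out in the slides (not ZGC source code).
final class ColoredPointerSketch {
    static final int ADDRESS_BITS = 42;                        // 4 TB of addressable heap
    static final long ADDRESS_MASK = (1L << ADDRESS_BITS) - 1;
    static final long MARKED0 = 1L << 42;                      // 001<address>
    static final long MARKED1 = 1L << 43;                      // 010<address>
    static final long REMAPPED = 1L << 44;                     // 100<address>
    static final long FINALIZABLE = 1L << 45;                  // assumed position

    // Attach a metadata bit to a plain address.
    static long color(long address, long metadataBit) {
        return (address & ADDRESS_MASK) | metadataBit;
    }

    // Strip the metadata bits to get back the plain address.
    static long address(long coloredPointer) {
        return coloredPointer & ADDRESS_MASK;
    }

    // Load barrier test: any bit of the current "bad" mask sends us to the slow path.
    static boolean needsSlowPath(long coloredPointer, long badMask) {
        return (coloredPointer & badMask) != 0;
    }

    public static void main(String[] args) {
        long p = color(0x1234L, MARKED0);
        long badMask = MARKED1 | REMAPPED;   // while Marked0 is the good color
        System.out.println(Long.toHexString(p));
        System.out.println(needsSlowPath(p, badMask));
    }
}
```

The needsSlowPath test mirrors the "test QWORD PTR [r15+0x20],r10 / jne" pair shown earlier, where the mask is read from a thread-local slot.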
59 •
• Page sizes are multiples of 2MB
• 3 different groups
• Small: 2MB pages with object size <= 256KB
• Medium: 32MB pages with object size <= 4MB
• Large: multiples of 2MB, one object spanning the whole page
• Objects in the Large group are not meant to be relocated (too expensive)
Page Allocations
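The size classes above translate into a simple lookup. The thresholds come from the slide; the class, enum and method names are hypothetical, and rounding a large page up to the next 2MB multiple is an assumption based on "pages are multiples of 2MB".

```java
// Illustrative mapping from object size to page group (thresholds from the slide).
final class PageGroupSketch {
    enum Group { SMALL, MEDIUM, LARGE }

    static final long KB = 1024;
    static final long MB = 1024 * KB;

    static Group groupFor(long objectSizeBytes) {
        if (objectSizeBytes <= 256 * KB) return Group.SMALL;   // 2MB pages
        if (objectSizeBytes <= 4 * MB) return Group.MEDIUM;    // 32MB pages
        return Group.LARGE;                                    // N x 2MB, one object per page
    }

    // Size of a large page: the object size rounded up to a multiple of 2MB.
    static long largePageSize(long objectSizeBytes) {
        long granule = 2 * MB;
        return ((objectSizeBytes + granule - 1) / granule) * granule;
    }

    public static void main(String[] args) {
        System.out.println(groupFor(100 * KB));
        System.out.println(groupFor(1 * MB));
        System.out.println(largePageSize(5 * MB) / MB + "MB");
    }
}
```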
60 •
• Handling remapping
• C4: Memory protection + trap
• Z: mask in colored pointer
• Unmasking ref addresses
• C4: Kernel module aliasing
• Z: Multi-mapping or HW support
• Pages & Relocation
• C4:
• Pages have a fixed size matching the OS page size (memory protection)
• relocation for large objects by remapping
• Z:
• zPages are dynamic, a zPage can be 100MB large
• No relocation for large objects
Difference between C4 & Z GC
How to choose a GC algorithm
62 •
• Case 1:
• Need the maximum amount of work done in a time frame (offline job)
• Can afford a FullGC of several seconds
=> Use a throughput collector like ParallelGC or G1
• Case 2:
• Have a time constraint per unit of work (online job)
• Cannot afford a FullGC of several seconds
=> Use a low latency collector like C4, Shenandoah or Z
Throughput vs Latency
63 •
• You have to run on Windows
• Shenandoah
• Battle-tested GC (maturity)
• C4
• Shenandoah
• Minimizing any kind of JVM pauses
• C4
• Z
• You don’t want to pay for it:
• Shenandoah
• Z
Low latency GCs
References
65 •
• Java Garbage Collection distilled by Martin Thompson
• The Java GC mini book
• Oracle’s white paper on JVM memory management & GC
• What differences JVM makes by Nitsan Wakart
• Memory Management Reference
• IBM Pause-Less GC
References GC Basics
66 •
• Garbage-First Garbage Collection (2004)
• G1 One Garbage Collector to rule them all by Monica Beckwith
• Tips for Tuning The G1 GC by Monica Beckwith
• G1 Garbage Collector Details and Tuning by Simone Bordet
• Write Barriers in Garbage-First Garbage Collector by Monica Beckwith
References G1
67 •
• Shenandoah: An open-source concurrent compacting garbage collector for OpenJDK
• Shenandoah: The Garbage Collector That Could by Aleksey Shipilev
• Shenandoah GC Wiki
References Shenandoah
68 •
• The Pauseless GC algorithm (2005)
• C4: Continuously Concurrent Compacting Collector (2011)
• Azul GC in Detail by Charles Humble
• 2010 version source code
References C4
69 •
• ZGC - Low Latency GC for OpenJDK by Per Liden
• Java's new Z Garbage Collector (ZGC) is very exciting by Richard Warburton
• A first look into ZGC by Dominik Inführ
• Architectural Comparison with C4/Pauseless
References ZGC
Thank You!
@jpbempel
