Java At Speed:
Building A Better JVM
12 August 2020
Presented by Simon Ritter, Deputy CTO
Azul Systems, Inc.
Speed In The Java World
JVM Performance Graph: Ideal

JVM Performance Graph: Reality
(Chart: performance over time as bytecodes are interpreted, then compiled by the C1 JIT with profiling, then by the C2 JIT with deoptimisations, eventually reaching a steady optimised state; GC pauses appear as drops)
Big JVM Challenges
Managed runtime environment
1. The Garbage Collector
‒ Inherently non-deterministic
‒ Pause times can be big
2. Bytecodes, not machine code
‒ Adaptive compilation strategies
‒ Speed of code ‘warm-up’
What If There Was A Better JVM?
Azul Zing JVM
• Based on OpenJDK source code
• Passes all Java SE TCK/JCK tests
‒ Drop-in replacement for other JVMs
• HotSpot collectors replaced with C4
• Works in conjunction with Zing System Tools
‒ Only supported on Linux
• Falcon JIT compiler
‒ C2 replacement
• ReadyNow! warm-up elimination technology
Zing System Tools
• Enables better memory management for the JVM
• Memory freed by the JVM is returned to the kernel
• Allocation of new blocks comes from the kernel
‒ ZST knows cache status
‒ Newly allocated blocks for TLABs are ‘hot’
‒ Unlike a standard JVM
• Other clever tricks
Azul Continuous Concurrent
Compacting Collector (C4)

C4 Basics
• Generational (young and old)
‒ Uses the same GC algorithm for both generations
‒ For efficiency rather than pause containment
• Concurrent, parallel and compacting
• No stop-the-world compacting fallback
• Algorithm is mark, relocate, remap
Loaded Value Barrier
• Read barrier
‒ Tests all object references as they are loaded
• Enforces two invariants
‒ Reference is marked through
‒ Reference points to the correct object position
• Allows for concurrent marking and relocation
• Minimal performance overhead
‒ Test and jump (2 instructions)
‒ x86 architecture reduces this to one micro-op
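The two invariants can be modelled in plain Java. This is a conceptual sketch only: the class and field names below are hypothetical, and the real LVB is two machine instructions emitted by the JIT, not Java code.

```java
import java.util.HashMap;
import java.util.Map;

// Conceptual model of the Loaded Value Barrier: every reference load is
// checked, and a failing reference is "self-healed" in place so that
// subsequent loads of the same slot take the fast path.
public class LvbSketch {
    static final int EXPECTED_NMT = 1;          // current "marked-through" epoch bit

    static class Ref {
        int nmt;                                // not-marked-through state bit
        Object target;                          // where the reference points
        Ref(int nmt, Object target) { this.nmt = nmt; this.target = target; }
    }

    // Forwarding table filled in by the (simulated) relocation phase.
    static final Map<Object, Object> forwarded = new HashMap<>();

    // Fast path: a single test; slow path: restore both invariants.
    static Object loadBarrier(Ref ref) {
        if (ref.nmt != EXPECTED_NMT || forwarded.containsKey(ref.target)) {
            // Invariant 1: mark the reference through for this GC cycle.
            ref.nmt = EXPECTED_NMT;
            // Invariant 2: follow the forwarding pointer if the object moved.
            Object moved = forwarded.get(ref.target);
            if (moved != null) ref.target = moved;
        }
        return ref.target;
    }

    public static void main(String[] args) {
        Object oldCopy = "A", newCopy = "A'";
        forwarded.put(oldCopy, newCopy);        // relocation moved A to A'
        Ref slot = new Ref(0, oldCopy);         // stale, unmarked reference
        System.out.println(loadBarrier(slot));  // healed to A'
        System.out.println(slot.nmt == EXPECTED_NMT && slot.target == newCopy);
    }
}
```

Because the barrier heals the reference rather than just detecting the problem, each stale slot pays the slow path at most once per GC cycle.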
Concurrent Mark Phase
(Animated diagram: GC threads and app threads marking the object graph concurrently from the root set)
Relocation Phase
(Diagram: compaction moves objects A–E to A’–E’, recording forwarding pointers A -> A’ … E -> E’)
Quick Release
(Diagram: once the forwarding pointers A -> A’ … E -> E’ are recorded, the evacuated physical pages can be released back while the virtual mappings remain)
Remapping Phase
(Diagram: app threads and GC threads fix up stale references using the forwarding pointers A -> A’ … E -> E’)
Zing: Big Heaps, No Problem
• Scales to a 20 TB heap
‒ No degradation in pause times
• Use one big heap, rather than many small heaps
‒ Fewer JVMs means more efficiency
• Zing does not require big heaps
‒ 512 MB minimum
GC Tuning
Non-Zing GC Tuning Options
GC Tuning Used To Be Hard
java -Xmx12g -XX:MaxPermSize=64M -XX:PermSize=32M -XX:MaxNewSize=2g
-XX:NewSize=1g -XX:SurvivorRatio=128 -XX:+UseParNewGC
-XX:+UseConcMarkSweepGC -XX:MaxTenuringThreshold=0
-XX:CMSInitiatingOccupancyFraction=60 -XX:+CMSParallelRemarkEnabled
-XX:+UseCMSInitiatingOccupancyOnly -XX:ParallelGCThreads=12
-XX:LargePageSizeInBytes=256m …

java -Xms8g -Xmx8g -Xmn2g -XX:PermSize=64M -XX:MaxPermSize=256M
-XX:-OmitStackTraceInFastThrow -XX:SurvivorRatio=2
-XX:-UseAdaptiveSizePolicy -XX:+UseConcMarkSweepGC
-XX:+CMSConcurrentMTEnabled -XX:+CMSParallelRemarkEnabled
-XX:+CMSParallelSurvivorRemarkEnabled
-XX:CMSMaxAbortablePrecleanTime=10000
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSInitiatingOccupancyFraction=63 -XX:+UseParNewGC -Xnoclassgc …
GC Tuning With Zing
java -Xmx64g
java -Xmx48g
java -Xmx52g
Measuring Platform Performance
• jHiccup
• Spends most of its time asleep
‒ Minimal effect on performance
‒ Wakes every 1 ms
‒ Records the delta from the time it expected to wake up
‒ Measured effect is what would be experienced by your application
• Generates histogram log files
‒ These can be graphed for easy evaluation
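The measurement idea can be sketched in a few lines of Java. This is a simplified model, not the real jHiccup, which records its deltas into HdrHistogram log files rather than tracking a single maximum:

```java
// A thread that sleeps ~1 ms, then records how much later than expected it
// actually woke up. That lateness is the "hiccup" the platform (GC pauses,
// scheduling, swapping) would have imposed on any application thread.
public class HiccupSketch {
    public static void main(String[] args) throws InterruptedException {
        long maxHiccupNanos = 0;
        long expected = System.nanoTime() + 1_000_000;    // wake in 1 ms
        for (int i = 0; i < 200; i++) {
            Thread.sleep(1);                              // mostly asleep
            long now = System.nanoTime();
            long hiccup = Math.max(0, now - expected);    // lateness beyond 1 ms
            maxHiccupNanos = Math.max(maxHiccupNanos, hiccup);
            expected = now + 1_000_000;                   // schedule next wake-up
        }
        System.out.printf("max hiccup: %.2f ms%n", maxHiccupNanos / 1e6);
    }
}
```

Because the thread does almost no work between sleeps, anything it observes beyond the 1 ms sleep is attributable to the platform, not the measurement itself.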
Big Heap, Small Latency
(Latency comparison chart: HotSpot vs. Zing, Elasticsearch with a 128 GB heap)

Big Heap, Small Latency
(Second latency comparison chart: HotSpot vs. Zing, Elasticsearch with a 128 GB heap)

Small Heap, Small Latency
(Latency comparison chart: HotSpot vs. Zing, Hazelcast with a 1 GB heap)
Azul Falcon JIT Compiler
Adaptive Compilation Challenges
• Traditionally three options for running bytecodes
‒ Fully interpreted
‒ C1 (client compiler): fast warmup, lower optimal level
‒ C2 (server compiler): slower warmup, higher optimal level
• Application takes time from starting to optimal level of performance
Advancing Adaptive Compilation
• Replacement for C2 compiler
• Azul Falcon JIT compiler
‒ Based on latest compiler research
‒ LLVM project
• Better performance
‒ Better intrinsics
‒ More inlining
‒ Fewer compiler excludes
Simple Code Example
• Simple array summing loop
‒ A modern compiler will use vector operations for this
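The loop the slide refers to looks something like this. It is branch-free per element, so both C2 and Falcon can auto-vectorise it; Falcon's LLVM backend can use wider vector instructions (e.g. AVX2) where the CPU supports them:

```java
// A straight array sum: the easy case for auto-vectorisation.
public class SumLoop {
    static long sum(int[] a) {
        long total = 0;
        for (int i = 0; i < a.length; i++) {
            total += a[i];                  // no per-element branches
        }
        return total;
    }

    public static void main(String[] args) {
        int[] a = new int[1000];
        for (int i = 0; i < a.length; i++) a[i] = i + 1;
        System.out.println(sum(a));         // 1 + 2 + … + 1000 = 500500
    }
}
```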
More Complex Code Example
• Conditional array cell addition loop
‒ Hard for compiler to identify for vector instruction use
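The harder shape is a loop where each element is only added when a condition holds. The data-dependent branch makes it much harder for a JIT to prove the loop safe to vectorise; the slides' claim is that Falcon can still emit vector code for loops like this where C2 falls back to scalar code:

```java
// Conditional array cell addition: the per-element branch is what makes
// this loop hard to vectorise.
public class ConditionalSumLoop {
    static long sumPositive(int[] a) {
        long total = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] > 0) {                 // data-dependent branch
                total += a[i];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        int[] a = new int[1000];
        // Alternate positive and negative values.
        for (int i = 0; i < a.length; i++) a[i] = (i % 2 == 0) ? i : -i;
        System.out.println(sumPositive(a)); // 2 + 4 + … + 998 = 249500
    }
}
```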
Traditional JVM JIT
(Generated code: per-element jumps, 2 elements per iteration)

Falcon JIT
(Generated code: using AVX2 vector instructions, 32 elements per iteration; measured on a Broadwell E5-2690 v4)
ReadyNow! Warmup Elimination
• Save JVM JIT profiling information
‒ Classes loaded
‒ Classes initialised
‒ Instruction profiling data
‒ Speculative optimisation failure data
• Data can be gathered over a much longer period
‒ JVM/JIT profiles quickly
‒ Significant reduction in deoptimisations
• Able to load, initialise and compile most code before main()
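A typical ReadyNow! workflow records a profile in a training run and replays it in production. A sketch, with flag names as documented for Zing (verify against the Azul documentation for your version; `app.jar` is a placeholder):

```shell
# 1. Training run: record profiling data while the application runs warm.
java -XX:ProfileLogOut=profile.log -jar app.jar

# 2. Production runs: replay the profile so classes can be loaded,
#    initialised and compiled before (or shortly after) main().
java -XX:ProfileLogIn=profile.log -jar app.jar
```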
Effect Of ReadyNow!
(Chart: customer application performance over time)

ReadyNow! Start Up Time
(Charts: performance vs. time, without ReadyNow! and with ReadyNow!, annotated with class loading, initialising and compile time)
Falcon Pipeline
(Diagram: inside the Zing JVM, a bytecode frontend produces LLVM IR for LLVM; VM callbacks exchange queries and responses during compilation; LLVM returns the compiled methods as machine code)
Deterministic Compiler
Given identical input:
‒ Method for compilation
‒ Initial IR (method bytecodes & live profile)
‒ Queries and responses
Guarantees identical output:
‒ Produced machine code
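Determinism is what makes caching compiled code safe: if the output is a pure function of the inputs, a stashed result can be reused across runs. A toy model of that idea (all names hypothetical; the real stash persists machine code across JVM runs):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// A deterministic compiler's output can be cached under a key derived
// from all of its inputs, which is the principle behind compile stashing.
public class CompileStashSketch {
    static final Map<Integer, String> stash = new HashMap<>();
    static int compilations = 0;

    // Stand-in for the compiler: deterministic, so same inputs -> same output.
    static String compile(String bytecodes, String profile) {
        compilations++;
        return "machine-code(" + bytecodes + "," + profile + ")";
    }

    static String compileOrStash(String bytecodes, String profile) {
        int key = Objects.hash(bytecodes, profile);   // key over all inputs
        return stash.computeIfAbsent(key, k -> compile(bytecodes, profile));
    }

    public static void main(String[] args) {
        String first  = compileOrStash("iload_0; iadd", "hot");
        String second = compileOrStash("iload_0; iadd", "hot"); // stash hit
        System.out.println(first.equals(second));
        System.out.println(compilations);   // only one real compilation
    }
}
```

A non-deterministic compiler could not be cached this way, because the stashed code might not match what a fresh compilation would have produced.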
Add Compile Stashing
(Diagram: the Falcon pipeline as above, with a compile stash added so that produced machine code can be saved and reused)
Compile Stashing Effect
(Charts: performance vs. time, without and with compile stashing)
Up to 80% reduction in compile time
and 60% reduction in CPU load
Summary

The Zing JVM
• Start fast
• Go faster
• Stay fast
• Simple replacement for other JVMs
‒ No recoding necessary

Try Zing free for 30 days:
azul.com/zingtrial
Thank You.
Simon Ritter, Deputy CTO
s.ritter@azul.com
1.650.230.6600
@speakjava
