Java At Speed:
Building A Better JVM
12 August 2020
Presented by Simon Ritter, Deputy CTO
Azul Systems, Inc.
Speed In The Java World
JVM Performance Graph: Ideal

JVM Performance Graph: Reality
(Chart: performance over time as bytecodes are interpreted, then compiled by the C1 JIT with profiling, then by the C2 JIT with deoptimisations, eventually reaching a steady optimised state; GC pauses appear as drops)
Big JVM Challenges
Managed runtime environment
1. The Garbage Collector
‒ Inherently non-deterministic
‒ Pause times can be big
2. Bytecodes, not machine code
‒ Adaptive compilation strategies
‒ Speed of code ‘warm-up’
What If There Was A Better JVM?
Azul Zing JVM
• Based on OpenJDK source code
• Passes all Java SE TCK/JCK tests
‒ Drop-in replacement for other JVMs
• HotSpot collectors replaced with C4
• Works in conjunction with Zing System Tools
‒ Only supported on Linux
• Falcon JIT compiler
‒ C2 replacement
• ReadyNow! warm-up elimination technology
Zing System Tools
• Enables better memory management for the JVM
• Memory freed by the JVM is returned to the kernel
• Allocation of new blocks comes from the kernel
‒ ZST knows cache status
‒ Newly allocated blocks for TLABs are ‘hot’
‒ Unlike a standard JVM
• Other clever tricks
Azul Continuous Concurrent
Compacting Collector (C4)

C4 Basics
• Generational (young and old)
‒ Uses the same GC algorithm for both generations
‒ For efficiency rather than pause containment
• Concurrent, parallel and compacting
• No stop-the-world compacting fallback
• Algorithm is mark, relocate, remap
Loaded Value Barrier
• Read barrier
‒ Tests all object references as they are loaded
• Enforces two invariants
‒ Reference is marked through
‒ Reference points to the correct object position
• Allows for concurrent marking and relocation
• Minimal performance overhead
‒ Test and jump (2 instructions)
‒ x86 architecture reduces this to one micro-op
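The two invariants can be modelled in plain Java. This is a conceptual sketch only: the class and field names below are hypothetical, and the real LVB is two machine instructions emitted by the JIT, not Java code.

```java
import java.util.HashMap;
import java.util.Map;

// Conceptual model of the Loaded Value Barrier: every reference load is
// checked, and a failing reference is "self-healed" in place so that
// subsequent loads of the same slot take the fast path.
public class LvbSketch {
    static final int EXPECTED_NMT = 1;          // current "marked-through" epoch bit

    static class Ref {
        int nmt;                                // not-marked-through state bit
        Object target;                          // where the reference points
        Ref(int nmt, Object target) { this.nmt = nmt; this.target = target; }
    }

    // Forwarding table filled in by the (simulated) relocation phase.
    static final Map<Object, Object> forwarded = new HashMap<>();

    // Fast path: a single test; slow path: restore both invariants.
    static Object loadBarrier(Ref ref) {
        if (ref.nmt != EXPECTED_NMT || forwarded.containsKey(ref.target)) {
            // Invariant 1: mark the reference through for this GC cycle.
            ref.nmt = EXPECTED_NMT;
            // Invariant 2: follow the forwarding pointer if the object moved.
            Object moved = forwarded.get(ref.target);
            if (moved != null) ref.target = moved;
        }
        return ref.target;
    }

    public static void main(String[] args) {
        Object oldCopy = "A", newCopy = "A'";
        forwarded.put(oldCopy, newCopy);        // relocation moved A to A'
        Ref slot = new Ref(0, oldCopy);         // stale, unmarked reference
        System.out.println(loadBarrier(slot));  // healed to A'
        System.out.println(slot.nmt == EXPECTED_NMT && slot.target == newCopy);
    }
}
```

Because the barrier heals the reference rather than just detecting the problem, each stale slot pays the slow path at most once per GC cycle.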
Concurrent Mark Phase
(Animated diagram: GC threads and app threads marking the object graph concurrently from the root set)
Relocation Phase
(Diagram: compaction moves objects A–E to A’–E’, recording forwarding pointers A -> A’ … E -> E’)
Quick Release
(Diagram: once the forwarding pointers A -> A’ … E -> E’ are recorded, the evacuated physical pages can be released back while the virtual mappings remain)
Remapping Phase
(Diagram: app threads and GC threads fix up stale references using the forwarding pointers A -> A’ … E -> E’)
Zing: Big Heaps, No Problem
• Scales to a 20 TB heap
‒ No degradation in pause times
• Use one big heap, rather than many small heaps
‒ Fewer JVMs means more efficiency
• Zing does not require big heaps
‒ 512 MB minimum
GC Tuning
Non-Zing GC Tuning Options
GC Tuning Used To Be Hard
java -Xmx12g -XX:MaxPermSize=64M -XX:PermSize=32M -XX:MaxNewSize=2g
-XX:NewSize=1g -XX:SurvivorRatio=128 -XX:+UseParNewGC
-XX:+UseConcMarkSweepGC -XX:MaxTenuringThreshold=0
-XX:CMSInitiatingOccupancyFraction=60 -XX:+CMSParallelRemarkEnabled
-XX:+UseCMSInitiatingOccupancyOnly -XX:ParallelGCThreads=12
-XX:LargePageSizeInBytes=256m …

java -Xms8g -Xmx8g -Xmn2g -XX:PermSize=64M -XX:MaxPermSize=256M
-XX:-OmitStackTraceInFastThrow -XX:SurvivorRatio=2
-XX:-UseAdaptiveSizePolicy -XX:+UseConcMarkSweepGC
-XX:+CMSConcurrentMTEnabled -XX:+CMSParallelRemarkEnabled
-XX:+CMSParallelSurvivorRemarkEnabled
-XX:CMSMaxAbortablePrecleanTime=10000
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSInitiatingOccupancyFraction=63 -XX:+UseParNewGC -Xnoclassgc …
GC Tuning With Zing
java -Xmx64g
java -Xmx48g
java -Xmx52g
Measuring Platform Performance
• jHiccup
• Spends most of its time asleep
‒ Minimal effect on performance
‒ Wakes every 1 ms
‒ Records the delta from the time it expected to wake up
‒ Measured effect is what would be experienced by your application
• Generates histogram log files
‒ These can be graphed for easy evaluation
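The measurement idea can be sketched in a few lines of Java. This is a simplified model, not the real jHiccup, which records its deltas into HdrHistogram log files rather than tracking a single maximum:

```java
// A thread that sleeps ~1 ms, then records how much later than expected it
// actually woke up. That lateness is the "hiccup" the platform (GC pauses,
// scheduling, swapping) would have imposed on any application thread.
public class HiccupSketch {
    public static void main(String[] args) throws InterruptedException {
        long maxHiccupNanos = 0;
        long expected = System.nanoTime() + 1_000_000;    // wake in 1 ms
        for (int i = 0; i < 200; i++) {
            Thread.sleep(1);                              // mostly asleep
            long now = System.nanoTime();
            long hiccup = Math.max(0, now - expected);    // lateness beyond 1 ms
            maxHiccupNanos = Math.max(maxHiccupNanos, hiccup);
            expected = now + 1_000_000;                   // schedule next wake-up
        }
        System.out.printf("max hiccup: %.2f ms%n", maxHiccupNanos / 1e6);
    }
}
```

Because the thread does almost no work between sleeps, anything it observes beyond the 1 ms sleep is attributable to the platform, not the measurement itself.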
Big Heap, Small Latency
(Latency comparison chart: HotSpot vs. Zing, Elasticsearch with a 128 GB heap)

Big Heap, Small Latency
(Second latency comparison chart: HotSpot vs. Zing, Elasticsearch with a 128 GB heap)

Small Heap, Small Latency
(Latency comparison chart: HotSpot vs. Zing, Hazelcast with a 1 GB heap)
Azul Falcon JIT Compiler
Adaptive Compilation Challenges
• Traditionally three options for running bytecodes
‒ Fully interpreted
‒ C1 (client compiler): fast warmup, lower optimal level
‒ C2 (server compiler): slower warmup, higher optimal level
• Application takes time from starting to optimal level of performance
Advancing Adaptive Compilation
• Replacement for C2 compiler
• Azul Falcon JIT compiler
‒ Based on latest compiler research
‒ LLVM project
• Better performance
‒ Better intrinsics
‒ More inlining
‒ Fewer compiler excludes
Simple Code Example
• Simple array summing loop
‒ A modern compiler will use vector operations for this
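The loop the slide refers to looks something like this. It is branch-free per element, so both C2 and Falcon can auto-vectorise it; Falcon's LLVM backend can use wider vector instructions (e.g. AVX2) where the CPU supports them:

```java
// A straight array sum: the easy case for auto-vectorisation.
public class SumLoop {
    static long sum(int[] a) {
        long total = 0;
        for (int i = 0; i < a.length; i++) {
            total += a[i];                  // no per-element branches
        }
        return total;
    }

    public static void main(String[] args) {
        int[] a = new int[1000];
        for (int i = 0; i < a.length; i++) a[i] = i + 1;
        System.out.println(sum(a));         // 1 + 2 + … + 1000 = 500500
    }
}
```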
More Complex Code Example
• Conditional array cell addition loop
‒ Hard for compiler to identify for vector instruction use
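The harder shape is a loop where each element is only added when a condition holds. The data-dependent branch makes it much harder for a JIT to prove the loop safe to vectorise; the slides' claim is that Falcon can still emit vector code for loops like this where C2 falls back to scalar code:

```java
// Conditional array cell addition: the per-element branch is what makes
// this loop hard to vectorise.
public class ConditionalSumLoop {
    static long sumPositive(int[] a) {
        long total = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] > 0) {                 // data-dependent branch
                total += a[i];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        int[] a = new int[1000];
        // Alternate positive and negative values.
        for (int i = 0; i < a.length; i++) a[i] = (i % 2 == 0) ? i : -i;
        System.out.println(sumPositive(a)); // 2 + 4 + … + 998 = 249500
    }
}
```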
Traditional JVM JIT
(Generated code: per-element jumps, 2 elements per iteration)

Falcon JIT
(Generated code: using AVX2 vector instructions, 32 elements per iteration; measured on a Broadwell E5-2690 v4)
ReadyNow! Warmup Elimination
• Save JVM JIT profiling information
‒ Classes loaded
‒ Classes initialised
‒ Instruction profiling data
‒ Speculative optimisation failure data
• Data can be gathered over a much longer period
‒ JVM/JIT profiles quickly
‒ Significant reduction in deoptimisations
• Able to load, initialise and compile most code before main()
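A typical ReadyNow! workflow records a profile in a training run and replays it in production. A sketch, with flag names as documented for Zing (verify against the Azul documentation for your version; `app.jar` is a placeholder):

```shell
# 1. Training run: record profiling data while the application runs warm.
java -XX:ProfileLogOut=profile.log -jar app.jar

# 2. Production runs: replay the profile so classes can be loaded,
#    initialised and compiled before (or shortly after) main().
java -XX:ProfileLogIn=profile.log -jar app.jar
```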
Effect Of ReadyNow!
(Chart: customer application performance over time)

ReadyNow! Start Up Time
(Charts: performance vs. time, without ReadyNow! and with ReadyNow!, annotated with class loading, initialising and compile time)
Falcon Pipeline
(Diagram: inside the Zing JVM, a bytecode frontend produces LLVM IR for LLVM; VM callbacks exchange queries and responses during compilation; LLVM returns the compiled methods as machine code)
Deterministic Compiler
Given identical input:
‒ Method for compilation
‒ Initial IR (method bytecodes & live profile)
‒ Queries and responses
Guarantees identical output:
‒ Produced machine code
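Determinism is what makes caching compiled code safe: if the output is a pure function of the inputs, a stashed result can be reused across runs. A toy model of that idea (all names hypothetical; the real stash persists machine code across JVM runs):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// A deterministic compiler's output can be cached under a key derived
// from all of its inputs, which is the principle behind compile stashing.
public class CompileStashSketch {
    static final Map<Integer, String> stash = new HashMap<>();
    static int compilations = 0;

    // Stand-in for the compiler: deterministic, so same inputs -> same output.
    static String compile(String bytecodes, String profile) {
        compilations++;
        return "machine-code(" + bytecodes + "," + profile + ")";
    }

    static String compileOrStash(String bytecodes, String profile) {
        int key = Objects.hash(bytecodes, profile);   // key over all inputs
        return stash.computeIfAbsent(key, k -> compile(bytecodes, profile));
    }

    public static void main(String[] args) {
        String first  = compileOrStash("iload_0; iadd", "hot");
        String second = compileOrStash("iload_0; iadd", "hot"); // stash hit
        System.out.println(first.equals(second));
        System.out.println(compilations);   // only one real compilation
    }
}
```

A non-deterministic compiler could not be cached this way, because the stashed code might not match what a fresh compilation would have produced.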
Add Compile Stashing
(Diagram: the Falcon pipeline as above, with a compile stash added so that produced machine code can be saved and reused)
Compile Stashing Effect
(Charts: performance vs. time, without and with compile stashing)
Up to 80% reduction in compile time
and 60% reduction in CPU load
Summary

The Zing JVM
• Start fast
• Go faster
• Stay fast
• Simple replacement for other JVMs
‒ No recoding necessary

Try Zing free for 30 days:
azul.com/zingtrial
Thank You.
Simon Ritter, Deputy CTO
s.ritter@azul.com
1.650.230.6600
@speakjava
