Highly Scalable Java Programming for Multi-Core System

Zhi Gan (ganzhi@gmail.com)

http://ganzhi.blogspot.com

Agenda

• Software Challenges

• Profiling Tools Introduction

• Best Practices for Java Programming

• Rocket Science: Lock-Free Programming

Software challenges
• Parallelism
– More threads per system = more parallelism needed to achieve
high utilization
– Thread-to-thread affinity (shared code and/or data)

• Memory management
– Sharing of cache and memory bandwidth across more threads =
greater need for memory efficiency
– Thread-to-memory affinity (execute thread closest to associated
data)

• Storage management
– Allocate data across DRAM, Disk & Flash according to access
frequency and patterns

The First Step: Profiling Parallel Applications

Important Profiling Tools
• Java Lock Monitor (JLM)
– helps developers understand how locks are used in their applications
– similar tool: Java Lock Analyzer (JLA)
• Multi-core SDK (MSDK)
– in-depth analysis of the complete execution stack
• AIX Performance Tools
– Simple Performance Lock Analysis Tool (SPLAT)
– XProfiler
– prof, tprof and gprof

Java Lock Monitor

• %MISS : 100 * SLOW / NONREC
• GETS : Lock Entries
• NONREC : Non Recursive Gets
• SLOW : Non Recursives that Wait
• REC : Recursive Gets
• TIER2 : SMP: Total try-enter spin loop count (middle tier of 3)
• TIER3 : SMP: Total yield spin loop count (outer tier of 3)
• %UTIL : 100 * Hold-Time / Total-Time
• AVER-HTM : Hold-Time / NONREC
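
As a worked example of the derived columns, the sketch below computes %MISS, %UTIL and AVER-HTM from raw counters. Only the three formulas come from the list above; the class name, method names and sample numbers are hypothetical and are not taken from a real JLM report.

```java
// Hypothetical helper: only the formulas come from the JLM column definitions.
final class JlmMetrics {

    // %MISS = 100 * SLOW / NONREC
    static double missPercent(long slow, long nonRec) {
        return nonRec == 0 ? 0.0 : 100.0 * slow / nonRec;
    }

    // %UTIL = 100 * Hold-Time / Total-Time
    static double utilPercent(long holdTime, long totalTime) {
        return totalTime == 0 ? 0.0 : 100.0 * holdTime / totalTime;
    }

    // AVER-HTM = Hold-Time / NONREC
    static double averageHoldTime(long holdTime, long nonRec) {
        return nonRec == 0 ? 0.0 : (double) holdTime / nonRec;
    }

    public static void main(String[] args) {
        // Made-up counters for illustration only.
        long nonRec = 2000, slow = 500, holdTime = 40000, totalTime = 100000;
        System.out.println("%MISS    = " + missPercent(slow, nonRec));         // 25.0
        System.out.println("%UTIL    = " + utilPercent(holdTime, totalTime));  // 40.0
        System.out.println("AVER-HTM = " + averageHoldTime(holdTime, nonRec)); // 20.0
    }
}
```

A high %MISS together with a high %UTIL would suggest a lock that is both contended and held for a long time, which is the kind of lock the following slides try to shrink or split.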

Multi-core SDK
Dead Lock View

Synchronization View

Best Practices for Highly Scalable Java Programming

What Is Lock Contention?

From JLM tool website

Lock Operation Itself Is Expensive
• CAS (compare-and-swap) operations are predominantly used to
implement locking
• These operations can take up a large part of the execution time

Reduce Locking Scope
Before (the whole method is synchronized):

    public synchronized void foo1(int k) {
        String key = Integer.toString(k);
        String value = key + "value";
        if (null == key) {
            return;
        } else {
            maph.put(key, value);
        }
    }

After (only the update of the shared map is synchronized):

    public void foo2(int k) {
        String key = Integer.toString(k);
        String value = key + "value";
        if (null == key) {
            return;
        } else {
            synchronized (this) {
                maph.put(key, value);
            }
        }
    }

Execution time: 16106 milliseconds (foo1) vs. 12157 milliseconds (foo2), about 25% less

Results from JLM report

Reduced AVER-HTM

Lock Splitting
Before (both methods share the lock on this):

    public synchronized void addUser1(String u) {
        users.add(u);
    }

    public synchronized void addQuery1(String q) {
        queries.add(q);
    }

After (each collection is guarded by its own lock):

    public void addUser2(String u) {
        synchronized (users) {
            users.add(u);
        }
    }

    public void addQuery2(String q) {
        synchronized (queries) {
            queries.add(q);
        }
    }

Execution time: 12981 milliseconds (before) vs. 4797 milliseconds (after), about 63% less

Result from JLM report

Reduced lock tries

Lock Striping
Before (one lock for the whole array):

    public synchronized void put1(int indx, String k) {
        share[indx] = k;
    }

After (one lock per stripe of indices):

    public void put2(int indx, String k) {
        synchronized (locks[indx % N_LOCKS]) {
            share[indx] = k;
        }
    }

Execution time: 5536 milliseconds (put1) vs. 1857 milliseconds (put2), about 66% less

Result from JLM report

More locks with less AVER-HTM
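
The put2 snippet does not show how locks and N_LOCKS are set up. The following is a minimal self-contained sketch of one way to do it; the class name StripedArray, the stripe count of 16 and the get method are assumptions added for illustration and are not from the slides.

```java
// Sketch of lock striping: index i is guarded by locks[i % N_LOCKS].
final class StripedArray {
    private static final int N_LOCKS = 16;            // assumed stripe count
    private final Object[] locks = new Object[N_LOCKS];
    private final String[] share;

    StripedArray(int size) {
        share = new String[size];
        for (int i = 0; i < N_LOCKS; i++) {
            locks[i] = new Object();                   // one monitor per stripe
        }
    }

    void put(int indx, String k) {
        synchronized (locks[indx % N_LOCKS]) {
            share[indx] = k;
        }
    }

    // Readers take the same stripe lock so they see the latest write.
    String get(int indx) {
        synchronized (locks[indx % N_LOCKS]) {
            return share[indx];
        }
    }
}
```

Threads that touch indices in different stripes no longer block each other. Operations that need the whole array (for example a consistent copy) would have to acquire every stripe lock, which is the usual cost of this technique.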

Split Hot Points : Scalable Counter

– ConcurrentHashMap maintains an independent
counter for each segment of the hash map and uses
a lock for each counter
– the global count is obtained by summing all the
independent counters
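
The idea can be sketched independently of ConcurrentHashMap. The class below is a hypothetical illustration, not the JDK's code: it keeps one counter and one lock per stripe, picks the stripe from the key's hash, and sums all stripes on demand.

```java
// Hypothetical striped counter: one counter and one lock per stripe.
final class StripedCounter {
    private final long[] counts;
    private final Object[] locks;

    StripedCounter(int stripes) {
        counts = new long[stripes];
        locks = new Object[stripes];
        for (int i = 0; i < stripes; i++) {
            locks[i] = new Object();
        }
    }

    // Increment the counter of the stripe selected by the key's hash.
    void increment(Object key) {
        int i = (key.hashCode() & 0x7fffffff) % counts.length;
        synchronized (locks[i]) {
            counts[i]++;
        }
    }

    // Sum all stripes. Under concurrent updates this is not an atomic
    // snapshot, only a value that was reachable during the call.
    long sum() {
        long total = 0;
        for (int i = 0; i < counts.length; i++) {
            synchronized (locks[i]) {
                total += counts[i];
            }
        }
        return total;
    }
}
```

On Java 8 and later, java.util.concurrent.atomic.LongAdder applies a similar split-and-sum idea without explicit locks and is probably the simpler choice when only a counter is needed.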

Alternatives to Exclusive Locks
• Duplicate shared resource if possible
• Atomic variables
– e.g. counters, sequence number generators, the head
pointer of a linked list
• Concurrent container
– java.util.concurrent package, Amino lib
• Read-Write Lock
– java.util.concurrent.locks.ReadWriteLock
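
As an illustration of the read-write lock alternative, here is a minimal sketch that guards a plain HashMap with ReentrantReadWriteLock, the JDK implementation of ReadWriteLock. The class name ReadMostlyRegistry and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical read-mostly registry: many readers may hold the read lock
// at once, while a writer needs exclusive access.
final class ReadMostlyRegistry {
    private final Map<String, String> data = new HashMap<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    String get(String key) {
        lock.readLock().lock();
        try {
            return data.get(key);
        } finally {
            lock.readLock().unlock();
        }
    }

    void put(String key, String value) {
        lock.writeLock().lock();
        try {
            data.put(key, value);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

This tends to pay off only when reads clearly outnumber writes and the read section is not trivially short; otherwise the bookkeeping of the read-write lock can cost more than a plain exclusive lock.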

Example of AtomicLongArray
Before (synchronized access to a long[] d):

    public synchronized void set1(int idx, long val) {
        d[idx] = val;
    }

    public synchronized long get1(int idx) {
        long ret = d[idx];
        return ret;
    }

After (AtomicLongArray, no lock):

    import java.util.concurrent.atomic.AtomicLongArray;

    private final AtomicLongArray a;

    public void set2(int idx, long val) {
        a.set(idx, val);
    }

    public long get2(int idx) {
        long ret = a.get(idx);
        return ret;
    }

Execution time: 23550 milliseconds (before) vs. 842 milliseconds (after), about 96% less

Using Concurrent Container
• java.util.concurrent package
– since Java 1.5
– ConcurrentHashMap, ConcurrentLinkedQueue,
CopyOnWriteArrayList, etc.
• Amino Lib is another good choice
– LockFreeList, LockFreeStack, LockFreeQueue, etc
• Thread-safe container
• Optimized for common operations
• High performance and scalability for multi-core
platform
• Drawback: incomplete feature support
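
A small sketch of what using a concurrent container looks like in practice: several threads update a shared ConcurrentHashMap without any explicit lock. The class HitCounter and the key "page" are made up for illustration, and the merge call assumes Java 8 or later.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical hit counter backed by ConcurrentHashMap (merge needs Java 8+).
final class HitCounter {
    private final ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<>();

    // merge performs the read-modify-write atomically for the given key.
    void record(String key) {
        counts.merge(key, 1, Integer::sum);
    }

    int count(String key) {
        return counts.getOrDefault(key, 0);
    }

    // Starts the given number of threads, each recording perThread hits.
    static int countInParallel(int threads, int perThread) {
        HitCounter counter = new HitCounter();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    counter.record("page");
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter.count("page");
    }
}
```

Calling HitCounter.countInParallel(4, 10000) should return 40000, because every increment goes through the map's atomic merge. A plain HashMap with an unsynchronized get-then-put would lose updates in the same test.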

Using Immutable and Thread Local data
• Immutable data
– remains unchanged throughout its life cycle
– always thread-safe
• Thread Local data
– used by only a single thread
– not shared among different threads
– can replace a global waiting queue or object pool
– used in work-stealing schedulers
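
Both ideas fit in a few lines. The sketch below shows a hypothetical immutable value class and a hypothetical per-thread tally built on java.lang.ThreadLocal; all names are invented for illustration, and ThreadLocal.withInitial assumes Java 8 or later.

```java
// Immutable: all fields are final and "modification" returns a new object,
// so instances can be shared between threads without locking.
final class ImmutablePoint {
    private final int x;
    private final int y;

    ImmutablePoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    int x() { return x; }
    int y() { return y; }

    ImmutablePoint withX(int newX) {
        return new ImmutablePoint(newX, y);
    }
}

// Thread-local: every thread gets its own long[1] cell, so no lock is needed.
final class PerThreadTally {
    private static final ThreadLocal<long[]> LOCAL =
            ThreadLocal.withInitial(() -> new long[1]);

    static void hit() {
        LOCAL.get()[0]++;
    }

    static long hitsOnThisThread() {
        return LOCAL.get()[0];
    }

    // A new thread starts with its own, zeroed tally.
    static long hitsSeenByNewThread() {
        long[] seen = new long[1];
        Thread t = new Thread(() -> seen[0] = hitsOnThisThread());
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }
}
```

The per-thread tallies still have to be combined somewhere if a global total is needed, so thread-local data removes contention from the hot path rather than removing sharing entirely.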

Reduce Memory Allocation
• JVM: two levels of memory allocation
– first from a thread-local buffer
– then from the global buffer
• The thread-local buffer is exhausted quickly
if the allocation frequency is high
• The ThreadLocal class may help if a
temporary object is needed in a loop
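
One possible reading of the last bullet, as a minimal sketch: keep one reusable StringBuilder per thread instead of allocating a new one on every loop iteration. RowFormatter and its format method are hypothetical names, and ThreadLocal.withInitial assumes Java 8 or later.

```java
// Hypothetical formatter that reuses one StringBuilder per thread.
final class RowFormatter {
    private static final ThreadLocal<StringBuilder> BUFFER =
            ThreadLocal.withInitial(() -> new StringBuilder(64));

    static String format(int id, String name) {
        StringBuilder sb = BUFFER.get();
        sb.setLength(0);                 // clear what the previous call left behind
        sb.append(id).append(':').append(name);
        return sb.toString();            // the result String is still a new object
    }
}
```

Whether this is actually faster depends on the JVM and the workload, since modern JVMs allocate short-lived objects cheaply; it is worth measuring with a profiler before adopting it.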

Rocket Science: Lock-Free Programming

Using Lock-Free/Wait-Free Algorithm
• Lock-free algorithms allow concurrent updates of
shared data structures without using any
locking mechanism
– this solves some of the basic problems associated
with using locks in code
– it helps create algorithms that scale well
• Highly scalable and efficient
• Amino Lib
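
To make the idea concrete, here is a minimal sketch of a lock-free stack in the style of the well-known Treiber stack, built on AtomicReference.compareAndSet. It is an illustration only and is not the implementation of Amino's LockFreeStack; the class name is hypothetical.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of a Treiber-style lock-free stack (not Amino's implementation).
final class LockFreeStackSketch<T> {

    private static final class Node<E> {
        final E value;
        final Node<E> next;

        Node(E value, Node<E> next) {
            this.value = value;
            this.next = next;
        }
    }

    private final AtomicReference<Node<T>> head = new AtomicReference<>();

    void push(T value) {
        Node<T> oldHead;
        Node<T> newHead;
        do {
            oldHead = head.get();
            newHead = new Node<>(value, oldHead);
            // The CAS fails if another thread changed head in the meantime; retry.
        } while (!head.compareAndSet(oldHead, newHead));
    }

    // Returns null when the stack is empty.
    T pop() {
        Node<T> oldHead;
        do {
            oldHead = head.get();
            if (oldHead == null) {
                return null;
            }
        } while (!head.compareAndSet(oldHead, oldHead.next));
        return oldHead.value;
    }
}
```

No thread ever blocks while holding the structure: a thread that loses the race simply re-reads the head and tries again.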

Why Does Lock-Free Often Mean Better Scalability? (I)

Lock: all threads wait for one
Lock-free: no waiting, but only one thread can succeed;
the other threads need to retry

Why Does Lock-Free Often Mean Better Scalability? (II)

Lock: all threads wait for one
Lock-free: no waiting, but only one thread can succeed;
the other threads often need to retry
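
The "only one can succeed, the others retry" behaviour can be observed with a simple compare-and-set loop. The sketch below is hypothetical: the class CasCounter, its retry statistic and the runThreads helper are invented for illustration. AtomicLong.incrementAndGet already does this job and is used here only to count the lost races.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical counter that makes the CAS retry loop explicit.
final class CasCounter {
    private final AtomicLong value = new AtomicLong();
    private final AtomicLong retries = new AtomicLong();

    long increment() {
        while (true) {
            long current = value.get();
            if (value.compareAndSet(current, current + 1)) {
                return current + 1;          // this thread won the race
            }
            retries.incrementAndGet();       // another thread won; try again
        }
    }

    long get() { return value.get(); }

    long retries() { return retries.get(); }

    // Runs the given number of threads, each incrementing perThread times.
    static CasCounter runThreads(int threads, int perThread) {
        CasCounter counter = new CasCounter();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    counter.increment();
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter;
    }
}
```

After CasCounter.runThreads(4, 10000) the value is 40000 regardless of scheduling, while the number of retries varies from run to run. A failed attempt costs one re-read, and the thread that caused it has made progress, which is why a stalled thread cannot hold up the others the way a lock holder can.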

Performance of A Lock-Free Stack

Picture from: http://www.infoq.com/articles/scalable-java-components

References
• Amino Lib
– http://amino-cbbs.sourceforge.net/
• MSDK
– http://www.alphaworks.ibm.com/tech/msdk
• JLA
– http://www.alphaworks.ibm.com/tech/jla
