Alexey Ragozin: java

Thursday, September 21, 2023

Curse of the JMX

JMX stands for Java Management Extensions. It was introduced as part of Java Enterprise Edition (JEE) and later became an integral part of the JVM.

The JVM exposes a wealth of information useful for diagnostic tooling through the JMX interface.

Many popular tools such as VisualVM and Mission Control are heavily based on JMX. Even Java Flight Recorder is exposed for remote connections via JMX.

Middleware and libraries also exploit JMX to expose custom MBeans with helpful information.

So if you are in the business of JVM monitoring or diagnostic tooling, you cannot avoid dealing with JMX.

JMX is a remote access protocol. It uses TCP sockets and requires some upfront configuration for the JVM to start listening for network connections (though tools such as VisualVM can enable JMX at runtime, provided they have access to the JVM process).

You can find details about JMX agent configuration in the official documentation, but below is a minimal configuration (add the snippet below to the JVM start command).

-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=5555

The JVM will start listening on port 5555. You will be able to use this port in VisualVM and other tools.

The configuration above is minimal: access control and TLS encryption are disabled. Consult the documentation mentioned above to add security (which is typically required in a real environment).
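To make the handshake discussed below concrete, here is a minimal, self-contained sketch of what a JMX client does with such an endpoint. Instead of a separately launched JVM, it stands up an RMI registry and a JMX connector server in-process (mimicking the effect of the flags above), then connects the way a tool would; the port matches the example, everything else is illustrative.

```java
import java.lang.management.ManagementFactory;
import java.rmi.registry.LocateRegistry;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXConnectorServer;
import javax.management.remote.JMXConnectorServerFactory;
import javax.management.remote.JMXServiceURL;

public class JmxClientDemo {
    public static void main(String[] args) throws Exception {
        // Server side: roughly what -Dcom.sun.management.jmxremote.* sets up
        // for a real JVM, reproduced in-process so the example is self-contained
        LocateRegistry.createRegistry(5555);
        MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:5555/jmxrmi");
        JMXConnectorServer server =
                JMXConnectorServerFactory.newJMXConnectorServer(url, null, mbs);
        server.start();

        // Client side: this is essentially what VisualVM does when you
        // point it at host:port
        try (JMXConnector c = JMXConnectorFactory.connect(url)) {
            Object uptime = c.getMBeanServerConnection().getAttribute(
                    new ObjectName("java.lang:type=Runtime"), "Uptime");
            System.out.println("Uptime: " + uptime + " ms");
        }
        server.stop();
    }
}
```

The service URL with `/jndi/rmi://...` is exactly where the two-step handshake hides: the client first resolves the stub in the RMI registry, then talks to the RMI server it points at.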

JMX is a capable protocol, but it has some idiosyncrasies due to its JEE lineage. In particular, it has specific requirements for network topology.

JMX is based on the Java RMI protocol. Accessing the JMX agent involves a two-step handshake.

In the first step, the client makes a request to the RMI registry and receives a serialized remote interface stub. The JMX agent has a built-in single-object registry, which is exposed on port 5555 in our example.

In the second step, the client accesses the remote interface via the network address embedded in the stub object received in the first step.

In a trivial network this is not an issue, but if there is any form of NAT or proxy between the JMX client and the JVM, things are likely to break.

So we have two issues here:

1.    The stub could be exposed on a different port number, which may not be whitelisted.

2.    The stub may provide some kind of internal IP, not routable from the client host.

The first issue is easily solvable with the com.sun.management.jmxremote.rmi.port property, which can be set to the same value as the registry port (5555 in our example).

The second issue is much more tricky, as the JVM may be totally unaware of the IP visible from outside; even worse, such an IP could be dynamic, so it cannot be configured via the JVM command line.

In this article, I will describe a few recipes for dealing with JMX in the modern container/cloud world. None of them is ideal, but I hope at least one will be useful for you.

Configuring JMX for known external IP address

If you know a routable IP address, the solution is to configure the JVM to provide a specific IP inside the remote interface stub. An example of this situation would be running a JVM in a local Docker container.

The JVM parameter -Djava.rmi.server.hostname=<MyHost> can be used to override the IP in remote stubs provided by the JMX agent. This parameter affects all RMI communication, but RMI is rarely used nowadays besides the JMX protocol.

The resulting communication scheme is outlined in the diagram below.

JVM options

-Djava.rmi.server.hostname=1.2.3.4
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.port=5555
-Dcom.sun.management.jmxremote.rmi.port=5555

Communication diagram

Configuring JMX for tunneling

In some situations, the IP address of the JVM host may not even be reachable from the JMX client host. Here are a couple of typical examples:

     You are using SSH to access the internal network through a bastion host.

     The JVM is in a Kubernetes pod.

In both situations, you can use port forwarding to establish network connectivity between the JMX client and the JVM.

Again, you need to override the IP in the remote service stub, but now you have to set it to 127.0.0.1.

The communication diagram is shown below.

JVM options

-Djava.rmi.server.hostname=127.0.0.1
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.port=5555
-Dcom.sun.management.jmxremote.rmi.port=5555

Communication diagram

In the case of SSH, you can use port forwarding. In Kubernetes, there is also the handy kubectl port-forward command, which allows you to communicate with a pod directly.

You can even chain port-forwarding multiple times.

Though this approach has its own limitations.

     JMX will no longer be available to remote hosts without port forwarding, so this configuration may interfere with monitoring agents running in your cluster and collecting JMX metrics.

     You cannot connect to multiple JVMs using the same JMX port (e.g. pods from a single deployment), as your port on the client host is bound to a particular remote destination. Remapping ports will break JMX.

Using HTTP JMX connector

The root of the problem is the RMI protocol, which is archaic and hasn't evolved to support modern network topologies. JMX is flexible enough to use alternative transport layers, and one of them is HTTP (using the Jolokia open source project).

This implementation doesn't come out of the box, though. You have to ship the Jolokia agent jar with your application and introduce it via the JVM command line as a Java agent (see details here).
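For illustration, a JVM start line with the Jolokia JVM agent attached might look like the sketch below. The jar path and application jar name are placeholders; host and port are standard Jolokia agent options, with 8778 being Jolokia's default port.

```shell
# Attach the Jolokia JVM agent (jar path is illustrative); it serves
# JMX over plain HTTP, by default on port 8778
java -javaagent:/opt/jolokia/jolokia-jvm-agent.jar=host=0.0.0.0,port=8778 \
     -jar myapp.jar
```

Since the transport is plain HTTP, it passes through NAT, proxies, and Kubernetes ingress without any of the RMI stub trouble described above.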

The good news is that nowadays tools such as VisualVM and Mission Control fully support the Jolokia JMX transport. Below are a few demo videos for the Jolokia project:

     Jolokia from JMC

     Connect Visual VM to a JVM running in Kubernetes using Jolokia

     Connect Java Mission Control to a JVM in Kubernetes

In addition to classic tools, the Jolokia HTTP endpoint is accessible from client-side JavaScript, so a web client is also possible. See the Hawt.IO project, which implements a diagnostic web console for Java on top of Jolokia.

Using SJK JMX proxy

Dealing with JMX over the years, at some point I decided to make a diagnostic tool specifically for JMX connectivity troubleshooting.

It is part of SJK, my jack-of-all-knives solution for JVM diagnostics. The mxping command can help identify which part of the JMX handshake is broken.

While implementing mxping, I realized that I can address the root cause of RMI network sensitivity by messing with the JMX client code. As I am not eager to patch all the JMX tools around, I have introduced a JMX proxy (mxprx), which can be placed between the JMX client and the remote JVM.

Using the JMX proxy eliminates the issues with the port forwarding scenario mentioned above:

     It does not require -Djava.rmi.server.hostname=127.0.0.1 on the JVM side.

     It allows you to remap ports and thus keep multiple ports forwarded at the same time.

Below is a communication diagram using JMX proxy from SJK.

In addition, with the JMX proxy, ad hoc configuration of a JMX endpoint without a JVM restart becomes possible.

The JMX agent can be started and configured at runtime via jcmd, but java.rmi.server.hostname can only be set on the JVM command line. With the JMX proxy, we do not rely on java.rmi.server.hostname anymore!

Below are the steps to connect to a JVM in a Kubernetes pod even if JMX was not configured upfront.

1.    Enter the container shell using the kubectl exec command.

2.    In the container, use jcmd <pid> ManagementAgent.start to start the JMX agent (see more details here).

3.    Forward the port from the container to your local host.

4.    Start the JMX proxy on your host, pointing it at localhost:<port forwarded from container>, and provide some outbound port (see more details here).

5.    Now you can connect any JMX-aware tool via the locally running JMX proxy.

Conclusion

I have listed four alternative approaches to JMX setup. None of them is universal, unfortunately, and you have to pick the one most suitable for your case.

While JMX is somewhat archaic, it is still essential for JVM monitoring, and you are likely to have to deal with it in any serious Java-based system.

I hope someday HTTP will become a built-in default for the JVM, and all this trickery will become a horror story from the old days.

Monday, March 11, 2019

Lies, darn lies and sampling bias

Sampling profiling is a very powerful technique, widely used across various platforms for identifying hot code (execution bottlenecks).

In the Java world, sampling profiling (thread stack sampling, to be precise) is supported by every serious profiler.

While powerful and very handy in practice, sampling has a well-known weakness: sampling bias. It is a real problem, though its practical impact is often exaggerated.

A picture is worth a thousand words, so let me jump-start with an example.

Case 1

Below is a simple snippet of code. It calculates a cryptographic hash over a bunch of random strings.

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.concurrent.TimeUnit;

public class CryptoBench {

    private static final boolean trackTime = Boolean.getBoolean("trackTime");

    public static void main(String[] args) {
        CryptoBench test = new CryptoBench();
        while (true) {
            test.execute();
        }
    }

    public void execute() {
        long N = 5 * 1000 * 1000;
        // RandomStringUtils here is a small helper from the accompanying
        // github repo (not the Apache Commons class); generate() returns
        // a random string
        RandomStringUtils randomStringUtils = new RandomStringUtils();
        long ts = 0, tf = 0;
        long timer1 = 0;
        long timer2 = 0;
        long bs = System.nanoTime();
        for (long i = 0; i < N; i++) {
            ts = trackTime ? System.nanoTime() : 0;
            String text = randomStringUtils.generate();
            tf = trackTime ? System.nanoTime() : 0;
            timer1 += tf - ts;
            ts = tf;
            crypt(text);
            tf = trackTime ? System.nanoTime() : 0;
            timer2 += tf - ts;
            ts = tf;
        }
        long bt = System.nanoTime() - bs;
        System.out.print(String.format("Hash rate: %.2f Mm/s", 0.01 * (N * TimeUnit.SECONDS.toNanos(1) / bt / 10000)));
        if (trackTime) {
            System.out.print(String.format(" | Generation: %.1f %%", 0.1 * (1000 * timer1 / (timer1 + timer2))));
            System.out.print(String.format(" | Hashing: %.1f %%", 0.1 * (1000 * timer2 / (timer1 + timer2))));
        }
        System.out.println();
    }

    public String crypt(String str) {
        if (str == null || str.length() == 0) {
            throw new IllegalArgumentException("String to encrypt cannot be null or zero length");
        }
        StringBuilder hexString = new StringBuilder();
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            md.update(str.getBytes());
            byte[] hash = md.digest();
            for (byte aHash : hash) {
                if ((0xff & aHash) < 0x10) {
                    hexString.append("0" + Integer.toHexString((0xFF & aHash)));
                } else {
                    hexString.append(Integer.toHexString(0xFF & aHash));
                }
            }
        } catch (NoSuchAlgorithmException e) {
            e.printStackTrace();
        }
        return hexString.toString();
    }
}
The code is available on github.

Now let's use VisualVM (a profiler bundled with Java 8) and look at how much time is actually spent in the CryptoBench.crypt() method.

Something is definitely off in the screenshot above!

CryptoBench.crypt(), the method doing the actual cryptography, is attributed only 33% of execution time.
At the same time, CryptoBench.execute() has 67% of self time, and that method does nothing besides calling other methods.

Probably I just need a cooler profiler here. /s

Let's use Java Flight Recorder for the very same case.
Below is a screenshot from Mission Control.


That looks much better!

CryptoBench.crypt() now takes 86% of our time budget. The rest of the time is spent in random string generation.
These numbers look more believable to me.

Wait, wait, wait!

Integer.toHexString() is taking as much time as the actual MD5 calculation. I cannot believe that.

The numbers are better than the ones produced by VisualVM, but they are still fishy.

Flight recorder is not cool enough for that task! We need really cool profiler! /s

Ok, let me bring some sense into this discrepancy between tools.

We were using thread stack sampling in both tools (VisualVM and Flight Recorder). However, these tools capture stack traces differently.

VisualVM actually samples thread dumps (via the thread dump support in the JVM). Thread dumps include stack traces for every application thread in the JVM, regardless of the thread's state (blocked, sleeping or actually executing code), and the dump is taken atomically. It reflects the instant execution state of the whole JVM (which is important for deadlock/contention analysis). In practice, that implies a short Stop the World pause for each dump. A Stop the World pause means a safepoint in the HotSpot JVM. And safepoints bring some nuances.

When VisualVM requests a thread dump, the JVM notifies threads to suspend execution, but a thread executing Java code won't stop immediately (unless it is interpreted). The thread continues to run until the next safepoint check, where it can suspend itself. Checks cost CPU cycles, so they are sparse in JIT-generated code.

Checks are placed inside loops and after method returns. However, checks are omitted for loops considered "fast" by the JIT compiler (typically integer-indexed loops). Small methods are aggressively inlined too, hence omitting the safepoint check at return. As a consequence, hot and calculation-intensive code may be optimized by the JIT into a single chunk of machine code which is mostly free of safepoint checks.

If you are lucky, the thread dump will show you the line invoking the method containing the hot code. With less luck, the result will be even more misleading.

So in the VisualVM call tree, we see the method CryptoBench.execute() at the top of the stack for 66% of samples. If we were able to see the call tree at line-number granularity, it would be the line calling the CryptoBench.crypt() method.

Bad, ugly safepoint bias, I've caught you red-handed! /s

So, how does Flight Recorder sample stacks, and why are the numbers different?

Flight Recorder sampling doesn't involve full thread dumps. Instead, it freezes threads one by one using OS-provided facilities. Once a thread is frozen, we can get the address of the next instruction to be executed out of the stack memory area. The address of the instruction is converted into a line number of Java source code via a bytecode-to-machine-code symbol map. The map is generated during JIT compilation. This is how the stack trace is reconstructed.

In the case of Flight Recorder, safepoint bias does not apply. Yet the results still look inaccurate. Why?

Below is another session with Flight Recorder for the very same code.

 

Picture is different now.

Integer.toHexString() is just 2.25% of our execution budget, which is more trustworthy in my eyes.

Flight Recorder has to resolve memory addresses back to references to bytecode instructions (which are further translated into Java source lines). The mapping generated by the JIT compiler is used for that purpose.

However, the compiler is aware that we can see thread stack traces only at safepoints. By default, only safepoint checks are mapped to bytecode instruction indexes. Flight Recorder takes the execution address from the stack, then finds the next address mapped to Java code in the symbol table. In the case of aggressive inlining, Flight Recorder can map the address to a completely wrong point in the code.

So although the sampling itself is not biased by safepoints, the symbol map generated by the JIT compiler is.

In the second example, I've used two JVM options to force more detailed symbol maps to be generated by the JIT compiler. The options are below.

-XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints

A more accurate, bias-free symbol map allows Flight Recorder to produce more accurate stack traces.

In our mental model, code is executed line by line (bytecode instruction by instruction). But the compiler lumps a bunch of methods together and generates a single blob of machine code, aggressively reordering operations in the middle of the process to make the code faster.
Our mental model of line-by-line execution is totally broken by compiler optimization.

Though, in practice, the artifacts of operation reordering are not as striking as safepoint bias.

So Java Flight Recorder is cool and VisualVM is not. Should I draw this conclusion?

Let me present a counter example.

Case 2

Below are profiling reports from a different case.

Now I'm using flame graphs generated from data captured by VisualVM and Flight Recorder (with -XX:+DebugNonSafepoints).

Visual VM report 

 

Flight Recorder report 

Both graphs show InflaterInputStream to be a bottleneck. However, VisualVM assesses the time spent as 98%, while in Flight Recorder it is just 47%.

Who is right?

The correct answer is 92% (approximated using differential analysis).

My heart is broken! Flight Recorder is not a silver bullet. /s

What has gone wrong?

In this example, the hot spot was related to the JNI overhead involved in calling native code in zlib. It seems Flight Recorder was unable to reconstruct stack traces for certain samples outside of Java code and dropped those samples. The sample population was biased by native code execution. That bias played against Flight Recorder in this case.

Conclusion

Both profilers are doing what they are intended to do. Some sort of bias is natural for almost any kind of sampling.

Each sampling profiler can be characterized by three aspects.

  • Blind spot bias – which samples are excluded from the data set collected by the profiler.
  • Attractor bias – how samples are attracted to specific discrete points (e.g. safepoints).
  • Resolution – the unit of code which profiling data is aggregated to (e.g. method, line number, etc.).

Below is a summary table for the sampling methods mentioned in this article.

                          | Blind spot              | Attractor                     | Resolution
JVM Thread Dump Sampling  | non-Java threads        | safepoint bias                | Java frames only
Java Flight Recorder      | non-Java code execution | CPU pipeline bias             | Java frames only
                          |                         | + code-to-source mapping skew |
Java Flight Recorder      | non-Java code execution | CPU pipeline bias             | Java frames only
+ DebugNonSafepoints      |                         |                               |


Wednesday, May 30, 2018

SJK is learning new tricks

SJK (or Swiss Java Knife) has been my secret weapon for firefighting various types of performance problems for a long time.

A new version of SJK was released not too long ago, and it contains a bunch of new and powerful features I would like to highlight.

ttop contention monitoring

SJK lives up to its name by bundling a number of tools into a single executable jar. ttop is likely the single most commonly used tool under the SJK roof, though.

ttop is a kind of top for the threads of a JVM process. Besides the CPU usage counter (provided by the OS) and the allocation rate (tracked by the JVM), new thread contention metrics were introduced in the recent SJK release.

The thread contention metrics are calculated by the JVM, which counts and times Java threads' transitions into the BLOCKED or WAITING state.

If enabled, SJK uses these metrics to display the rate of and percentage of time spent in either state.

2018-05-29T14:20:03.382+0300 Process summary 
  process cpu=231.09%
  application cpu=212.78% (user=195.86% sys=16.92%)
  other: cpu=18.31% 
  thread count: 157
  GC time=4.72% (young=4.72%, old=0.00%)
  heap allocation rate 976mb/s
  safe point rate: 6.3 (events/s) avg. safe point pause: 8.24ms
  safe point sync time: 0.07% processing time: 5.09% (wallclock time)
[000180] user=19.40% sys= 0.31% wait=183.6/s(75.77%) block=    0/s( 0.00%) alloc=  110mb/s - hz._hzInstance_2_dev.cached.thread-8
[000094] user=16.92% sys= 0.16% wait=58.50/s(81.54%) block=    0/s( 0.00%) alloc=   94mb/s - hz._hzInstance_3_dev.generic-operation.thread-0
[000057] user=15.05% sys= 0.62% wait=56.91/s(82.35%) block= 0.20/s( 0.01%) alloc=   91mb/s - hz._hzInstance_2_dev.generic-operation.thread-0
[000095] user=15.21% sys= 0.00% wait=55.61/s(82.32%) block= 0.30/s( 0.04%) alloc=   87mb/s - hz._hzInstance_3_dev.generic-operation.thread-1
[000022] user=14.59% sys= 0.00% wait=56.01/s(83.42%) block= 0.30/s( 0.08%) alloc=   86mb/s - hz._hzInstance_1_dev.generic-operation.thread-1
[000058] user=13.97% sys= 0.16% wait=56.91/s(84.13%) block= 0.10/s( 0.02%) alloc=   81mb/s - hz._hzInstance_2_dev.generic-operation.thread-1

An important fact about these metrics: in an ideal world, CPU time + WAITING + BLOCKED should add up to 100%.

In reality, you are likely to see a gap. A few reasons why the equation above may not hold:

  • GC pauses freeze thread execution but are not accounted for by thread contention monitoring,
  • a thread may be waiting for an IO operation, which is not accounted as a BLOCKED or WAITING state by the JVM,
  • the system may starve for CPU resources, and a thread may be waiting for a CPU core at the OS level (which is also not accounted for by the JVM).

Contention monitoring is not enabled by default; use the -c flag with the ttop command to enable it.
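Under the hood, these numbers come from the JVM's standard thread contention monitoring, which can be toggled via ThreadMXBean. A minimal self-contained sketch (the timed wait simply stands in for real contention):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ContentionDemo {
    public static void main(String[] args) throws Exception {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Contention monitoring is off by default, just like in ttop
        if (mx.isThreadContentionMonitoringSupported()) {
            mx.setThreadContentionMonitoringEnabled(true);
        }

        // Accumulate some WAITING time in the current thread
        synchronized (ContentionDemo.class) {
            ContentionDemo.class.wait(200);
        }

        ThreadInfo ti = mx.getThreadInfo(Thread.currentThread().getId());
        System.out.println("waited " + ti.getWaitedCount()
                + " times, " + ti.getWaitedTime() + " ms total");
    }
}
```

ttop aggregates exactly this kind of per-thread counter over its refresh interval to produce the wait/block rates and percentages shown above.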

HTML5 based flame graph

SJK has been able to produce flame graphs for some time already. However, the old flame graphs were generated as SVG with limited interactivity.

The new version offers a new type of flame graph, based on HTML5 and interactive. Right in the browser, it allows:

  • filtering data by threads,
  • zooming into specific paths or by the presence of a specific frame,
  • filtering data by thread state (if state information is available).

The HTML5 report is a 100% self-contained file with no dependencies; it can be sent by email and opened on any machine. Here is an example of the new flame graph you can play with right now.

The new flame command is used to generate HTML5 flame graphs.

`jstack` dump support

SJK accepts a number of input data formats for thread sampling data, which are used for flame graphs and other types of performance analysis.

A new format added in version 0.10 is the text thread dump format produced by jstack. The full list of input formats is now:

  • SJK native thread sampling format
  • JVisualVM sampling snapshots (.nps)
  • Java Flight Recorder recording (.jfr)
  • jstack produced text thread dumps

Thursday, June 15, 2017

HeapUnit - Test your Java heap content

There are usually a number of tests which you would like to run for each build to make sure that your code makes sense. Typically, such tests focus on the business function of your code.

Though, on a rare occasion, you would really like to test certain non-functional aspects. A memory/resource leak would be a good example.

How would you test memory leak?

This is quite a challenge, right?

You can use a debugger or profiler to inspect the internal state of your system. Though, that approach assumes manual testing.

You can write a test which stresses your system, provoking an OutOfMemoryError that fails the test if the code has a defect. That generally works, though adding a stress test to a mostly functional automatic test pack may not be the best idea. This approach also may not work for other kinds of resource leaks.

You can exploit weak or phantom references to trace the garbage collector's work. This approach makes the test more lightweight compared to fully fledged stress testing, but it is not applicable in many cases. E.g. you may not have a reference to the leak-suspected objects.

For some time I have been actively practising automated inspection of JVM heap dumps for diagnostic purposes. A JVM can easily produce its own heap dump (using the JVM attach interface), and that dump can be inspected via an API to assert certain invariants (e.g. the number of live instances of a particular type). Why not use it for resource leak testing and similar cases?
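Triggering the dump itself needs no external tooling. A minimal sketch using the standard HotSpotDiagnosticMXBean (one way for a JVM to dump its own heap; the file name is arbitrary):

```java
import java.io.File;
import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public class HeapDumpDemo {
    public static void main(String[] args) throws Exception {
        File dump = new File("self.hprof");
        if (dump.exists()) dump.delete();   // dumpHeap refuses to overwrite

        HotSpotDiagnosticMXBean hs = ManagementFactory
                .getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // true = dump live objects only, which forces a GC first
        hs.dumpHeap(dump.getPath(), true);

        System.out.println("dump size: " + dump.length() + " bytes");
        dump.delete();                      // tidy up
    }
}
```

A test can then feed the resulting .hprof file to a heap analysis API and assert on instance counts.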

Resurrecting object from dump

The heap dump API allows you to inspect fields of dumped objects; there is also a heap path notation for writing sophisticated selectors. However, you cannot invoke methods, not even toString() or equals(), on objects from the dump. For quantitative analysis this is OK, but for asserting complex conditions typical of test scenarios, dealing with real Java objects is much more convenient.

A heap dump doesn't contain full class information. But if the dump is produced from the JVM we are running in, we can rely on class metadata available through reflection.

The Objenesis library and Java reflection are used to convert instance data from the heap dump back to normal Java objects.

In the end, usage of HeapUnit is fairly simple. Using the API, you can:

  • take a heap dump
  • select certain instances from the dump by class or by heap path notation
  • inspect an instance's fields using symbolic names
  • or rehydrate an instance into a Java object

Example

Below is a simple example listing Socket objects in the JVM.

@Test
public void printSockets() throws IOException {

    ServerSocket ss = new ServerSocket();
    // sock(port) is a helper from the full example, presumably creating
    // an InetSocketAddress for localhost and the given port
    ss.bind(sock(5000));

    Socket s1 = new Socket();
    Socket s2 = new Socket();

    s1.connect(sock(5000));
    s2.connect(sock(5000));

    ss.close();
    s1.close();
    // s2 remains unclosed

    HeapImage hi = HeapUnit.captureHeap();

    for(HeapInstance i: hi.instances(SocketImpl.class)) {
        // fd field in SocketImpl class is nullified when socket gets closed
        boolean open = i.value("fd") != null;
        System.out.println(i.rehydrate() + (open ? " - open" : " - closed"));
    }
}

The HeapUnit library is available in the Maven Central repo. You can bring it into your project using the Maven coordinates below.

<dependency>
    <groupId>org.gridkit.heapunit</groupId>
    <artifactId>heapunit</artifactId>
    <version>0.2</version>
</dependency>

Tuesday, October 25, 2016

HotSpot JVM garbage collection options cheat sheet (v4)

After three years, I have decided to update my GC cheat sheet.

The new version finally includes G1 options; thankfully, there are not very many of them. There are also a few useful options introduced for CMS, including parallel initial mark and initiating concurrent cycles by timer.

Finally, I made separate cheat sheet versions for Java 7 and Java 8.

Below are links to PDF versions

Java 8 GC cheat sheet

Java 7 GC cheat sheet

Friday, September 16, 2016

How to measure object size in Java?

You define fields, their names and types, in the source of a Java class, but it is the JVM that decides how they will be stored in physical memory.

Sometimes you want to know exactly how much a Java object weighs in memory. Answering this question is surprisingly complicated.

Challenge

  • Pointer size and Java object header size vary.
  • The JVM can be built for a 32- or 64-bit architecture. On 64-bit architectures, the JVM may or may not use compressed pointers (-XX:+UseCompressedOops).
  • Object padding may differ (-XX:ObjectAlignmentInBytes=X).
  • Different field types may have different alignment rules.
  • The JVM may reorder fields in the object layout as it likes.

The figure below illustrates how the JVM may rearrange fields in memory.

Guessing object layout

You can scrape class fields via reflection and try to guess the layout chosen by the JVM, taking into account the platform pointer size and other factors.

... at least you can try.

Using the Unsafe

sun.misc.Unsafe is an internal helper class used by JVM code. You should not use it, but you can (with some help from reflection). Unsafe is popular among people doing weird things with the JVM.

Unsafe can let you query information about the physical layout of Java objects. However, it will not directly tell you the real size of an object in memory. You would still have to do some error-prone math to calculate the object's size.

Here is an example of such code.
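In the same spirit, here is a minimal sketch: the field offsets are facts reported by Unsafe, while turning them into a total object size is exactly the error-prone math mentioned above (the Sample class is illustrative).

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class UnsafeLayout {
    static class Sample { byte a; long b; int c; }

    public static void main(String[] args) throws Exception {
        // Grab the Unsafe singleton via reflection
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        // Print the offset the JVM chose for each field; note the order
        // in memory may not match the declaration order
        for (Field fld : Sample.class.getDeclaredFields()) {
            System.out.println(fld.getName() + " @ offset "
                    + unsafe.objectFieldOffset(fld));
        }
    }
}
```

From the largest offset plus the size of that field, padded to the object alignment, you can estimate the instance size, which is where the guesswork creeps in.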

Instrumentation agent

java.lang.instrument.Instrumentation is an API for profilers and other performance tools. You need to install an agent into the JVM to get an instance of this class. The class has a handy getObjectSize(...) method which will tell you the real object size.

There is a library, jamm, which exploits this option. You have to use special JVM start options, though.

Threading MBean

The Threading MBean in the JVM has a handy allocation counter. Using this counter, you can easily measure object size by allocating a new instance and checking the delta of the counter. The snippet below does just that.

import java.lang.management.ManagementFactory;

public class MemMeter {

    private static long OFFSET = measure(new Runnable() {
        @Override
        public void run() {
        }
    });

    /**
     * @return amount of memory allocated while executing provided {@link Runnable}
     */
    public static long measure(Runnable x) {
       long now = getCurrentThreadAllocatedBytes();
       x.run();
       long diff = getCurrentThreadAllocatedBytes() - now;
       return diff - OFFSET;
    }

    @SuppressWarnings("restriction")
    private static long getCurrentThreadAllocatedBytes() {
        return ((com.sun.management.ThreadMXBean)ManagementFactory.getThreadMXBean()).getThreadAllocatedBytes(Thread.currentThread().getId());
    }
}

Below is a simple usage example.

System.out.println("size of java.lang.Object is " 
+ MemMeter.measure(new Runnable() {

    Object x;

    @Override
    public void run() {
        x = new Object();
    }
}));

Though, this approach requires you to create a new instance of an object to measure its size. That may be an obstacle.

jmap

jmap is one of the JDK tools. With the jmap -histo PID command, you can print a histogram of your heap objects.

num     #instances         #bytes  class name
---------------------------------------------
  1:       1413317      111961288  [C
  2:        272969       39059504  <constMethodKlass>
  3:       1013137       24315288  java.lang.String
  4:        245685       22715744  [I
  5:        272969       19670848  <methodKlass>
  6:        206682       17868464  [B
  7:         29355       17722320  <constantPoolKlass>
  8:        659710       15833040  java.util.HashMap$Entry
  9:         29355       12580904  <instanceKlassKlass>
 10:        105637       12545112  [Ljava.util.HashMap$Entry;
 11:        170894       11797400  [Ljava.lang.Object;

For ordinary objects, you can divide the byte size by the instance count to get the individual instance size for a class (e.g. 24315288 / 1013137 = 24 bytes per java.lang.String in the histogram above). This does not work for arrays, though.

Java Object Layout tool

The Java Object Layout tool uses a number of different approaches to introspect the physical layout of Java objects in memory.

Thursday, July 21, 2016

Rust, JNI, Java

Recently, I needed to make some calls to kernel32.dll from my Java code: just a few system calls on the Windows platform, as simple as it sounds. Plus, I wanted to keep the resulting binary as small as possible.

The latter requirement added a fair challenge to the task.

How to call platform code for Java?

JNI - Java Native Interface

JNI is built into the JVM and is part of the Java standard. Sounds good, but there is a catch. To call native code from Java via JNI, you have to write native code (e.g. using the C language). That is it: JNI requires some glue code (aka bindings) between native calls and Java methods.
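On the Java side, the binding is just a native method declaration plus a library load; a sketch (the library and method names are illustrative, and the matching native implementation must live in the glue dll):

```java
public class Kernel32 {
    // Declared native: the implementation must be provided by the glue
    // dll, as an exported function named Java_Kernel32_getTickCount
    public static native long getTickCount();

    public static void main(String[] args) {
        try {
            System.loadLibrary("myjni"); // the glue dll we are about to build
            System.out.println("tick count: " + getTickCount());
        } catch (UnsatisfiedLinkError e) {
            System.out.println("glue library not found, build it first");
        }
    }
}
```

The whole point of the rest of this post is producing that dll (and keeping it small).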

Hmm... do we have other alternatives?

JNA - Java Native Access

JNA is an alternative to JNI. You can call native code from Java with no glue code. Cool, but what is the cost?

The JNA jar has a size of 1.1 MiB. An extra megabyte just to make a couple of simple calls to the Windows kernel - not a deal.

Back to JNI

Ok, I need to write some glue code for JNI. Which language should I choose?

C/C++ - no, just no. The C/C++ toolchain (compiler, headers, build tools) is an abomination, especially on Windows. Please, I just need literally half a screen of code compiled to a dll binary. I do not want 10 GiB worth of Visual Studio polluting my desktop.

Die hard Java guy is speaking :)

Free Pascal

Pascal is an ancient language. It was programming language of my youth. MS DOS, Turbo Pascal ... colors were so bright these days.

Twenty years later, I was surprised to find Pascal in pretty good shape. Free Pascal has an impressive list of supported platforms. The Pascal compiler is lightning fast. The produced binaries have no dependency on libc / msvcrt.

Using Free Pascal I get my kernel32-to-JNI dll with size of 33 KiB. That sounds much, much better.

Can we do better?

Rust

Rust is the new kid on the language block. It has a strong ambition to replace C/C++ as a systems-level language. It gives you all the power of C plus memory safety, a modernized build system, and language-level modules (crates).

Sounds promising, let's try Rust for little JNI glue dll.

Calling

rustc -C debuginfo=0 --crate-type dylib myjni.rs

the result is a disappointing 2.5 MiB binary.

A Rust dylib is a dll which can be used by other Rust code, so it exposes a lot of language-specific metadata. cdylib is a new packaging format introduced in Rust 1.10 which is more suitable for JNI bindings.

Command line

rustc -C lto -C debuginfo=0 --crate-type cdylib myjni.rs

produced a 1.6 MiB binary. The -C lto option instructs the compiler to do "link time optimization". For some reason, cdylib would not compile without the lto option for me.

Ok, the direction is right, but we need to go much further. Let's try more compiler options.

Command line

rustc -C opt-level=3 -C lto -C debuginfo=0 --crate-type cdylib myjni.rs

produced a 200 KiB binary. Optimization allows the compiler to throw away a big portion of the standard library which I will never need for my simple JNI binding.

Though, a large portion of the standard library is still there.

In Rust you can fully turn off the standard library (e.g. to run on bare metal).

Normally you would need at least memory management, but for a simple JNI binding you can get away with stack allocation only.

At the moment, using Rust with the no_std option requires a nightly build of the compiler. I also had to rewrite some portions of the kernel32 and JNI declarations to avoid dependency on libc types.

rustc -C opt-level=3 -C panic=abort -C lto -C debuginfo=0 --crate-type cdylib myjni.rs

The binary size is 22.5 KiB.

Cool, we have beaten Free Pascal.

One more tweak: executing strip -s on the resulting dll brings the final binary size to 16.9 KiB.

Honestly, 16.9 KiB for a couple of calls is still overkill. But I'm not desperate enough to try assembly for a JNI binding, at least not today.

Conclusion

Free Pascal. IMHO, Free Pascal is a good choice if you need simple JNI bindings. As a bonus, Free Pascal on Linux has no dependency on the platform's dynamic libraries, so you can build cross-Linux-distro binaries.

Rust. I believe Rust has great potential. Rust has a unique memory safety model, yet it lets you get as close to bare metal as C does. Among other features, Rust has really promising cross-compiling capabilities, which give it a very strong position in the embedded / IoT space.

Yet, Rust needs to get more stable. The no_std feature is not available in the latest (1.10) stable release. cdylib is not supported by the latest stable cargo tool. The Rust tool chain on Windows depends on either MS Visual Studio or MSys, and the resulting binaries are slightly incompatible with each other (the Oracle JVM is built with Visual Studio, so using MSys-built JNI bindings leads to process crashes in certain cases).

Wednesday, March 16, 2016

Finalizers and References in Java

Automatic memory management (garbage collection) is one of the essential aspects of the Java platform. Garbage collection relieves developers from the pain of memory management and protects them from a whole range of memory-related issues. Though, working with external resources (e.g. files and sockets) from Java becomes tricky, because the garbage collector alone is not enough to manage such resources.

Originally Java had the finalizer facility. Later, special reference classes were added to deal with the same problem.

If we have some external resource which should be deallocated explicitly (a common case with native libraries), this task can be solved using either a finalizer or a phantom reference. What is the difference?

Finalizer approach

The code below implements resource housekeeping using a Java finalizer.

public class Resource implements ResourceFacade {

    public static AtomicLong GLOBAL_ALLOCATED = new AtomicLong(); 
    public static AtomicLong GLOBAL_RELEASED = new AtomicLong(); 

    int[] data = new int[1 << 10];
    protected boolean disposed;

    public Resource() {
        GLOBAL_ALLOCATED.incrementAndGet();
    }

    public synchronized void dispose() {
        if (!disposed) {
            disposed = true;
            releaseResources();
        }
    }

    protected void releaseResources() {
        GLOBAL_RELEASED.incrementAndGet();
    }    
}

public class FinalizerHandle extends Resource {

    protected void finalize() {
        dispose();
    }
}

public class FinalizedResourceFactory {

    public static ResourceFacade newResource() {
        return new FinalizerHandle();
    }    
}

Phantom reference approach

public class PhantomHandle implements ResourceFacade {

    private final Resource resource;

    public PhantomHandle(Resource resource) {
        this.resource = resource;
    }

    public void dispose() {
        resource.dispose();
    }    

    Resource getResource() {
        return resource;
    }
}

public class PhantomResourceRef extends PhantomReference<PhantomHandle> {

    private Resource resource;

    public PhantomResourceRef(PhantomHandle referent, ReferenceQueue<? super PhantomHandle> q) {
        super(referent, q);
        this.resource = referent.getResource();
    }

    public void dispose() {
        Resource r = resource;
        if (r != null) {
            r.dispose();
        }        
    }    
}

public class PhantomResourceFactory {

    private static Set<Resource> GLOBAL_RESOURCES = Collections.synchronizedSet(new HashSet<Resource>());
    private static ResourceDisposalQueue REF_QUEUE = new ResourceDisposalQueue();
    private static ResourceDisposalThread REF_THREAD = new ResourceDisposalThread(REF_QUEUE);

    public static ResourceFacade newResource() {
        ReferedResource resource = new ReferedResource();
        GLOBAL_RESOURCES.add(resource);
        PhantomHandle handle = new PhantomHandle(resource);
        PhantomResourceRef ref = new PhantomResourceRef(handle, REF_QUEUE);
        resource.setPhantomReference(ref);
        return handle;
    }

    private static class ReferedResource extends Resource {

        @SuppressWarnings("unused")
        private PhantomResourceRef handle;

        void setPhantomReference(PhantomResourceRef ref) {
            this.handle = ref;
        }

        @Override
        public synchronized void dispose() {
            handle = null;
            GLOBAL_RESOURCES.remove(this);
            super.dispose();
        }
    }

    private static class ResourceDisposalQueue extends ReferenceQueue<PhantomHandle> {

    }

    private static class ResourceDisposalThread extends Thread {

        private ResourceDisposalQueue queue;

        public ResourceDisposalThread(ResourceDisposalQueue queue) {
            this.queue = queue;
            setDaemon(true);
            setName("ReferenceDisposalThread");
            start();
        }

        @Override
        public void run() {
            while(true) {
                try {
                    PhantomResourceRef ref = (PhantomResourceRef) queue.remove();
                    ref.dispose();
                    ref.clear();
                } catch (InterruptedException e) {
                    // ignore
                }
            }
        }
    }
}

Implementing the same task using phantom references requires more boilerplate. We need a separate thread to handle the reference queue and, in addition, we need to keep strong references to the allocated reference objects.

How finalizers work in Java

Under the hood, finalizers work very similarly to our phantom reference implementation, though the JVM hides the boilerplate from us.

Each time an instance of a class with a finalizer is created, the JVM creates an instance of the FinalReference class to track it. Once the object becomes unreachable, the FinalReference is triggered and added to a global final reference queue, which is processed by the system finalizer thread.

So the finalizer and phantom reference approaches work very similarly. Why should you bother with phantom references?
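FinalReference itself is JDK-internal, but the same "referent dies, reference object gets enqueued" mechanics can be observed with the public reference classes. Below is a small illustrative sketch (not from the original article); it relies on System.gc() actually triggering a collection, which HotSpot normally honors:

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

public class RefQueueDemo {

    public static void main(String[] args) throws InterruptedException {
        ReferenceQueue<Object> queue = new ReferenceQueue<>();
        Object referent = new Object();
        WeakReference<Object> ref = new WeakReference<>(referent, queue);

        referent = null; // the referent is now unreachable

        // prompt GC until the reference object shows up in the queue
        Reference<?> polled = null;
        for (int i = 0; i < 1000 && polled == null; i++) {
            System.gc();
            Thread.sleep(10);
            polled = queue.poll();
        }
        System.out.println(polled == ref ? "reference enqueued" : "not enqueued");
    }
}
```

The system finalizer thread does essentially the same thing with FinalReference instances, calling finalize() on each dequeued object.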

Comparing GC impact

Let's run a simple test: a resource object is allocated and added to a queue; once the queue size hits a limit, the oldest reference is evicted and thrown away. For this test we will monitor reference processing via GC logs.

Running finalizer based implementation.

[GC [ParNew[ ... [FinalReference, 5718 refs, 0.0063374 secs] ... 
Released: 6937 In use: 59498

Running phantom based implementation.

[GC [ParNew[ ... [PhantomReference, 5532 refs, 0.0037622 secs] ... 
Released: 5468 In use: 38897

As you can see, once an object becomes unreachable, it needs to be handled in the GC reference processing phase. Reference processing is part of the Stop-the-World pause. If too many references become eligible for processing between collections, it may prolong the Stop-the-World pause significantly.

In the case above, there is not much difference between finalizers and phantom references. But let's change the workflow a little. Now we will explicitly dispose 99% of handles and rely on GC for only 1% of references (i.e. semi-automatic resource management).

Running finalizer based implementation.

[GC [ParNew[ ... [FinalReference, 6295 refs, 0.0070033 secs] ...
Released: 6707 In use: 1457

Running phantom based implementation.

[GC [ParNew[ ... [PhantomReference, 625 refs, 0.0001551 secs] ... 
Released: 21682 In use: 1217

For the finalizer-based implementation there is no difference: explicit resource disposal doesn't help reduce the GC overhead. But with phantoms, we can see that GC does not need to handle explicitly disposed references (so the number of references processed by GC is reduced by an order of magnitude).

Why is this happening? When a resource handle is disposed, we drop the reference to the phantom reference object. Once the phantom reference itself is unreachable, it will never be queued for processing by GC, thus saving time in the reference processing phase. It is quite the opposite with final references: once created, a FinalReference will be strongly referenced by the JVM until it is processed by the finalizer thread.
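This effect can be sketched directly: of two phantom references, only the one we keep strongly reachable is ever enqueued; the one whose reference object we drop is simply collected, bypassing reference processing. (This is an illustrative sketch, not code from the article, and it depends on System.gc() actually triggering collections.)

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;

public class PhantomDropDemo {

    public static void main(String[] args) throws InterruptedException {
        ReferenceQueue<Object> queue = new ReferenceQueue<>();

        Object kept = new Object();
        PhantomReference<Object> keptRef = new PhantomReference<>(kept, queue);

        Object dropped = new Object();
        // the reference object itself is discarded immediately - this mimics
        // explicit disposal, where we let go of the phantom reference
        new PhantomReference<>(dropped, queue);

        kept = null;
        dropped = null;

        // only keptRef can ever be enqueued; the dropped reference object
        // is itself unreachable and gets collected without queue processing
        Reference<?> polled = null;
        for (int i = 0; i < 1000 && polled == null; i++) {
            System.gc();
            Thread.sleep(10);
            polled = queue.poll();
        }
        System.out.println(polled == keptRef ? "kept ref enqueued" : "unexpected");
    }
}
```

With a FinalReference, dropping your last reference to the object would not have this effect: the JVM itself keeps the FinalReference strongly reachable until the finalizer thread has run.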

Conclusion

Using phantom references for resource housekeeping requires more work compared to the plain finalizer approach. But with phantom references you have far more granular control over the whole process and can implement a number of optimizations, such as hybrid (manual + automatic) resource management.

Full source code used for this article is available at https://github.com/aragozin/example-finalization.