Fix random failures from Live profiling #1376

Baraa-Hasheesh · 2025-07-09T15:28:38Z

Description

Fixes random test failures that happens for live profiling

https://github.com/async-profiler/async-profiler/actions/runs/16172680977/job/45650081518

Related issues

N/A

Motivation and context

solve flakey test

How has this been tested?

make test

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

apangin · 2025-07-09T17:05:34Z

Could you explain why the test was failing and what is the idea of the fix?

Baraa-Hasheesh · 2025-07-09T17:09:52Z

Basically from what I could tell in some cases the VM dies without any proper LIVE sampling by the profiler

The idea is to keep some allocated memory that is never cleared to garantee live reference

Also we call System.gc to update the live reference list in the profiler

apangin · 2025-07-09T17:43:34Z

The idea is to keep some allocated memory that is never cleared

OK, sounds good. I'd only make the amount of allocated memory limited, to ensure the app never throws OutOfMemoryError regardless of heap size and running duration.

Also we call System.gc to update the live reference list in the profiler

In what sense "update"? Why is it needed?

Baraa-Hasheesh · 2025-07-10T07:01:42Z

OK, sounds good. I'd only make the amount of allocated memory limited, to ensure the app never throws OutOfMemoryError regardless of heap size and running duration.

sure, I will limit the number of saved allocations

In what sense "update"? Why is it needed?

GC would cause the following method to be called

void gc() {
  _full = false;
}

This in turn will cause the next void add(JNIEnv* jni, jobject object, jlong size, u64 trace) to update the _refs variable in our async-profiler

information stored in _refs is what's dumped in LIVE profiling

Baraa-Hasheesh · 2025-07-10T07:12:49Z

@apangin memory limited here f2a581d

apangin · 2025-07-10T13:18:17Z

So what happens if System.gc is not called?
The test verifies that live option works. But if it does not work without System.gc call, it is a profiler bug, not a test bug. Nobody normally calls System.gc in their applications.

Baraa-Hasheesh · 2025-07-10T13:40:33Z

So what happens if System.gc is not called? The test verifies that live option works. But if it does not work without System.gc call, it is a profiler bug, not a test bug. Nobody normally calls System.gc in their applications.

In normal applications GC would be called naturally (we want a GC to happen as to update live reference list in profiler)

for this short lived test we want to give the best chances to get the latest data so I manually call System.gc

apangin · 2025-07-10T13:53:22Z

This test allocates 64KB arrays repeatedly in a loop, this is supposed to cause "natural" GC.
Again, if profiler does not detect a leak in this test, this is a strong sign of live mode working poorly.

Baraa-Hasheesh · 2025-07-10T14:08:22Z

This test allocates 64KB arrays repeatedly in a loop, this is supposed to cause "natural" GC. Again, if profiler does not detect a leak in this test, this is a strong sign of live mode working poorly.

That's a good point,
But to clarify LIVE is meant from my understanding to sample data that survives the GC right?

What's confirmed that in some cases we never see LIVE events which could mean that one of the following things happen

GC wasn't called / or was called at a bad time relative to profiler with the test
In some cases no data/content survived the GC

I will need to check which is which on this case

Baraa-Hasheesh · 2025-07-11T07:05:37Z

@apangin

I have an idea about what's happening here,
I have a small reproducer for it

import java.util.ArrayList;
import java.util.List;

public class Main {

    public static void main(String[] args) {
        List<Object> list = new ArrayList<>();

        for (int i = 0; i < 10; i++) {
            System.gc();
        }

        add1(list);
        add2(list);
    }

    public static void add1(List<Object> list) {
        list.add(new byte[64 * 1024]);
    }

    public static void add2(List<Object> list) {
        list.add(new byte[64 * 1024]);
    }
}

For performance reasons, LIVE profiling avoids recording young objects when they get created as they might die in the next GC cycle,
this is done via using the _full flag which we reset on every GC call

This implementation choice creates some gaps where we have some LIVE objects that don't get reported by the profiler

For example consider the following application above

GC is called to reset the _full flag
add1 causes the first allocation to happen which reinitializes the live references in async-profiler
add2 allocations aren't added to the profiler LIVE references as _full is set
Profiler stops & only dumps partial LIVE references

apangin · 2025-07-14T15:05:07Z

Async-profiler allocation profiling uses sampling, it does not handle each individual allocation by design. The same is true for live object profiling. Default allocation sampling interval is 512k, so it's totally fine for the profiler to "miss" some 64k allocations. That's not what the test should verify. However, if allocations happen again and again, async-profiler eventually records some of them.

As to _full flag, it is set only when the whole table of 1024 sampled objects got filled - it is unlikely to be an issue in this particular test. On the contrary, the test fails if there are no live objects in the table.

Baraa-Hasheesh · 2025-07-14T15:34:58Z

Async-profiler allocation profiling uses sampling, it does not handle each individual allocation by design. The same is true for live object profiling. Default allocation sampling interval is 512k, so it's totally fine for the profiler to "miss" some 64k allocations. That's not what the test should verify. However, if allocations happen again and again, async-profiler eventually records some of them.

As to _full flag, it is set only when the whole table of 1024 sampled objects got filled - it is unlikely to be an issue in this particular test. On the contrary, the test fails if there are no live objects in the table.

Thanks for the clarification,
I didn't notice the early termination that avoids setting the _full if the table isn't full

In either case the System.gc shouldn't be needed, I will push a commit to remove it

The main problem is after a GC call it's possible that old LIVE references previously collected would have died & the profiler don't get sampling to record the new LIVE references which is fine

The static array should solve that for this as we will always have references that would never be freed

Baraa-Hasheesh · 2025-07-14T15:36:01Z

System.gc removed here a01fdc5

apangin · 2025-07-14T17:28:52Z

test/test/jfr/JfrMultiModeProfiling.java


    private static volatile Object sink;
    private static int count = 0;
+    private static final List<byte[]> rooter = new ArrayList<>();


The word "rooter" means something else. Maybe rename it to root or holder?

Fix random failures from Live profiling

9e9da0d

Baraa-Hasheesh marked this pull request as draft July 9, 2025 15:38

Baraa-Hasheesh added 2 commits July 9, 2025 15:52

guarantee live sampling

7e780ee

guarantee live sampling

cfdb2f2

Baraa-Hasheesh marked this pull request as ready for review July 9, 2025 16:26

limit memory

f2a581d

remove GC call

a01fdc5

apangin reviewed Jul 14, 2025

View reviewed changes

holder

d09b8c0

apangin merged commit 1c1a14c into async-profiler:master Jul 15, 2025
20 checks passed

Fix random failures from Live profiling #1376

Fix random failures from Live profiling #1376

Uh oh!

Conversation

Baraa-Hasheesh commented Jul 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Related issues

Motivation and context

How has this been tested?

Uh oh!

apangin commented Jul 9, 2025

Uh oh!

Baraa-Hasheesh commented Jul 9, 2025

Uh oh!

apangin commented Jul 9, 2025

Uh oh!

Baraa-Hasheesh commented Jul 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Baraa-Hasheesh commented Jul 10, 2025

Uh oh!

apangin commented Jul 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Baraa-Hasheesh commented Jul 10, 2025

Uh oh!

apangin commented Jul 10, 2025

Uh oh!

Baraa-Hasheesh commented Jul 10, 2025

Uh oh!

Baraa-Hasheesh commented Jul 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

apangin commented Jul 14, 2025

Uh oh!

Baraa-Hasheesh commented Jul 14, 2025

Uh oh!

Baraa-Hasheesh commented Jul 14, 2025

Uh oh!

apangin Jul 14, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Baraa-Hasheesh commented Jul 9, 2025 •

edited

Loading

Baraa-Hasheesh commented Jul 10, 2025 •

edited

Loading

apangin commented Jul 10, 2025 •

edited

Loading

Baraa-Hasheesh commented Jul 11, 2025 •

edited

Loading