KEMBAR78
Fix random failures from Live profiling by Baraa-Hasheesh · Pull Request #1376 · async-profiler/async-profiler · GitHub
Skip to content

Conversation

Baraa-Hasheesh
Copy link
Contributor

@Baraa-Hasheesh Baraa-Hasheesh commented Jul 9, 2025

Description

Fixes random test failures that happens for live profiling

https://github.com/async-profiler/async-profiler/actions/runs/16172680977/job/45650081518

Related issues

N/A

Motivation and context

solve flakey test

How has this been tested?

make test


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@Baraa-Hasheesh Baraa-Hasheesh marked this pull request as draft July 9, 2025 15:38
@Baraa-Hasheesh Baraa-Hasheesh marked this pull request as ready for review July 9, 2025 16:26
@apangin
Copy link
Member

apangin commented Jul 9, 2025

Could you explain why the test was failing and what is the idea of the fix?

@Baraa-Hasheesh
Copy link
Contributor Author

Basically from what I could tell in some cases the VM dies without any proper LIVE sampling by the profiler

The idea is to keep some allocated memory that is never cleared to garantee live reference

Also we call System.gc to update the live reference list in the profiler

@apangin
Copy link
Member

apangin commented Jul 9, 2025

The idea is to keep some allocated memory that is never cleared

OK, sounds good. I'd only make the amount of allocated memory limited, to ensure the app never throws OutOfMemoryError regardless of heap size and running duration.

Also we call System.gc to update the live reference list in the profiler

In what sense "update"? Why is it needed?

@Baraa-Hasheesh
Copy link
Contributor Author

Baraa-Hasheesh commented Jul 10, 2025

OK, sounds good. I'd only make the amount of allocated memory limited, to ensure the app never throws OutOfMemoryError regardless of heap size and running duration.

sure, I will limit the number of saved allocations

In what sense "update"? Why is it needed?

GC would cause the following method to be called

void gc() {
  _full = false;
}

This in turn will cause the next void add(JNIEnv* jni, jobject object, jlong size, u64 trace) to update the _refs variable in our async-profiler

information stored in _refs is what's dumped in LIVE profiling

@Baraa-Hasheesh
Copy link
Contributor Author

@apangin memory limited here f2a581d

@apangin
Copy link
Member

apangin commented Jul 10, 2025

So what happens if System.gc is not called?
The test verifies that live option works. But if it does not work without System.gc call, it is a profiler bug, not a test bug. Nobody normally calls System.gc in their applications.

@Baraa-Hasheesh
Copy link
Contributor Author

So what happens if System.gc is not called? The test verifies that live option works. But if it does not work without System.gc call, it is a profiler bug, not a test bug. Nobody normally calls System.gc in their applications.

In normal applications GC would be called naturally (we want a GC to happen as to update live reference list in profiler)

for this short lived test we want to give the best chances to get the latest data so I manually call System.gc

@apangin
Copy link
Member

apangin commented Jul 10, 2025

This test allocates 64KB arrays repeatedly in a loop, this is supposed to cause "natural" GC.
Again, if profiler does not detect a leak in this test, this is a strong sign of live mode working poorly.

@Baraa-Hasheesh
Copy link
Contributor Author

This test allocates 64KB arrays repeatedly in a loop, this is supposed to cause "natural" GC. Again, if profiler does not detect a leak in this test, this is a strong sign of live mode working poorly.

That's a good point,
But to clarify LIVE is meant from my understanding to sample data that survives the GC right?

What's confirmed that in some cases we never see LIVE events which could mean that one of the following things happen

  • GC wasn't called / or was called at a bad time relative to profiler with the test
  • In some cases no data/content survived the GC

I will need to check which is which on this case

@Baraa-Hasheesh
Copy link
Contributor Author

Baraa-Hasheesh commented Jul 11, 2025

@apangin

I have an idea about what's happening here,
I have a small reproducer for it

import java.util.ArrayList;
import java.util.List;

public class Main {

    public static void main(String[] args) {
        List<Object> list = new ArrayList<>();

        for (int i = 0; i < 10; i++) {
            System.gc();
        }

        add1(list);
        add2(list);
    }

    public static void add1(List<Object> list) {
        list.add(new byte[64 * 1024]);
    }

    public static void add2(List<Object> list) {
        list.add(new byte[64 * 1024]);
    }
}

For performance reasons, LIVE profiling avoids recording young objects when they get created as they might die in the next GC cycle,
this is done via using the _full flag which we reset on every GC call

This implementation choice creates some gaps where we have some LIVE objects that don't get reported by the profiler

For example consider the following application above

  • GC is called to reset the _full flag
  • add1 causes the first allocation to happen which reinitializes the live references in async-profiler
  • add2 allocations aren't added to the profiler LIVE references as _full is set
  • Profiler stops & only dumps partial LIVE references
image

@apangin
Copy link
Member

apangin commented Jul 14, 2025

Async-profiler allocation profiling uses sampling, it does not handle each individual allocation by design. The same is true for live object profiling. Default allocation sampling interval is 512k, so it's totally fine for the profiler to "miss" some 64k allocations. That's not what the test should verify. However, if allocations happen again and again, async-profiler eventually records some of them.

As to _full flag, it is set only when the whole table of 1024 sampled objects got filled - it is unlikely to be an issue in this particular test. On the contrary, the test fails if there are no live objects in the table.

@Baraa-Hasheesh
Copy link
Contributor Author

Async-profiler allocation profiling uses sampling, it does not handle each individual allocation by design. The same is true for live object profiling. Default allocation sampling interval is 512k, so it's totally fine for the profiler to "miss" some 64k allocations. That's not what the test should verify. However, if allocations happen again and again, async-profiler eventually records some of them.

As to _full flag, it is set only when the whole table of 1024 sampled objects got filled - it is unlikely to be an issue in this particular test. On the contrary, the test fails if there are no live objects in the table.

Thanks for the clarification,
I didn't notice the early termination that avoids setting the _full if the table isn't full

In either case the System.gc shouldn't be needed, I will push a commit to remove it

The main problem is after a GC call it's possible that old LIVE references previously collected would have died & the profiler don't get sampling to record the new LIVE references which is fine

The static array should solve that for this as we will always have references that would never be freed

@Baraa-Hasheesh
Copy link
Contributor Author

System.gc removed here a01fdc5


private static volatile Object sink;
private static int count = 0;
private static final List<byte[]> rooter = new ArrayList<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The word "rooter" means something else. Maybe rename it to root or holder?

@apangin apangin merged commit 1c1a14c into async-profiler:master Jul 15, 2025
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants