KEMBAR78
nativemem - Native memory profiler by krk · Pull Request #1064 · async-profiler/async-profiler · GitHub
Skip to content

Conversation

krk
Copy link
Contributor

@krk krk commented Nov 26, 2024

Native memory allocations profiling and memory leak reporting, via GOT patching.

There is a new argument in the profiler nativemem, and jfrconv gains native memory tracking and memory leak reporting.

Usage

New argument --nativemem [N]/nativemem[=N] is used to enable the MallocTracer engine.
It can profile malloc, calloc, realloc and free invocations that are dynamically linked to the target.
When N is specified, only one allocation every N bytes is sampled.

Profiling with nativemem

java -agentpath:"$ASPROF_LIB/libasyncProfiler.so=start,nativemem,file=/dev/stderr" <app>

A flame graph can be generated directly as the profiler output option, which includes all sampled malloc calls. Alternatively, use jfrconv to process a jfr file and chart only the memory leaks:

Memory leak analysis via jfrconv

java -agentpath:"$ASPROF_LIB/libasyncProfiler.so=start,event=cpu,nativemem,file=malloc.jfr" <app>

jfrconv --total --nativemem --leak malloc.jfr malloc-leak.html

# without leak analysis, includes all collected samples:
jfrconv --total --nativemem malloc.jfr malloc.html

Generated flame graphs unit is in bytes if total is specified, otherwise in number of invocations.

asprof

If the profiler is attached via asprof, nativemem profiling can be started and stopped multiple times for the same process as necessary.

asprof start -e nativemem --total <app>
asprof stop <app>
...

How has this been tested?

  • New tests added that call malloc functions with and without multi-threading.

Sample outputs

All allocations

From a test application.

image

Memory leaks only

Profile of the same application, processed via jfrconv.

image


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Native memory allocations profiling and memory leak reporting, via GOT patching.
@krk krk changed the title Natmemtracker nativemem - Native memory profiler Nov 26, 2024
Copy link
Member

@apangin apangin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work!

Makefile Outdated
build-test-java-libs: $(addsuffix /.build-test,$(TESTS))
%/.build-test:
@mkdir -p $(TEST_LIB_DIR)
@if [ -f test/test/$*/Makefile ]; then \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd leave everything in one root Makefile for now to keep things simple. IMO, recursive make is an overkill for running just one extra command. We can get back to this idea when the structure becomes more branched.

- macOS x64/arm64: [async-profiler-3.0-macos.zip](https://github.com/async-profiler/async-profiler/releases/download/v3.0/async-profiler-3.0-macos.zip)
- Converters between profile formats: [converter.jar](https://github.com/async-profiler/async-profiler/releases/download/v3.0/converter.jar)
(JFR to Flame Graph, JFR to pprof, collapsed stacks to Flame Graph)
- Converters between profile formats: [converter.jar](https://github.com/async-profiler/async-profiler/releases/download/v3.0/converter.jar)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two spaces intentionally meant a line break here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to a list.

Comment on lines +46 to +47
--nativemem Generate native memory allocation profile
--leak Only include memory leaks in nativemem
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move them after --live, since --live relates to --alloc as --leak to --nativemem.

Btw, will a single option that can be applied to either alloc or nativemem be better? (just an idea, I'm not proposing this).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could have alloc=native, alloc=jvm, alloc=both options perhaps?

jfrconv --cpu foo.jfr
# which is equivalent to:
# jfrconv --cpu foo.jfr -o foo.html
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The example is wrong. Sorry I haven't noticed that before.
It should be jfrconv --cpu -o flamegraph foo.jfr foo.html

src/profiler.cpp Outdated
if (VMStructs::_get_stack_trace != NULL) {
// Object allocation in HotSpot happens at known places where it is safe to call JVM TI,
// but not directly, since the thread is in_vm rather than in_native
num_frames += getJavaTraceInternal(jvmti_frames + num_frames, frames + num_frames, _max_stack_depth);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getJavaTraceInternal must not be used for nativemem, since it can allocate itself.

More generally, getJavaTraceInternal must not be used ever. I have recently removed it completely.

return false;
}

CodeCache* lib = Profiler::instance()->findLibraryByName("libasyncProfiler");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not necessarily called like this. A library can have an arbitrary name, especially when extracted as a temporary file. Use findLibraryByAddress instead.


lib->mark(
[](const char* s) -> bool {
return strncmp(s, "_ZL11malloc_hook", 16) == 0 || strncmp(s, "_ZL11calloc_hook", 16) == 0 || strncmp(s, "_ZL12realloc_hook", 16) == 0 || strncmp(s, " _ZL9free_hook", 13) == 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Break expression into several lines for better readability.

return false;
}

MutexLocker ml(_patch_lock);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory, this may conflict with Hooks::patchLibraries() which has its own lock.
It makes sense to unify these two functions (also, to reduce copy-paste).
Can be done in a follow-up PR.

event._address = (uintptr_t)address;
event._size = 0;

Profiler::instance()->recordSample(NULL, 0, MALLOC_SAMPLE, &event);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

free calls are not sampled according to _interval.
It may be too expensive to record stack traces for each free. I therefore propose to record free events without a stack trace.

}

Error MallocTracer::check(Arguments& args) {
#ifndef __linux__
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try to avoid ifdefs in shared code when possible.
OS::isLinux() can help here (should be optimized by the compiler).

System.loadLibrary("doesmalloc");
}

public static native long nativeMalloc(int size);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To reduce code duplication, malloc wrappers can be moved to a separate Java class.

src/profiler.cpp Outdated
char mark;
if (current_method_name != NULL && (mark = NativeFunc::mark(current_method_name)) != 0) {
if (mark == MARK_VM_RUNTIME && event_type >= ALLOC_SAMPLE) {
if ((mark == MARK_VM_RUNTIME || mark == MARK_ASYNC_PROFILER) && event_type >= MALLOC_SAMPLE) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For malloc events, it will be better to leave the first native frame to make it clear what function exactly was called (malloc/calloc/realloc).

long startTicks = args.from != 0 ? toTicks(args.from) : Long.MIN_VALUE;
long endTicks = args.to != 0 ? toTicks(args.to) : Long.MAX_VALUE;
boolean allowZeroes = !args.nativemem;
IEventReader reader = eventFilter != null ? eventFilter : jfr;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to revise this approach - it will not work for multi-chunk JFRs.
An address map should persist between chunks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IEventReader instance persists between collectEvents calls, as it is created in the JfrConverter ctor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went with a different design around the EventAggregator class.

@krk
Copy link
Contributor Author

krk commented Nov 28, 2024

Addressed comments that have reactions or replies.

Makefile Outdated

build-test-java-libs:
@mkdir -p $(TEST_LIB_DIR)
$(CC) -shared -fPIC $(INCLUDES) -I$(realpath src) -o $(TEST_LIB_DIR)/libdoesmalloc.$(SOEXT) test/test/nativemem/doesmalloc.c
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why realpath? Other targets are compiled fine without relative paths.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was useful in the multi-makefile version, removing.

Makefile Outdated
JAVA_TARGET=8
JAVAC_OPTIONS=--release $(JAVA_TARGET) -Xlint:-options

TEST_LIB_DIR=$(abspath build/test/lib)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't it work with relative path?

Makefile Outdated
PROFILER_VERSION := $(PROFILER_VERSION)-$(COMMIT_TAG)
endif

SOEXT ?= so
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SOEXT is already defined appropriately depending on ifeq ($(OS),Darwin)

Makefile Outdated
echo "Running tests against $(LIB_PROFILER)"
$(JAVA) $(TEST_FLAGS) -ea -cp "build/test.jar:build/jar/*:build/lib/*" one.profiler.test.Runner $(TESTS)

LD_LIBRARY_PATH="$(TEST_LIB_DIR)" $(JAVA) $(TEST_FLAGS) -ea -cp "build/test.jar:build/jar/*:build/lib/*" one.profiler.test.Runner $(TESTS)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LD_LIBRARY_PATH is Linux-specific. Use -Djava.library.path instead.

}

JNIEXPORT void JNICALL Java_test_nativemem_Native_free(JNIEnv* env, jclass clazz, jlong addr) {
free((void*)addr);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this compile on a 32-bit system? I guess an intermediate conversion to intptr_t is needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have a more realistic flamegraph?
I used to show native memory leaks on this example (strictly speaking, it's not fully a leak, since finalizers are supposed to free native resources, but in practice, finalizers can be very unpredictable).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed with a "ByteBuffer leak while reading a stream" scenario.


## Native memory leaks

It is possible to profile `malloc` and `mmap` calls in Java context before, not always helpful. A large amount of allocations does not yet mean a leak, in case all the allocated memory is released on time.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something wrong with the first sentence, can you rephrase it?

```
asprof start --nativemem[=N] <YourApp>
asprof stop -f <app.jfr> <YourApp>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With JFR output, filename must be given in start command.


public class Native {
static {
System.loadLibrary("doesmalloc");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unobvious name. Maybe jnimalloc?

}

#ifdef __linux__
extern "C" WEAK DLLEXPORT void* malloc(size_t size) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does nativemem profiler work with non-Java apps if async-profiler is LD_PRELOAD'ed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, tested manually.

@MattAlp
Copy link

MattAlp commented Dec 7, 2024

Very nice, thanks for contributing this

im_pthread_exit,
im_pthread_setspecific,
im_poll,
im_malloc,
Copy link

@r1viollet r1viollet Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case it is helpful, here are some of the functions we had to hook: https://github.com/DataDog/ddprof/blob/bccb05d89ec9e8473cd7f4f26d4e4948a70839c8/src/lib/symbol_overrides.cc#L852-L899
Though I realize you might not need to have a full coverage for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the link. We could implement these in another iteration or prioritize based on customer requests.

}
}

void MallocTracer::recordFree(void* address) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I correct in saying we do not have a sampling decision here (as we do not want to miss matching frees) ?
Here we proposed to record only free events for tracked addresses. However that meant having a lock free data structure containing tracked addresses.
The ideal solution might be to have more control over the allocator being used, to embed such a logic within the allocated object.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we do not have a sampling decision for free and wanted to avoid having to keep a list of allocation addresses.

void* result = dlopen(filename, flags);
if (result != NULL) {
instance()->updateSymbols(false);
MallocTracer::installHooks();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While relying on a similar strategy, I recall we were not able to catch all loaded libraries. We eventually decided to put a timer that would trigger and recheck loaded libraries every X seconds (in addition to the dlopen hook). If this is of interest, I can find more details.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be interesting to understand when this can happen.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found the commit thanks to @nsavoire. A private symbol can be used to dlopen __libc_dlopen_mode.
DataDog/ddprof@d0bb6c6

@r1viollet
Copy link

Exciting to see memory profiling as part of async profiler! 👏 I mentioned a few of the gotchas we encountered while implementing something similar in case this can be of interest.


```
asprof start -e nativemem -f app.jfr <YourApp>
# asprof start --nativemem=N -f app.jfr <YourApp>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--nativemem N (no =)

public abstract class JfrConverter extends Classifier {
protected final JfrReader jfr;
protected final Arguments args;
private final IEventAggregator leakAggregator;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like a leaky abstraction (pun intended): we pretend it to be some abstract aggregator, although we know it can be only MallocLeakAggregator.

leakAggregator.coarsen(args.grain);
}

convertChunk(leakAggregator);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An illustration to the above comment: we have a special case for leakAggregator, but trying to deal with it as with some generic aggregator.

Can we design it in a way that we either have a common sequence for any aggregator or don't pretend that leak processor is an aggregator at all and just make leak processor explicitly a different thing?

For example, to consolidate EventAggregator (currently created for each new chunk) and LeakAggegator (a single instance for the whole process) we may probably replace creation of a new EventAggregator with clearing/resetting the same instance of EventAggregator. (Just as an idea; I'm open for other options too).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the leak and special-cased invocations.

}

public void collect(Event e, long samples, long value) {
protected void collect(Event e, long samples, long value) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a public API that is currently used in other projects.

import java.util.ArrayList;
import java.util.Collections;

public abstract class SortedEventAggregator extends EventAggregator {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant abstractions, IMO.
There are currently only two aggregators. If we go this way, we can just make MallocLeakAggregator a subclass of EventAggregator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Folding SortedEventAggregator into MallocLeakAggregator.


__attribute__((constructor)) static void getOrigAddresses() {
// Set malloc and calloc which may be called from libc during getOrigAddresses.
_orig_malloc = OS::safeAlloc;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is it really necessary? I remember you were saying allocation in dlsym only happens on error path, and returning NULL does not break anything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a specific libc version yes. Removing to see test results.

#define __ASSERT_OR_CHECK_OP(isAssert, val1, op, val2) \
{ \
_Pragma("GCC diagnostic push"); \
_Pragma("GCC diagnostic ignored \"-Waddress\""); \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is getting more and more intricate.
Would not it be better to have a special version of macros for strings?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could split it at some point.

this.jfr = jfr;
this.args = args;

IEventAggregator agg = new EventAggregator(args.threads, args.total, args.lock ? 1e9 / jfr.ticksPerSec : 1.0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ticksPerSec is only available after reading chunk header. In theory, it may be different for different chunks.
Actually, this value is required only during post-processing (in forEach), so it probably makes sense to move it out of constructor.

args.live ? LiveObject.class :
args.alloc ? AllocationSample.class :
args.lock ? ContendedLock.class : ExecutionSample.class;
protected IEventAggregator collectEvents() throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, collectEvents created and owned an aggregator. Now it always uses a singleton created in a constructor. There is no sense in returning an aggregator then?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, convertChunk does not need an acceptor argument.

protected IEventAggregator collectEvents() throws IOException {
Class<? extends Event> eventClass = args.live ? LiveObject.class
: args.alloc ? AllocationSample.class
: args.lock ? ContendedLock.class : args.nativemem ? MallocEvent.class : ExecutionSample.class;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, previous formatting was more readable.


// millis can be an absolute timestamp or an offset from the beginning/end of the recording
// millis can be an absolute timestamp or an offset from the beginning/end of
// the recording
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like an automatic reformatting. In general, prefer to avoid reformatting code unrelated to the PR.

}

public void resetChunk() {
this.fraction = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fraction is relevant only for coarsen method - I'd reset it in the beginning of coarsen instead.


package one.jfr.event;

public interface IEventAcceptor {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand why you added this, but imo this is unnecessary level of abstraction.
Fortunately, as per my other comment, the only use of IEventAcceptor (and therefore IEventAcceptor itself) can be safely removed.


#ifdef __linux__
extern "C" WEAK DLLEXPORT void* malloc(size_t size) {
if (unlikely(!_orig_malloc)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check MallocTracer::initialized() first? (to make less comparisons on the fast path)

src/profiler.h Outdated
Comment on lines 29 to 35
# ifndef likely
# define likely(x) (__builtin_expect(!!(x), 1))
# endif

# ifndef unlikely
# define unlikely(x) (__builtin_expect(!!(x), 0))
# endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

profiler.h is probably not the right place for these macros, they are only used in mallocTracer.cpp, you can put them there. Or, to make them available for a wider use, arch.h will fit better.

}
}

return new ArrayList<>(addresses.values());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The list of "filtered" events is not actually used, is it? Therefore, creating ArrayList seems redundant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is used in multi-chunk jfrs, filtered events will be sorted together with new events, before filtering again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But why? Timestamps of malloc events from different chunks should not overlap. Do they?

Comment on lines 80 to 84
eventAggregator.finishChunk();

if (args.grain > 0) {
agg.coarsen(args.grain);
eventAggregator.coarsen(args.grain);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

finishChunk and coarsen can be moved out of collectEvents.

Comment on lines 47 to 48
convertChunk(eventAggregator);
eventAggregator.resetChunk();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the design perspective, it looks strange that convertChunk/resetChunk are called again after finish.
I understand they are needed only for MallocLeakAggregator to process the final filtered list, but I think this should be somehow hidden in finish.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving them to their own method, to make the intent clearer.

krk and others added 2 commits December 13, 2024 20:06
Signed-off-by: Andrei Pangin <1749416+apangin@users.noreply.github.com>
@apangin apangin merged commit 78f78cf into async-profiler:master Dec 16, 2024
10 checks passed
@apangin
Copy link
Member

apangin commented Dec 16, 2024

I've pushed a fix for recording timestamps correctly. Other than that, PR looks good. There is still one question about filtering events after each chunk, but it's not a blocker right now. Merged. Thanks a lot!

@linking12
Copy link

@krk @apangin We found this running under jemallo will cause the process to crash,Do we have a fixed version?

@apangin
Copy link
Member

apangin commented May 21, 2025

@linking12 Yes, please try a nightly build.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants