KEMBAR78
Add cpu sampling by JugadK · Pull Request #1286 · async-profiler/async-profiler · GitHub
Skip to content

Conversation

JugadK
Copy link
Contributor

@JugadK JugadK commented May 8, 2025

Description

Adds a "cpu" virtual frame at the bottom of each stack when one passes in --record-cpu as a flag

Related issues

#1240

Motivation and context

Discussed in the issue, but I want to be able to see which cpu a sample was taken on, this is done through enabling PERF_EVENT_CPU for the perf profile when --record-cpu is passed as a flag. This is read from the ring buffer into StackContext which is made into a virtual frame at the bottom of the stack (Similar to --sched).

How has this been tested?


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

case BCI_CPU: {
int cpu = (int)(uintptr_t)frame.method_id;
char buf[32];
snprintf(buf, sizeof(buf), "%d", cpu-1);
Copy link
Contributor Author

@JugadK JugadK May 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an alternative idea I had was to make the line in the beginning of this function (Instead of having the +1 offset)

if (frame.method_id == NULL) {

be

if(frrame.bci != BCI_CPU && frame.method_id == NULL)

so it will skip this if method_id is zero, not familiar with the codebase though so not sure if that would have a bad side effect?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd set some bit instead - to make sure method_id is never NULL and to distinguish it from BCI_THREAD, which uses the same encoding.

Copy link
Contributor Author

@JugadK JugadK May 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like just set the left most bit then unset it when we are reading it? that works

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm what if I shift left by one and set, then shift right when reading? don't have to make assumptions on the size of a pointer on the system

@JugadK JugadK marked this pull request as ready for review May 9, 2025 20:46
CASE("inverted")
_inverted = true;

CASE("record-cpu")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a section for FlameGraph options as the comment above states.
I'd put new option next to sched, which is kind of similar behavior-wise.

case BCI_CPU: {
int cpu = (int)(uintptr_t)frame.method_id;
char buf[32];
snprintf(buf, sizeof(buf), "%d", cpu-1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd set some bit instead - to make sure method_id is never NULL and to distinguish it from BCI_THREAD, which uses the same encoding.

int cpu = (int)(uintptr_t)frame.method_id;
char buf[32];
snprintf(buf, sizeof(buf), "%d", cpu-1);
return _str.assign("[cpu").append(buf).append("]").c_str();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why building string part-by-part, if you can format the whole thing in the above snprintf?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

originally I was using std::to_string() and appending, will change it to snprintf

#ifdef PERF_ATTR_SIZE_VER5
if (_cstack == CSTACK_LBR) {
attr.sample_type |= PERF_SAMPLE_BRANCH_STACK | PERF_SAMPLE_REGS_USER;
attr.sample_type |= PERF_SAMPLE_BRANCH_STACK | PERF_SAMPLE_REGS_USER | PERF_SAMPLE_CPU;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you checked this actually works? It seems that the field is always added, but the reader only parses it when _record_cpu is set.

Copy link
Contributor Author

@JugadK JugadK May 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah this is from when i was hacking it together not sure why I left it in (it should only be enabled in the if statement block below

src/profiler.cpp Outdated
Comment on lines 702 to 704
// +1 is to prevent the method_id from being read as a null pointer
// when cpu = 0
num_frames += makeFrame(frames + num_frames, BCI_CPU, java_ctx.cpu + 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix indentation.

const void* pc;
uintptr_t sp;
uintptr_t fp;
u64 cpu;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you change it to 64-bit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ring.next() returns u64 so i thought it was nicer to just keep it as the type that was originally returned. Though I don't think it makes much of a difference

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, thanks for the explanation

this->cpu = 0;
}

void set(const void* pc, uintptr_t sp, uintptr_t fp, u64 cpu) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this function used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not, since it seems like we have a function to set all fields in that class I thought it would make sense to add another function with the option of passing CPU for completion sake/future usage. Can remove if you feel its not needed

Comment on lines 893 to 894
u64 cpu = ring.next();
java_ctx->cpu = cpu;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use 4 spaces indent.
IMO, local variable is redundant here, just set java_ctx->cpu = ring.next();

src/profiler.cpp Outdated
num_frames += makeFrame(frames + num_frames, BCI_ERROR, OS::schedPolicy(0));
}
if (_add_cpu_frame) {
num_frames += makeFrame(frames + num_frames, BCI_CPU, (java_ctx.cpu << 1) | 0x1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought of setting a higher bit instead. In this case, there will be no need for shift, and frame values for BCI_CPU will not clash with values for BCI_THREAD.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I liked this approach as it prevents as errors from occurring from mistakes in casting, changes in intermediary datatypes and different pointer/int sizes on various platforms (i.e one of them being smaller than whatever nth bit we set) in practice just setting the 16th bit would probably have the same effect so I can do that

Im curious what you mean by "clashing". they are differentiated by the bci enum, but is there some problem with them being the same value in the method_id pointer? maybe I'm missing something.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same value for method_id can be a problem as long as MethodMap uses jmethodID as a key.

_add_event_frame = args._output != OUTPUT_JFR;
_add_thread_frame = args._threads && args._output != OUTPUT_JFR;
_add_sched_frame = args._sched;
_add_cpu_frame = args._record_cpu;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since record_cpu is only supported with PerfEvents, do you want to return an error when record_cpu is requested for any other engine?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that sounds like a good idea

this->pc = pc;
this->sp = sp;
this->fp = fp;
this->cpu = 0;
Copy link
Member

@apangin apangin May 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to overwrite cpu value set earlier. Did you test different stack walking modes (fp, dwarf, vm, default)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm yeah default,vm,vmx work but fp, dwarf do not, think i'll just remove setting it as 0

// filter=FILTER - thread filter
// threads - profile different threads separately
// sched - group threads by scheduling policy
// record-cpu - record which cpu a sample was taken on (default: false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind updating ProfilingOptions.md doc as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Can you also add a sanity test for --record-cpu? Ideally, all new features should be covered by at least one test. Alternatively, you may modify one of existing --target-cpu tests by adding --record-cpu to it.

jkhajuria and others added 3 commits May 12, 2025 17:09
public void perfEventsRecordCpuEventsCount(TestProcess p) throws Exception {

Output outRightCpu = p.profile("-d 2 -e cpu-clock -i 100ms --total -o collapsed --record-cpu");
assertCloseTo(outRightCpu.total(), 2_000_000_000, "perf_events total should match profiling duration");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not actually test the feature.
I suggest you pin the process to CPU 1 and check that output contains [CPU-1], but not [CPU-0].

Copy link
Contributor Author

@JugadK JugadK May 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah thought you mean't by "sanity test" just something that sees if it does not crash, will do

jkhajuria and others added 3 commits May 12, 2025 19:18
Signed-off-by: Andrei Pangin <1749416+apangin@users.noreply.github.com>
@apangin apangin merged commit 7c3aa59 into async-profiler:master May 13, 2025
17 checks passed
@apangin
Copy link
Member

apangin commented May 13, 2025

All good now. Thank you for the contribution!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants