-
Notifications
You must be signed in to change notification settings - Fork 937
Add support for system wide process sampling for linux #1411
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clang-Tidy found issue(s) with the introduced code (1/1)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Signed-off-by: Rohitash Kumar <1857610+rk-kmr@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clang-Tidy found issue(s) with the introduced code (1/1)
In alipne `=` are not present in jfr output. ``` cmdLine = "dd if /dev/zero of /dev/null bs 1M status none" ``` vs others ``` cmdLine = "dd if=/dev/zero of=/dev/null bs=1M status=none" ```
src/flightRecorder.cpp
Outdated
| void enable(const long sampling_interval) { | ||
| _sampling_interval = sampling_interval; | ||
| u64 expected_next = _last_sample_time + sampling_interval; | ||
| if (_next_sample_time != expected_next) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check can be simplified to
if (_sampling_interval != sampling_interval) {
_sampling_interval = sampling_interval;
_last_sample_time = 0;
_process_history.clear();
}
Please note that the current logic doesn't really protect against stale data,
start,cpu,proc=30- Profile for 2 minutes
stop- No profiles for for 5 minutes
start,cpu,proc=30
The _process_history won't be cleared even by all accounts it should be cleared as it contains stale data
I would recommend changing the condition to something along the following lines
if (_sampling_interval != sampling_interval || (_next_sample_time + sampling_interval) < wall_time) {
_sampling_interval = sampling_interval;
_last_sample_time = 0;
_process_history.clear();
}
_sampling_interval != sampling_intervalchecks for a change in the interval(_next_sample_time + sampling_interval) < wall_timechecks if the profiler was turned off for an extended duration
Note: In the check above I'm checking for exactly one missing sample as an example, but 1 missing sample might be fine, I would recommend checking a proper safety margin at which we should consider the data in our _process_history stale & clear it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This condition looks suspicious indeed, since _next_sample_time is always _last_sample_time + _sampling_interval.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@apangin what are your thoughts about the stale data issue?
As described here #1411 (comment)
previously the loop option was indirectly modifying the sampling frequency, this was fixed in recent changes,
The fix basically converted to the engine to a static one with static data which are preserved between profiling sessions as to bypass the loop restart.
If the user uses adhoc profiling via asprof, it's quite reasonable to assume the profiler might have a long lag between profiling session.
So what should be an acceptable threshold to keep using the old historical data from previous profiling session?
src/flightRecorder.cpp
Outdated
| void enable(const long sampling_interval) { | ||
| _sampling_interval = sampling_interval; | ||
| u64 expected_next = _last_sample_time + sampling_interval; | ||
| if (_next_sample_time != expected_next) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This condition looks suspicious indeed, since _next_sample_time is always _last_sample_time + _sampling_interval.
Signed-off-by: Andrei Pangin <1749416+apangin@users.noreply.github.com>
Description
This PR adds support for system wide process sampling for linux. Sampling implementations scans /proc directory to determine the set of processes and that reads various proc files[1] per pid to collect stats. We expose this feature by adding a
proccommand line argument. And this feature is supported for only JFR output format.[1] Proc files read.
Related issues
N/A
Motivation and context
Async-profiler traditionally focuses on profiling individual Java applications, but understanding application performance often requires broader system context. This enables capturing system-wide process stats alongside application profiling, allowing users to correlate application bottlenecks with system resource contention from other processes.
comprehensive system visibility within the familiar JFR format.
How has this been tested?
I have added tests and manually tested the different test cases under varying system load conditions.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.