Add support for system wide process sampling for linux by rk-kmr · Pull Request #1411 · async-profiler/async-profiler · GitHub

Conversation

@rk-kmr
Contributor

@rk-kmr rk-kmr commented Jul 25, 2025

Description

This PR adds support for system-wide process sampling on Linux. The sampling implementation scans the /proc directory to determine the current set of processes, then reads several proc files[1] per PID to collect stats (a minimal sketch follows the file list below). The feature is exposed through a new proc command line argument and is currently supported only for the JFR output format.

[1] Proc files read per PID:

/proc/{pid}/cmdline
/proc/{pid}/stat
/proc/{pid}/status
/proc/{pid}/io
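
For illustration, here is a minimal sketch of how such a sampler might enumerate PIDs (numeric directory names under /proc) and read a per-PID proc file. This is not the PR's actual implementation; listPids and readProcFile are hypothetical helpers. The option itself appears later in this review as proc=30, e.g. start,cpu,proc=30.

```
// Hypothetical sketch of system-wide process discovery via /proc;
// not the PR's actual implementation.
#include <dirent.h>

#include <cctype>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

// Collect all numeric entries in /proc, i.e. the current set of PIDs.
static std::vector<int> listPids() {
    std::vector<int> pids;
    DIR* dir = opendir("/proc");
    if (dir == NULL) return pids;
    struct dirent* entry;
    while ((entry = readdir(dir)) != NULL) {
        if (isdigit((unsigned char)entry->d_name[0])) {
            pids.push_back(atoi(entry->d_name));
        }
    }
    closedir(dir);
    return pids;
}

// Read one small proc file (e.g. /proc/<pid>/stat) into a string.
static bool readProcFile(int pid, const char* name, std::string& out) {
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/%s", pid, name);
    FILE* f = fopen(path, "r");
    if (f == NULL) return false;  // the process may have exited; skip it
    char buf[4096];
    size_t n = fread(buf, 1, sizeof(buf), f);
    fclose(f);
    out.assign(buf, n);
    return true;
}
```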

Related issues

N/A

Motivation and context

Async-profiler traditionally focuses on profiling individual Java applications, but understanding application performance often requires broader system context. This feature enables capturing system-wide process stats alongside application profiling, allowing users to correlate application bottlenecks with resource contention from other processes, and provides comprehensive system visibility within the familiar JFR format.

How has this been tested?

I have added tests and manually tested the different test cases under varying system load conditions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@rk-kmr rk-kmr changed the title [Linux] Add process sampling and JFR capture for /proc filesystem data Add support for system process sampling for linux Jul 25, 2025
@rk-kmr rk-kmr changed the title Add support for system process sampling for linux Add support for system wide process sampling in linux Jul 25, 2025
@rk-kmr rk-kmr changed the title Add support for system wide process sampling in linux Add support for system wide process sampling for linux Jul 25, 2025

@github-actions github-actions bot left a comment


⚠️ Clang-Tidy found issue(s) with the introduced code (1/1)

@fandreuz
Contributor

Hi @rk-kmr, I noticed some of the clang-tidy checks we recently introduced in our codebase are a bit too harsh; I removed two of them in #1419.

Rohitash Kumar and others added 2 commits July 31, 2025 17:26
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Rohitash Kumar <1857610+rk-kmr@users.noreply.github.com>

@github-actions github-actions bot left a comment


⚠️ Clang-Tidy found issue(s) with the introduced code (1/1)

Rohitash Kumar added 5 commits August 20, 2025 11:42
On Alpine, `=` characters are not present in the JFR output:
```
  cmdLine = "dd if /dev/zero of /dev/null bs 1M status none"
```
vs others:
```
  cmdLine = "dd if=/dev/zero of=/dev/null bs=1M status=none"
```
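
For context, /proc/<pid>/cmdline stores the argument vector with NUL separators, so a reader typically replaces embedded NUL bytes with spaces before emitting the string; the difference above presumably reflects how Alpine's (busybox) dd presents its own argv. A minimal, hypothetical rendering helper:

```
#include <string>

// Hypothetical helper: render raw /proc/<pid>/cmdline contents
// (NUL-separated argv) as a single space-separated string.
static std::string renderCmdline(std::string raw) {
    // Drop a trailing NUL, if present.
    if (!raw.empty() && raw[raw.size() - 1] == '\0') {
        raw.resize(raw.size() - 1);
    }
    for (size_t i = 0; i < raw.size(); i++) {
        if (raw[i] == '\0') raw[i] = ' ';
    }
    return raw;
}
```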
```
void enable(const long sampling_interval) {
    _sampling_interval = sampling_interval;
    u64 expected_next = _last_sample_time + sampling_interval;
    if (_next_sample_time != expected_next) {
```
Contributor


This check can be simplified to

```
if (_sampling_interval != sampling_interval) {
    _sampling_interval = sampling_interval;
    _last_sample_time = 0;
    _process_history.clear();
}
```

Please note that the current logic doesn't really protect against stale data:

  • start,cpu,proc=30
  • Profile for 2 minutes
  • stop
  • No profiling for 5 minutes
  • start,cpu,proc=30

The _process_history won't be cleared, even though by all accounts it should be, since it contains stale data.
I would recommend changing the condition to something along the following lines

```
if (_sampling_interval != sampling_interval || (_next_sample_time + sampling_interval) < wall_time) {
    _sampling_interval = sampling_interval;
    _last_sample_time = 0;
    _process_history.clear();
}
```
  • _sampling_interval != sampling_interval checks for a change in the interval
  • (_next_sample_time + sampling_interval) < wall_time checks if the profiler was turned off for an extended duration

Note: In the check above I'm checking for exactly one missed sample as an example, but one missed sample might be fine; I would recommend choosing a proper safety margin at which we should consider the data in our _process_history stale and clear it (a sketch of such a guard follows).
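
For illustration, a hedged sketch of such a guard. The kMaxMissedSamples margin and the use of OS::nanotime() as the clock source are assumptions, not the PR's actual code:

```
// Hypothetical staleness guard: clear history when the interval changes
// or when more than kMaxMissedSamples intervals have elapsed since the
// last sample. The margin and clock source are assumptions.
static const int kMaxMissedSamples = 3;

void enable(const long sampling_interval) {
    u64 now = OS::nanotime();  // assumed clock helper, same units as _last_sample_time
    u64 stale_after = _last_sample_time + (u64)(kMaxMissedSamples + 1) * sampling_interval;
    if (_sampling_interval != sampling_interval || now > stale_after) {
        _sampling_interval = sampling_interval;
        _last_sample_time = 0;
        _process_history.clear();
    }
}
```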

Member


This condition looks suspicious indeed, since _next_sample_time is always _last_sample_time + _sampling_interval.

Contributor


@apangin what are your thoughts about the stale data issue?

As described in #1411 (comment), the loop option previously modified the sampling frequency indirectly; this was fixed in recent changes.

The fix basically converted the engine to a static one, with static data that is preserved between profiling sessions in order to bypass the loop restart.

If the user does ad-hoc profiling via asprof, it's quite reasonable to assume there might be a long lag between profiling sessions.

So what should be an acceptable threshold for keeping historical data from a previous profiling session?


@apangin apangin merged commit b30f5f1 into async-profiler:master Sep 6, 2025
20 checks passed
