KEMBAR78
Wall interference by Baraa-Hasheesh · Pull Request #1417 · async-profiler/async-profiler · GitHub
Skip to content

Conversation

Baraa-Hasheesh
Copy link
Contributor

@Baraa-Hasheesh Baraa-Hasheesh commented Jul 28, 2025

Description

If an external actor interrupts the sleep inside the wall engine that would result in incorrect sampling frequency,
This can be easily observed by starting two different profilers in wall mode or simply manually signaling the process to interrupt the sleep

This is a simple change to make the sleep uninterruptible,
I've avoided declaring this as a common function to allow interruption when the profiler is stopped when stop is called (need to check _running)

Related issues

N/A

Motivation and context

avoid changing the sampling frequency when an external actor interrupts the profiler sleep

How has this been tested?

manual testing


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@Baraa-Hasheesh Baraa-Hasheesh marked this pull request as draft July 28, 2025 11:54
@Baraa-Hasheesh Baraa-Hasheesh marked this pull request as ready for review July 28, 2025 14:48

void WallClock::uninterruptibleSleep(u64 nanos) {
struct timespec ts = {(time_t)(nanos / 1000000000), (long)(nanos % 1000000000)};
while (nanosleep(&ts, &ts) < 0 && errno == EINTR && _running);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you read BUGS section of the nanosleep man page? I think it describes exactly this case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to confirm you mean this bug right?

If a program that catches signals and uses nanosleep() receives
       signals at a very high rate, then scheduling delays and rounding
       errors in the kernel's calculation of the sleep interval and the
       returned remain value mean that the remain value may steadily
       increase on successive restarts of the nanosleep() call.  To avoid
       such problems, use clock_nanosleep(2) with the TIMER_ABSTIME flag
       to sleep to an absolute deadline.

-> As I don't believe we should care about the Linux 2.4 problem

Regarding the "very high rate" it sounds pretty vague & I don't know if async-profiler maximum rate is considered "very high" (async-profiler has a minimum sleep that basically caps the max rate)

But it does make sense to change the logic for external actors other than the profiler

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I also don't know how high the rate should be to see the effect of the bug - maybe it's not an issue at all, but at least it's worth checking.

Copy link
Contributor Author

@Baraa-Hasheesh Baraa-Hasheesh Aug 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did find that using nanosleep could potentially lead to an infinite loop, as for some reason the remaining time is reported as a value higher than the originally requested sleep time when the interruption happens too quickly (Yup 🤦)
=> I noticed the issue happens when the interruption happens in periods less than 60 micro seconds
=> Please note if you try to make the sleep uninterruptible while (nanosleep(&ts, &ts) < 0 && errno == EINTR) , the remainining time could potentially keep increasing if the interruption keep happening too frequently

I would vote to change to the clock_nanosleep function as the documentation suggests, as the TIMER_ABSTIME has build in protection against such possible cases

If flags is TIMER_ABSTIME, then t is interpreted as an absolute
       time as measured by the clock, clockid.  If t is less than or
       equal to the current value of the clock, then clock_nanosleep()
       returns immediately without suspending the calling thread.

=> I will check what's the macOS equivalent/needed changes

I did my testing using the following code

#include <cstdio>
#include <time.h>
#include <pthread.h>
#include <unistd.h>
#include <signal.h>

typedef unsigned long long u64;

void sigio_handler(int signum) {
}

typedef struct {
    pthread_t thread_id;
    u64 time;
} thread_args_t;

u64 nanotime() {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (u64)ts.tv_sec * 1000000000 + ts.tv_nsec;
}

void* sleepyThread(void *args) {
    while (true) {
        u64 start = nanotime();
        struct timespec ts = {1, 0};
        nanosleep(&ts, &ts);
        u64 sleptTime = nanotime() - start;
        u64 remaining = (ts.tv_sec * 1000000000ULL) + ts.tv_nsec;
        fprintf(stderr, "Slept for %llu, Remaining = %llu\n", sleptTime, remaining);
    }
}

void* alarmThread(void* args) {
    thread_args_t thread_args = *(thread_args_t*)args;

    u64 time = thread_args.time;

    u64 start = nanotime();
    while (true) {
        if (nanotime() - start >= time) {
            start = nanotime();
            pthread_kill(thread_args.thread_id, SIGIO);
        }
    }
}

int main() {
    signal(SIGIO, sigio_handler);
    pthread_t thread1;
    pthread_create(&thread1, NULL, sleepyThread, NULL);

    thread_args_t args;
    args.thread_id = thread1;
    args.time = 1000 * 60; // Change to adjust how frequently we signal the thread 

    pthread_t thread2;
    pthread_create(&thread2, NULL, alarmThread, &args);

    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nanosleep issue doesn't seem to really effect macOS from my testing

The Apple documentation doesn't reference similar issues https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man2/nanosleep.2.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated here ba1f5c5

@Baraa-Hasheesh Baraa-Hasheesh marked this pull request as draft August 12, 2025 13:45
}

void J9WallClock::uninterruptibleSleep(u64 nanos) const {
#ifdef __APPLE__
Copy link
Contributor Author

@Baraa-Hasheesh Baraa-Hasheesh Aug 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@apangin I did play around with creating this function in the os files
But decided against it for 2 main reasons:

  • I want to avoid creating an uninterruptible sleep in a common location as that looks like a possible function that could be miss used in future developments

  • As mentioned previously we should check the _running flag, while it's possible to pass to a theoretical common function I didn't like any of the implementations I thought of (lambda, pass by reference)

Copy link
Member

@apangin apangin Aug 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see two issues with your current implementation:

  1. non-trivial piece of code is copy-pasted verbatim across two places;
  2. platform-specific #ifdefs in shared code.

Individually they are not too terrible, but overall solution becomes not very nice.

I agree that passing a pointer to _running flag is not ideal, but IMO it's better than the current approach.
I'd propose to move uninterruptibleSleep to os_<platform>.cpp with a pointer to the running flag as an argument. In the future, if we need uninterruptibleSleep somewhere else, we can generify this with a function argument, but until then, let's keep it simple.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated here 1c9f6c7

@Baraa-Hasheesh Baraa-Hasheesh marked this pull request as ready for review August 13, 2025 07:53
@apangin apangin merged commit 0e551b0 into async-profiler:master Aug 13, 2025
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants