-
Notifications
You must be signed in to change notification settings - Fork 938
Reduce resonance between methods in CPU smoke test #1465
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
test/test/smoke/Cpu.java
Outdated
|
|
||
| private static void method1() { | ||
| for (int i = 0; i < 1000000; i++) { | ||
| for (int i = 0; i < 314159; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this number improve the likelihood of getting samples in all methods?
If there's a resonance problem, how about we let all three run in parallel?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it reduces possible resonance hence reducing the possibility of all 3 methods synchronizing hence increasing the probability of sampling all
As discussed with @apangin internally that would defeat the intention of the test, which is that hot methods should be sampled even if the duration of each call is less than the sampling duration
Running all 3 in parallel would make them in practice have higher run time than the sampling duration & completely change the nature of the test from a single threaded app to a multi threaded one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain how you choose the numbers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed in person, thanks
|
What about |
Sure let me re-run the tests with the new duration to make sure everything is working as expected |
@apangin neither values of 7 & 3 are stable with the current iteration counts, I would recommend keeping it as 10 |
|
What does it mean it's not stable? And how does 10ms duration make it "more stable", given that it correlates with the sampling interval? |
@apangin I test the stability of a given option by simply re-running the same test 100s of times, then measure how likely it's to fail In the case of using 7,3ms the failure rate was quite high, As for why 10ms is more stable, it would require an actual full analysis of how the CPU executes the java byte code & create some models to predict when samples would happen, & I honestly didn't see the value of going through that given that the current numbers were observed to be quite stable on multiple systems |
|
Are these values tuned to a specific CPU type/size? |
I tried to run this changes across multiple CPU types & sizes & didn't observe any instabilities on the configuration I ran against |
This is contrary to the declared intention of this PR, which is supposed to reduce correlation with the sampling interval. You are saying that the test is more stable when I don't mind skipping time consuming analysis if test behavior matches expectation, but here I see the opposite. I've always said that the purpose of a test is not to "pass", but to validate assumptions. My assumption here is, if a test runs 3 methods for a comparable amount of time each, all 3 methods should be visible in a profile. Where am I wrong? What makes you sure that it's a test issue rather than a profiler bug or a JVM bug? |
The intention here is reduce possible synchronization of all 3 methods together as a single unit with the sampling interval. I would describe as reducing the chance that all 3 methods come to an agreement to actively prevent one of them from being sampled from the profiler by coordinating their run time.
This is a fair assumption but kinda goes against what the test actually runs, as the methods do not strictly run for comparable time to one another in a controlled manner, Method1 & 2 can take different times to complete depending on the system they run on (CPU resources, JDK version & optimizations) For example let's assume we change method3 to run for 3ms & method2/1 have a combined run time of 7ms on a random system. (total run time will be 3 + 7 = 10ms = sampling duration)
If it's a JVM bug it would have been viewable in the generated byte code, (For example one method is completely optimized out) Technically speaking it's a limitation of the profiler due to the realities of how sampling is collected. While this is is indeed a limitation of the profiler, I wouldn't personally consider it a bug per say Now, regarding this test, If this is to be considered a bug/limitation that needs to be fixed, that would need to be discussed in depth & don't believe that this PR is the proper place to handle that discussion. For this test in particular to reduce the chance of failures that may happen due to possible biased sampling of periodic/consistent work loads we should re-write this test to no longer be strictly periodic/consistent, IMO that would make it a better smoke test for async-profiler, as most real applications don't have strictly periodic/consistent loads, What are you thoughts about this? |
Signed-off-by: Bara' Hasheesh <bara.hasheesh@gmail.com>
Description
It was observed that it's possible for either method1 or method2 not to be sampled in certain runs
When checking the logs it was noticed that the sample distribution was 50% on method3 & the other 50% either method1 or method2.
This change makes the loop counter for each method unique to avoid possible synchronization between them & method3
Related issues
N/A
Motivation and context
reduce the possibility of coordination of timing between the methods
How has this been tested?
Test was re-run 100s of times on various systems including ones where the failure was observed
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.