[None][infra] Improve the failure message for accuracy test suite #7994
Conversation
/bot run

PR_Github #19941 [ run ] triggered by Bot
📝 Walkthrough

Introduces a HypothesisTestingParams dataclass and replaces AccuracyTask.get_num_samples_and_threshold with AccuracyTask.get_hypothesis_testing_params. Updates evaluation/summarization to use the new object's num_samples and threshold, adds report/assert methods, and switches various print statements to logger.info. Documentation is updated to reference the new method.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Runner as Test Runner
    participant Task as AccuracyTask
    participant HParams as HypothesisTestingParams
    participant Eval as Evaluator
    Runner->>Task: get_hypothesis_testing_params(acc_specs)
    Task->>HParams: construct with acc_specs
    HParams-->>Task: params (num_samples, threshold, ...)
    Task-->>Runner: HypothesisTestingParams
    Runner->>Eval: evaluate(num_samples = params.num_samples)
    Eval-->>Runner: accuracy
    Runner->>HParams: report(accuracy) / assert_passing(accuracy)
    HParams-->>Runner: report string / assertion outcome
```

```mermaid
sequenceDiagram
    autonumber
    participant Runner as Test Runner
    participant Task as AccuracyTask
    participant Eval as Evaluator
    Note over Runner,Eval: Previous flow (removed)
    Runner->>Task: get_num_samples_and_threshold(acc_specs)
    Task-->>Runner: (num_samples, threshold)
    Runner->>Eval: evaluate(num_samples)
    Eval-->>Runner: accuracy
    Runner->>Runner: compare accuracy vs threshold
```
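To make the new flow concrete, here is a minimal, self-contained sketch of the caller side. The class below is a simplified stand-in, not the dataclass from accuracy_core.py, and the reference/threshold/accuracy numbers are made up; only the method and attribute names (num_samples, threshold, report, assert_passing) mirror the walkthrough above.

```python
from dataclasses import dataclass


@dataclass
class HypothesisTestingParams:
    """Simplified stand-in; only the names from the walkthrough are mirrored."""
    ref_accuracy: float
    num_samples: int
    threshold: float
    higher_is_better: bool = True

    def report(self, accuracy: float) -> str:
        # A human-readable summary used both for logging and failure messages.
        return (f"reference={self.ref_accuracy:.3f}, "
                f"threshold={self.threshold:.3f}, observed={accuracy:.3f}")

    def assert_passing(self, accuracy: float) -> None:
        # Direction of the comparison depends on whether higher is better.
        passing = (accuracy >= self.threshold
                   if self.higher_is_better else accuracy <= self.threshold)
        assert passing, f"Accuracy check failed:\n{self.report(accuracy)}"


# In the real suite, params would come from task.get_hypothesis_testing_params(**acc_specs)
# and accuracy from the evaluator run with num_samples=params.num_samples.
params = HypothesisTestingParams(ref_accuracy=75.0, num_samples=512, threshold=73.4)
accuracy = 74.2  # stand-in for the evaluator's result

print(params.report(accuracy))
params.assert_passing(accuracy)
```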
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches: ❌ Failed checks (3 warnings)
Actionable comments posted: 1
🧹 Nitpick comments (4)
tests/integration/defs/accuracy/accuracy_core.py (4)
70-80: Add a concise Google-style docstring to HypothesisTestingParams.

Improves API clarity and aligns with the repo's Python docstring guidelines.
```diff
 @dataclass(slots=True)
 class HypothesisTestingParams:
+    """Hypothesis testing configuration for accuracy checks.
+
+    Attributes:
+        ref_accuracy (float): Reference accuracy for threshold computation.
+            Typically in [0, 100] for percentage-like metrics; may be raw for
+            lower-is-better metrics (e.g., perplexity).
+        num_samples (int): Number of samples to evaluate.
+        alpha (float): Type I error rate in (0, 1). Defaults to 0.05.
+        beta (float): Type II error rate in (0, 1). Defaults to 0.2.
+        sigma (float): Per-sample score standard deviation scaled to the metric
+            range (e.g., 50.0 for Bernoulli accuracy in 0–100 scale).
+        higher_is_better (bool): True if larger metric is better; False otherwise.
+        theta (float): Computed minimum detectable effect (post-init).
+        threshold (float): Computed decision threshold (post-init).
+    """
```
112-119: Add failure margin to the assertion message for faster debugging.

Including how far the result is from the threshold speeds triage.
```diff
 def assert_passing(self, accuracy: float) -> None:
     compare_op = ">=" if self.higher_is_better else "<="
-    err_msg = f"Reference accuracy is {self.ref_accuracy:.3f}, threshold is {self.threshold:.3f}. Expected accuracy {compare_op} threshold, but got {accuracy:.3f}. Please see hypothesis testing report:\n{self.report(accuracy)}"
+    margin = (accuracy - self.threshold) if self.higher_is_better else (self.threshold - accuracy)
+    err_msg = (
+        f"Reference accuracy is {self.ref_accuracy:.3f}, threshold is {self.threshold:.3f}. "
+        f"Expected accuracy {compare_op} threshold (margin: {margin:.3f}), but got {accuracy:.3f}. "
+        f"Please see hypothesis testing report:\n{self.report(accuracy)}"
+    )
     if self.higher_is_better:
         assert accuracy >= self.threshold, err_msg
     else:
         assert accuracy <= self.threshold, err_msg
```
148-150: Complete the docstring with Returns and Raises sections (Google style).

Aligns with repo docs and clarifies the contract.
```diff
-    def get_hypothesis_testing_params(self,
-                                      **acc_specs) -> HypothesisTestingParams:
-        """Get hypothesis testing parameters via accuracy specifications.
+    def get_hypothesis_testing_params(self,
+                                      **acc_specs) -> HypothesisTestingParams:
+        """Get hypothesis testing parameters via accuracy specifications.
 
-        Args:
-            acc_specs: Accuracy specifications, currently including:
-                dtype (str): Model data type. Defaults to 'auto'.
-                quant_algo (str): Quantizaion algorithm. Defaults to None.
-                kv_cache_quant_algo (str): KV cache quantizaion algorithm. Defaults to None.
-                spec_dec_algo (str): Speculative decoding algorithm. Defaults to None.
-                extra_acc_spec (str): Extra accuracy specifications. Defaults to None.
-        """
+        Args:
+            acc_specs: Accuracy specifications, currently including:
+                dtype (str): Model data type. Defaults to 'auto'.
+                quant_algo (str): Quantization algorithm. Defaults to None.
+                kv_cache_quant_algo (str): KV cache quantization algorithm. Defaults to None.
+                spec_dec_algo (str): Speculative decoding algorithm. Defaults to None.
+                extra_acc_spec (str): Extra accuracy specification tag. Defaults to None.
+
+        Returns:
+            HypothesisTestingParams: Filled hypothesis testing parameters for this task/model/spec.
+
+        Raises:
+            ValueError: If no matching reference is registered and TRTLLM_ACCURACY_NO_REFERENCE != "1".
+        """
```

Also applies to: 152-159
204-210: Clarify the integration-test log message.

It does not truly "skip verification"; it uses a permissive threshold. Make the message precise.
```diff
 if is_integration_test:
     logger.info(
-        "Running in INTEGRATION_TEST mode: using only 1 sample and skipping accuracy verification"
+        "Running in INTEGRATION_TEST mode: using 1 sample and a permissive threshold (effectively skipping strict accuracy verification)"
     )
     hypothesis_testing_params = HypothesisTestingParams(ref_accuracy=0,
                                                         num_samples=1)
```
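A short note on why this branch works at all: with ref_accuracy=0 and a single sample, the derived threshold should be permissive enough that essentially any score passes, so INTEGRATION_TEST mode presumably exercises the evaluation path without enforcing accuracy. The reworded message above just states that behavior explicitly.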
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- tests/integration/defs/accuracy/README.md (1 hunks)
- tests/integration/defs/accuracy/accuracy_core.py (12 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
tests/integration/defs/accuracy/accuracy_core.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
tests/integration/defs/accuracy/accuracy_core.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
tests/integration/defs/accuracy/accuracy_core.py
🧬 Code graph analysis (1)
tests/integration/defs/accuracy/accuracy_core.py (4)
- tensorrt_llm/evaluate/lm_eval.py (1)
  - evaluate (385-417)
- tensorrt_llm/evaluate/interface.py (1)
  - evaluate (81-110)
- tests/integration/defs/conftest.py (2)
  - llm_root (189-190)
  - llm_venv (699-716)
- tests/integration/defs/common.py (1)
  - venv_check_call (28-34)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (10)
tests/integration/defs/accuracy/accuracy_core.py (9)
19-19: LGTM: new dataclass imports are appropriate.
93-111: LGTM: clear, user-friendly hypothesis testing report.

This will significantly improve failure diagnostics.

175-182: LGTM: centralizing parameter resolution into HypothesisTestingParams.

Defaulting to task-level ALPHA/BETA/SIGMA/NUM_SAMPLES and HIGHER_IS_BETTER is clean. Combined with post_init validation, bad configs will fail fast with clear errors (see the sketch after this comment list).

233-235: LGTM: passing num_samples into the evaluator constructor.

Matches how downstream evaluators derive their limits; keeps evaluation volume consistent with hypothesis settings.

242-246: LGTM: improved assertion path with detailed report.

The inline report will be invaluable for CI failures.

499-499: LGTM: switch from prints to structured logger.info.

Improves log consistency and visibility in CI.

Also applies to: 601-601, 620-621, 688-689, 719-719, 727-727

637-647: LGTM: consistent use of HypothesisTestingParams in summarize flow.

num_samples and threshold are consumed uniformly; the pre-log of the report is a nice touch.

688-699: LGTM: MMLU flow now uses unified hypothesis parameters.

Stable and clear.

734-744: LGTM: long-context eval uses unified hypothesis parameters.

Consistent with other flows; batch sizing derived safely.
tests/integration/defs/accuracy/README.md (1)
128-129: LGTM: updated reference to get_hypothesis_testing_params.

Documentation now matches the new API surface.
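Regarding the post_init validation praised in the 175-182 comment above, here is an illustrative, self-contained sketch of how such a dataclass could validate its inputs and derive theta/threshold in __post_init__ from a one-sided z-test. The formula, defaults, and class name are assumptions for illustration, not the exact computation in accuracy_core.py.

```python
import math
from dataclasses import dataclass, field
from statistics import NormalDist


@dataclass
class HypothesisTestingParamsSketch:
    """Illustrative only: validates inputs and derives theta/threshold."""

    ref_accuracy: float
    num_samples: int
    alpha: float = 0.05
    beta: float = 0.2
    sigma: float = 50.0
    higher_is_better: bool = True
    theta: float = field(init=False)
    threshold: float = field(init=False)

    def __post_init__(self) -> None:
        # Fail fast on bad configurations with clear errors.
        if self.num_samples <= 0:
            raise ValueError(f"num_samples must be positive, got {self.num_samples}")
        if not (0.0 < self.alpha < 1.0 and 0.0 < self.beta < 1.0):
            raise ValueError("alpha and beta must lie in (0, 1)")
        z = NormalDist()
        # Assumed formula: minimum detectable effect of a one-sided z-test
        # with error rates (alpha, beta) over num_samples i.i.d. samples.
        self.theta = ((z.inv_cdf(1 - self.alpha) + z.inv_cdf(1 - self.beta))
                      * self.sigma / math.sqrt(self.num_samples))
        # Tolerate a regression of at most theta relative to the reference.
        if self.higher_is_better:
            self.threshold = self.ref_accuracy - self.theta
        else:
            self.threshold = self.ref_accuracy + self.theta


if __name__ == "__main__":
    params = HypothesisTestingParamsSketch(ref_accuracy=75.0, num_samples=512)
    print(f"theta={params.theta:.3f}, threshold={params.threshold:.3f}")
```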
PR_Github #19941 [ run ] completed with state

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

/bot reuse-pipeline --comment "resolved trivial rebase conflict"

PR_Github #20013 [ reuse-pipeline ] triggered by Bot

PR_Github #20013 [ reuse-pipeline ] completed with state
Summary by CodeRabbit
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.

run

`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL): Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL): Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL): Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL): Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill

`kill`

Kill all running builds associated with pull request.

skip

`skip --comment COMMENT`

Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
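For example, combining the documented flags, a maintainer might comment `/bot run --stage-list "A10-PyTorch-1" --disable-fail-fast` to run a single stage without fail-fast; the stage name here is only the example value from the help text above.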