[TRTLLM-7183][test] Feature fix model issue for disagg serving #7785

fredricz-20070104 · 2025-09-17T05:49:19Z

Summary by CodeRabbit

Tests
- Added a readiness check in performance tests to wait for the server’s health endpoint before running client/benchmark steps, improving stability.
- Increased server startup and readiness timeouts to accommodate larger models and reduce test flakiness.
- Updated the perf sanity suite to use FP8-specific variants and adjusted model targets for better coverage and relevance.
- Overall improves reliability and predictability of performance test runs without changing public interfaces.

a) Fix timeout issue for the test case.
b) Fix model name issue for the test case.
c) Add wait for endpoint ready function to make sure disagg_serve endpoint is in working status when start testing

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

fredricz-20070104 · 2025-09-17T05:55:11Z

/bot run

coderabbitai · 2025-09-17T05:55:33Z

📝 Walkthrough

Walkthrough

Increases server start timeout from 1800s to 3600s in perf test command construction, adds an HTTP health readiness wait before running client/benchmark in disaggregated perf tests, and updates QA perf sanity YAML to swap specific FP8 and model variant test entries.

Changes

Cohort / File(s)	Summary of changes
Perf test lifecycle adjustments `tests/integration/defs/perf/test_perf.py`	Doubles server start timeout to 3600 seconds in disaggregated client command path. No other logic changes.
Disaggregated perf utils: readiness gate `tests/integration/defs/perf/utils.py`	Adds `wait_for_endpoint_ready(url: str, timeout: int = 600)` polling HTTP 200 via requests; integrates a pre-run health check `http://localhost:8000/health` with 1800s timeout into `run_cmd` after server launch, before client/benchmark.
QA perf sanity list updates `tests/integration/test_lists/qa/llm_perf_sanity.yml`	Replaces two test entries: remove `deepseek_v3_lite-disagg_server` and `llama_v3.1_70b-disagg_server`; add `deepseek_v3_lite_fp8-disagg_server` and `llama_v3.1_8b-disagg_server` in two sections.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor T as Test Runner
  participant P as PerfDisaggScriptTestCmds
  participant S as Disagg Server (proc)
  participant H as Health Endpoint (/health)
  participant C as Client
  participant B as Benchmark

  T->>P: run_cmd()
  P->>S: launch server process
  Note over P,S: New step: readiness gate
  P->>H: poll GET /health (<=1800s)
  alt 200 OK
    H-->>P: 200 Ready
    P->>C: execute client
    P->>B: execute benchmark
  else timeout / error
    H-->>P: non-200/timeout
    P->>T: report error/timeout
  end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Description Check	⚠️ Warning	The PR description only contains a short @coderabbitai summary with three brief points and does not follow the repository template: it is missing the detailed "Description" explaining the issue and fix, a "Test Coverage" section naming the relevant tests, and a completed "PR Checklist" as required by the template.	Please expand the PR description to match the template by adding a clear Description with implementation details and rationale (mention the increased server timeout and the new wait_for_endpoint_ready behavior), a Test Coverage section listing affected tests/files (e.g., tests/integration/defs/perf/test_perf.py, tests/integration/defs/perf/utils.py, and the updated YAML entries), and a completed PR Checklist including any manual verification steps or risks.
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title Check	✅ Passed	The title "[TRTLLM-7871][test]Feature fix model issue for disagg serving" is related to the PR's main changes—fixing model-name issues, addressing a timeout, and adding an endpoint-readiness wait—so it conveys the primary intent, but its phrasing ("Feature fix") is awkward and not fully aligned with the repository's conventional style.

✨ Finishing touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tests/integration/defs/perf/utils.py (1)

334-369: Guard process cleanup to avoid UnboundLocalError on early failures.

If any context in the multi-with fails to enter, server_proc/ctx_workers_proc/gen_workers_proc may be undefined; the finally block will then raise, hiding the root cause.

-        try:
+        ctx_workers_proc = None
+        gen_workers_proc = None
+        server_proc = None
+        try:
             with (  # Start ctx workers
                 open('output_ctx.log', 'w') as output_ctx,
                 popen(self.ctx_cmd,
                       stdout=output_ctx,
                       stderr=subprocess.STDOUT,
                       env=venv._new_env,
                       shell=True) as ctx_workers_proc,
                 # Start gen workers
                 open('output_gen.log', 'w') as output_gen,
                 popen(self.gen_cmd,
                       stdout=output_gen,
                       stderr=subprocess.STDOUT,
                       env=venv._new_env,
                       shell=True) as gen_workers_proc,
                 # Start server
                 open('output_server.log', 'w') as output_server,
                 popen(self.server_cmd,
                       stdout=output_server,
                       stderr=subprocess.STDOUT,
                       env=venv._new_env,
                       shell=True) as server_proc):
                 ...
         finally:
-            server_proc.terminate()
-            ctx_workers_proc.terminate()
-            gen_workers_proc.terminate()
-            server_proc.wait()
-            ctx_workers_proc.wait()
-            gen_workers_proc.wait()
+            for proc in (server_proc, ctx_workers_proc, gen_workers_proc):
+                if proc is not None:
+                    proc.terminate()
+                    proc.wait()

🧹 Nitpick comments (2)

tests/integration/defs/perf/utils.py (2)
358-361: Remove unnecessary f-string; consider sourcing host/port from server_config.

Fixes F541 and avoids drift between config and hard-coded URL. Optional: derive host/port from the generated server_config.yaml.
-                self.wait_for_endpoint_ready(
-                    f"http://localhost:8000/health",
-                    timeout=1800)  # 30 minutes for large models
+                self.wait_for_endpoint_ready(
+                    "http://localhost:8000/health",
+                    timeout=1800)  # 30 minutes for large models
1-2: Update header year to 2025 to match guidelines.

Keep the header range current across all source files.
-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6983e8a and 3b4d2a9.

📒 Files selected for processing (3)

tests/integration/defs/perf/test_perf.py (1 hunks)
tests/integration/defs/perf/utils.py (3 hunks)
tests/integration/test_lists/qa/llm_perf_sanity.yml (1 hunks)

🧰 Additional context used

📓 Path-based instructions (3)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

tests/integration/defs/perf/test_perf.py
tests/integration/defs/perf/utils.py

**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

tests/integration/defs/perf/test_perf.py
tests/integration/defs/perf/utils.py

**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

tests/integration/defs/perf/test_perf.py
tests/integration/defs/perf/utils.py

🧬 Code graph analysis (1)

tests/integration/defs/perf/utils.py (1)

tests/integration/defs/stress_test/stress_test.py (1)

url (79-81)

🪛 Ruff (0.12.2)

tests/integration/defs/perf/utils.py

326-326: Probable use of requests call without timeout

(S113)

329-329: Do not catch blind exception: Exception

(BLE001)

359-359: f-string without any placeholders

Remove extraneous f prefix

(F541)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pre-commit Check

🔇 Additional comments (2)

tests/integration/defs/perf/test_perf.py (1)

2009-2010: LGTM: extended server start timeout to 3600s.

Aligned with longer startup for large models; consistent with server -t/-r flags.

tests/integration/test_lists/qa/llm_perf_sanity.yml (1)

223-225: LGTM: swapped disagg tests to FP8/8B variants.

Entries map to existing MODEL_PATH_DICT keys and the disagg label format is valid.

tests/integration/defs/perf/utils.py

tensorrt-cicd · 2025-09-17T06:00:57Z

PR_Github #18904 [ run ] triggered by Bot

tensorrt-cicd · 2025-09-17T08:03:18Z

PR_Github #18904 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14170 completed with status: 'FAILURE'

ruodil · 2025-09-18T01:52:35Z

/bot run

tensorrt-cicd · 2025-09-18T01:57:57Z

PR_Github #19072 [ run ] triggered by Bot

tensorrt-cicd · 2025-09-18T04:25:43Z

PR_Github #19072 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14308 completed with status: 'FAILURE'

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

fredricz-20070104 · 2025-09-18T10:01:41Z

/bot run

tensorrt-cicd · 2025-09-18T10:06:41Z

PR_Github #19181 [ run ] triggered by Bot

tensorrt-cicd · 2025-09-18T14:26:32Z

PR_Github #19181 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14395 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

…A#7785) Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

fredricz-20070104 added 5 commits September 16, 2025 09:29

fix: fx issue

30c4a1c

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

try to fix llama 70b timeout issues

5bd038c

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

use a little bit small model for H100 memory compliance

1051bd5

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

reduce the waiting timeout for 8000 disagg serve

01a8d18

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

fix: fix timeout and mode name issue for disagg serving

3b4d2a9

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

fredricz-20070104 changed the title ~~[TRTLLM-7871][test]Feature fix model issue for disagg serving~~ [TRTLLM-7183][test]Feature fix model issue for disagg serving Sep 17, 2025

fredricz-20070104 changed the title ~~[TRTLLM-7183][test]Feature fix model issue for disagg serving~~ [TRTLLM-7183][test] Feature fix model issue for disagg serving Sep 17, 2025

fredricz-20070104 requested a review from bo-nv September 17, 2025 05:55

fredricz-20070104 requested a review from ruodil September 17, 2025 05:55

coderabbitai bot reviewed Sep 17, 2025

View reviewed changes

tests/integration/defs/perf/utils.py Show resolved Hide resolved

ruodil approved these changes Sep 17, 2025

View reviewed changes

bo-nv approved these changes Sep 17, 2025

View reviewed changes

get upstream latest code

05f239d

Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

ruodil merged commit fc4e6d3 into NVIDIA:main Sep 19, 2025
5 checks passed

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Sep 19, 2025

[TRTLLM-7183][test] Feature fix model issue for disagg serving (NVIDI…

b5f1393

…A#7785) Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

Wong4j pushed a commit to Wong4j/TensorRT-LLM that referenced this pull request Sep 20, 2025

[TRTLLM-7183][test] Feature fix model issue for disagg serving (NVIDI…

444cae4

…A#7785) Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

MrGeva pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request Sep 21, 2025

[TRTLLM-7183][test] Feature fix model issue for disagg serving (NVIDI…

9ee3344

…A#7785) Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>

[TRTLLM-7183][test] Feature fix model issue for disagg serving #7785

[TRTLLM-7183][test] Feature fix model issue for disagg serving #7785

Uh oh!

Conversation

fredricz-20070104 commented Sep 17, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

fredricz-20070104 commented Sep 17, 2025

Uh oh!

coderabbitai bot commented Sep 17, 2025

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

tensorrt-cicd commented Sep 17, 2025

Uh oh!

tensorrt-cicd commented Sep 17, 2025

Uh oh!

ruodil commented Sep 18, 2025

Uh oh!

tensorrt-cicd commented Sep 18, 2025

Uh oh!

tensorrt-cicd commented Sep 18, 2025

Uh oh!

fredricz-20070104 commented Sep 18, 2025

Uh oh!

tensorrt-cicd commented Sep 18, 2025

Uh oh!

tensorrt-cicd commented Sep 18, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

fredricz-20070104 commented Sep 17, 2025 •

edited by coderabbitai bot

Loading