[None] [feat] Update disagg gen-only benchmark. by qiaoxj07 · Pull Request #7917 · NVIDIA/TensorRT-LLM · GitHub

Conversation

@qiaoxj07
Collaborator

@qiaoxj07 qiaoxj07 commented Sep 23, 2025

Sleep after kv-cache transfer to avoid using multiple ctx instances.

Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
@qiaoxj07 qiaoxj07 requested a review from a team as a code owner September 23, 2025 01:44
@qiaoxj07 qiaoxj07 changed the title Update disagg gen-only benchmark, sleep after kv-cache transfer to avoid use multiple ctx instances. Update disagg gen-only benchmark, sleep after kv-cache transfer to avoid using multiple ctx instances. Sep 23, 2025
@qiaoxj07 qiaoxj07 changed the title Update disagg gen-only benchmark, sleep after kv-cache transfer to avoid using multiple ctx instances. [None] [feat] Update disagg gen-only benchmark. Sep 23, 2025
@qiaoxj07
Collaborator Author

/bot run

@coderabbitai
Contributor

coderabbitai bot commented Sep 23, 2025

📝 Walkthrough

Walkthrough

Adds a benchmark-aware forward-gating mechanism in _executor_loop_overlap: it removes the initial blocking wait, introduces a can_forward flag, and delays scheduling until request queues reach benchmark thresholds, with allgather coordination specific to attention data parallelism (DP). Warmup handling is unchanged; the new gating runs before normal scheduling proceeds.

Changes

Cohort / File(s): PyExecutor benchmark gating and scheduling loop — tensorrt_llm/_torch/pyexecutor/py_executor.py
Summary of changes: Removed the initial startup blocking wait; introduced a can_forward flag. Added gating logic: during benchmarking, delay forward until queues are primed. With attention_dp: compute local_can_forward, allgather across ranks, conditionally sleep (10s/30s), then enable forward. Without attention_dp: wait until generation requests >= benchmark_req_queues_size. Warmup path preserved; normal scheduling unchanged once can_forward is True. No public API changes.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor U as User/Benchmark Driver
  participant P as PyExecutor
  participant Q as RequestQueue
  participant W as World (Ranks)
  note over P: _executor_loop_overlap start (benchmark mode)

  U->>Q: Enqueue benchmark requests
  P->>P: Initialize can_forward (False if benchmark+KV transceiver)
  alt is_warmup
    P->>P: Warmup handling (unchanged)
  else not warmup and can_forward == False
    alt attention_dp enabled
      P->>Q: Inspect fetched/scheduled requests
      P->>W: allgather(local_can_forward)
      alt all ranks can_forward
        P->>P: sleep(30s)
        P->>P: can_forward = True
      else waiting
        Note right of P: rank0 logs progress
        P->>P: sleep(10s)
        P-->>P: continue loop
      end
    else attention_dp disabled
      P->>Q: Check generation requests count
      alt count < benchmark_req_queues_size
        Note right of P: rank0 logs waiting
        P->>P: sleep(10s)
        P-->>P: continue loop
      else ready
        P->>P: can_forward = True
      end
    end
  end
  opt can_forward == True
    P->>Q: Normal scheduling/processing
  end
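The gating decision in the diagram is small enough to sketch standalone. Below is a minimal, hypothetical rendering of the readiness predicate and the allgather consensus (helper names are illustrative — the real logic is inline in _executor_loop_overlap, and the gathered flags would come from dist.allgather):

```python
def local_ready(num_fetched_requests: int, scheduled_gen_batch: int,
                threshold: int) -> bool:
    # Each attention-DP rank first checks its own progress against the
    # benchmark queue threshold (benchmark_req_queues_size).
    return num_fetched_requests + scheduled_gen_batch >= threshold


def all_ranks_ready(gathered_flags: list) -> bool:
    # In the executor, allgather collects local_ready() from every rank;
    # forward is enabled only once every rank reports ready.
    return all(gathered_flags)
```

For example, `all_ranks_ready([True, False])` is False, which corresponds to the "waiting" branch: rank 0 logs progress, the loop sleeps 10s, and continues.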

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Description Check — ⚠️ Warning: The PR description is essentially the repository's template with placeholder guidance; it lacks a concrete Description, Test Coverage details, and a filled PR Checklist, so it does not meet the template requirements. Resolution: replace the placeholders with a clear description and rationale (mention the gating/sleep logic added in tensorrt_llm/_torch/pyexecutor/py_executor.py), list the tests or commands that verify the new behavior, include a ticket/issue in the title per the repo template, and confirm any required CODEOWNERS or documentation updates before merging.
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 0.00%, below the required 80.00% threshold. Resolution: run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (1 passed)
  • Title Check — ✅ Passed: The title accurately reflects the primary change — adding a sleep in the disaggregated (disagg) gen-only benchmark to avoid creating multiple ctx instances after kv-cache transfer — and maps to the gating and sleep logic introduced in tensorrt_llm/_torch/pyexecutor/py_executor.py. The phrasing is slightly awkward but still matches the changeset.





@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
tensorrt_llm/_torch/pyexecutor/py_executor.py (3)

1164-1172: Make sleeps interruptible and configurable; avoid 30s uninterruptible stall.

Hard-coded time.sleep(30) blocks shutdown and delays responsiveness. Use shutdown_event.wait() and env-configurable durations.

Apply:

-                        if all(all_can_forward):
-                            time.sleep(30)
-                            can_forward = True
+                        if all(all_can_forward):
+                            ready_sleep_s = int(os.getenv("TLLM_BENCHMARK_READY_SLEEP_SEC", "30"))
+                            if self.shutdown_event.wait(ready_sleep_s):
+                                break
+                            can_forward = True
                         else:
-                            if self.dist.rank == 0:
-                                logger.info(
-                                    f"sleep 10 seconds, num_fetched_requests: {self.executor_request_queue.num_fetch_requests}, scheduled_gen_batch: {len(scheduled_batch.generation_requests)}"
-                                )
-                            time.sleep(10)
+                            wait_sleep_s = int(os.getenv("TLLM_BENCHMARK_WAIT_SLEEP_SEC", "10"))
+                            if self.dist.rank == 0:
+                                logger.info(
+                                    f"sleep {wait_sleep_s} seconds, "
+                                    f"num_fetched_requests: {self.executor_request_queue.num_fetch_requests}, "
+                                    f"scheduled_gen_batch: {len(scheduled_batch.generation_requests)}, "
+                                    f"threshold: {self.benchmark_req_queues_size}"
+                                )
+                            if self.shutdown_event.wait(wait_sleep_s):
+                                break
                             continue
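The interruptible-wait pattern this diff relies on can be shown in isolation with a plain threading.Event (a sketch only — the executor's own shutdown_event is assumed to behave like this):

```python
import threading


def wait_or_shutdown(shutdown_event: threading.Event, seconds: float) -> bool:
    # Event.wait(timeout) returns True if the event was set before the
    # timeout elapsed (shutdown requested) and False on a normal timeout,
    # so the caller can break out of the loop promptly instead of being
    # stuck in an uninterruptible time.sleep().
    return shutdown_event.wait(timeout=seconds)


ev = threading.Event()
timed_out = wait_or_shutdown(ev, 0.01)    # no shutdown requested: False
ev.set()
interrupted = wait_or_shutdown(ev, 10.0)  # already set: returns True at once
```

This is why the diff checks `if self.shutdown_event.wait(...)` and breaks: a requested shutdown short-circuits the sleep rather than waiting out the full duration.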

1174-1183: Fix incorrect log metric and make wait interruptible in non-attention-DP path.

Logged num_fetched_requests uses len(scheduled_batch.generation_requests) instead of the fetched metric; also use shutdown_event.wait().

-                        if len(scheduled_batch.generation_requests
-                               ) < self.benchmark_req_queues_size:
-                            if self.dist.rank == 0:
-                                logger.info(
-                                    f"sleep 10 seconds, num_fetched_requests: {len(scheduled_batch.generation_requests)}, scheduled_gen_batch: {len(scheduled_batch.generation_requests)}"
-                                )
-                            time.sleep(10)
+                        if len(scheduled_batch.generation_requests) < self.benchmark_req_queues_size:
+                            wait_sleep_s = int(os.getenv("TLLM_BENCHMARK_WAIT_SLEEP_SEC", "10"))
+                            if self.dist.rank == 0:
+                                logger.info(
+                                    f"sleep {wait_sleep_s} seconds, "
+                                    f"num_fetched_requests: {self.executor_request_queue.num_fetch_requests}, "
+                                    f"scheduled_gen_batch: {len(scheduled_batch.generation_requests)}, "
+                                    f"threshold: {self.benchmark_req_queues_size}"
+                                )
+                            if self.shutdown_event.wait(wait_sleep_s):
+                                break
                             continue

1159-1172: Prefer progress-based readiness over fixed 30s delay (avoid over/under-waiting).

Sleeping a fixed 30s after consensus may be too long or too short depending on KV transfer size and link bandwidth. Consider polling transfer status (e.g., any(req.is_disagg_generation_transmission_in_progress)) with a bounded wait to proceed as soon as transfers complete.

I can draft a bounded poll loop using shutdown_event.wait(wait_sleep_s) and a max polls guard if you want.
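A bounded poll along those lines might look like the following sketch. The helper and its arguments are hypothetical: transfer_in_progress stands in for checking is_disagg_generation_transmission_in_progress across active requests, and shutdown_event for the executor's shutdown signal.

```python
import threading


def wait_for_transfers(transfer_in_progress, shutdown_event: threading.Event,
                       poll_interval_s: float = 1.0,
                       max_polls: int = 30) -> bool:
    """Proceed as soon as KV-cache transfers finish, with a bounded wait.

    Returns True when it is safe to enable forward, False if shutdown
    was requested while waiting.
    """
    for _ in range(max_polls):
        if not transfer_in_progress():
            return True                       # transfers done early: no stall
        if shutdown_event.wait(poll_interval_s):
            return False                      # shutdown during the wait
    return True                               # bound reached: proceed anyway
```

Compared to a fixed 30-second sleep, this proceeds as soon as the predicate clears, while the max_polls bound caps the worst-case wait.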

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9da4203 and 4346605.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (3)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
  • is_warmup (511-512)
  • is_warmup (515-518)
tensorrt_llm/_torch/distributed/communicator.py (3)
  • allgather (101-102)
  • allgather (336-337)
  • rank (29-30)
tensorrt_llm/mapping.py (2)
  • rank (328-329)
  • rank (332-339)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1155-1161: Double-counting risk in readiness predicate; please confirm intent.

local_can_forward = executor_request_queue.num_fetch_requests + len(scheduled_batch.generation_requests) >= threshold. If num_fetch_requests is already a global total, adding the local scheduled count may overcount. Non-attention-DP path only uses scheduled count, which is inconsistent.

Would you prefer to:

  • Compare only num_fetch_requests (global) to the threshold, or
  • Allgather the local scheduled count and sum across attention‑DP ranks?
    I can provide a patch once you confirm the intended semantics.
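The second option can be sketched with the collective simulated as a plain list (hypothetical helper — in the executor, the list would come from allgathering each rank's local scheduled count across attention-DP ranks):

```python
def global_ready(local_scheduled_counts, threshold: int) -> bool:
    # Sum the per-rank scheduled generation counts and compare the global
    # total to the threshold, so no rank mixes a global fetch counter with
    # its local scheduled count (the double-counting risk flagged above).
    return sum(local_scheduled_counts) >= threshold
```

With a threshold of 8, three ranks reporting [3, 3, 2] are ready, while [3, 3, 1] keeps the loop waiting.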

@tensorrt-cicd
Collaborator

PR_Github #19636 [ run ] triggered by Bot

@qiaoxj07 qiaoxj07 requested a review from bo-nv September 23, 2025 02:03
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
@qiaoxj07
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #19650 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19636 [ run ] completed with state ABORTED
LLM/main/L0_MergeRequest_PR #14771 (Blue Ocean) completed with status: ABORTED

@tensorrt-cicd
Collaborator

PR_Github #19650 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14785 completed with status: 'SUCCESS'

@qiaoxj07
Collaborator Author

/bot reuse-pipeline

@tensorrt-cicd
Collaborator

PR_Github #19732 [ reuse-pipeline ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19732 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #19650 for commit 55b9b26

@kaiyux kaiyux requested review from a team, QiJune, Tabrizian and chuangz0 and removed request for a team September 24, 2025 02:54
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
@qiaoxj07
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #19934 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19934 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15006 completed with status: 'FAILURE'

@qiaoxj07
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #20009 [ run ] triggered by Bot

@qiaoxj07
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #20040 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #20009 [ run ] completed with state ABORTED
LLM/main/L0_MergeRequest_PR #15070 (Blue Ocean) completed with status: ABORTED

@qiaoxj07
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #20074 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #20040 [ run ] completed with state ABORTED
LLM/main/L0_MergeRequest_PR #15096 (Blue Ocean) completed with status: ABORTED

@qiaoxj07
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #20096 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #20074 [ run ] completed with state ABORTED
LLM/main/L0_MergeRequest_PR #15124 (Blue Ocean) completed with status: ABORTED

@tensorrt-cicd
Collaborator

PR_Github #20096 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15142 completed with status: 'FAILURE'

@qiaoxj07
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #20132 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #20132 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15175 completed with status: 'SUCCESS'

@qiaoxj07 qiaoxj07 merged commit c8f98b3 into NVIDIA:main Sep 28, 2025
5 checks passed