[None] [feat] Update disagg gen-only benchmark. by qiaoxj07 · Pull Request #7917 · NVIDIA/TensorRT-LLM · GitHub

Conversation

@qiaoxj07
Collaborator

@qiaoxj07 qiaoxj07 commented Sep 23, 2025

Sleep after kv-cache transfer to avoid using multiple ctx instances.

Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
@qiaoxj07 qiaoxj07 requested a review from a team as a code owner September 23, 2025 01:44
@qiaoxj07 qiaoxj07 changed the title Update disagg gen-only benchmark, sleep after kv-cache transfer to avoid use multiple ctx instances. Update disagg gen-only benchmark, sleep after kv-cache transfer to avoid using multiple ctx instances. Sep 23, 2025
@qiaoxj07 qiaoxj07 changed the title Update disagg gen-only benchmark, sleep after kv-cache transfer to avoid using multiple ctx instances. [None] [feat] Update disagg gen-only benchmark. Sep 23, 2025
@qiaoxj07
Collaborator Author

/bot run

@coderabbitai
Contributor

coderabbitai bot commented Sep 23, 2025

📝 Walkthrough

Walkthrough

Adds a benchmark-aware forward-gating mechanism in _executor_loop_overlap: it removes the initial blocking wait, introduces a can_forward flag, and delays scheduling until request queues reach benchmark thresholds, with allgather coordination specific to attention data parallelism (DP). Warmup handling is unchanged; the new gating runs before normal scheduling proceeds.

Changes

Cohort / File(s): PyExecutor benchmark gating and scheduling loop — tensorrt_llm/_torch/pyexecutor/py_executor.py
Summary of changes: Removed the initial startup blocking wait; introduced a can_forward flag. Added gating logic: during benchmarking, delay forward until queues are primed. With attention_dp: compute local_can_forward, allgather across ranks, conditionally sleep (10s/30s), then enable forward. Without attention_dp: wait until generation requests >= benchmark_req_queues_size. Warmup path preserved; normal scheduling unchanged once can_forward is True. No public API changes.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor U as User/Benchmark Driver
  participant P as PyExecutor
  participant Q as RequestQueue
  participant W as World (Ranks)
  note over P: _executor_loop_overlap start (benchmark mode)

  U->>Q: Enqueue benchmark requests
  P->>P: Initialize can_forward (False if benchmark+KV transceiver)
  alt is_warmup
    P->>P: Warmup handling (unchanged)
  else not warmup and can_forward == False
    alt attention_dp enabled
      P->>Q: Inspect fetched/scheduled requests
      P->>W: allgather(local_can_forward)
      alt all ranks can_forward
        P->>P: sleep(30s)
        P->>P: can_forward = True
      else waiting
        Note right of P: rank0 logs progress
        P->>P: sleep(10s)
        P-->>P: continue loop
      end
    else attention_dp disabled
      P->>Q: Check generation requests count
      alt count < benchmark_req_queues_size
        Note right of P: rank0 logs waiting
        P->>P: sleep(10s)
        P-->>P: continue loop
      else ready
        P->>P: can_forward = True
      end
    end
  end
  opt can_forward == True
    P->>Q: Normal scheduling/processing
  end
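The gating decision in the diagram is small enough to sketch standalone. Below is a minimal, hypothetical rendering of the readiness predicate and the allgather consensus (helper names are illustrative — the real logic is inline in _executor_loop_overlap, and the gathered flags would come from dist.allgather):

```python
def local_ready(num_fetched_requests: int, scheduled_gen_batch: int,
                threshold: int) -> bool:
    # Each attention-DP rank first checks its own progress against the
    # benchmark queue threshold (benchmark_req_queues_size).
    return num_fetched_requests + scheduled_gen_batch >= threshold


def all_ranks_ready(gathered_flags: list) -> bool:
    # In the executor, allgather collects local_ready() from every rank;
    # forward is enabled only once every rank reports ready.
    return all(gathered_flags)
```

For example, `all_ranks_ready([True, False])` is False, which corresponds to the "waiting" branch: rank 0 logs progress, the loop sleeps 10s, and continues.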

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Description Check — ⚠️ Warning: The PR description is essentially the repository's template with placeholder guidance; it lacks a concrete Description, Test Coverage details, and a filled PR Checklist, so it does not meet the template requirements. Resolution: replace the placeholders with a clear description and rationale (mention the gating/sleep logic added in tensorrt_llm/_torch/pyexecutor/py_executor.py), list the tests or commands that verify the new behavior, include a ticket/issue in the title per the repo template, and confirm any required CODEOWNERS or documentation updates before merging.
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 0.00%, below the required 80.00% threshold. Resolution: run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (1 passed)
  • Title Check — ✅ Passed: The title accurately reflects the primary change — adding a sleep in the disaggregated (disagg) gen-only benchmark to avoid creating multiple ctx instances after kv-cache transfer — and maps to the gating and sleep logic introduced in tensorrt_llm/_torch/pyexecutor/py_executor.py. The phrasing is slightly awkward but still matches the changeset.





@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
tensorrt_llm/_torch/pyexecutor/py_executor.py (3)

1164-1172: Make sleeps interruptible and configurable; avoid 30s uninterruptible stall.

Hard-coded time.sleep(30) blocks shutdown and delays responsiveness. Use shutdown_event.wait() and env-configurable durations.

Apply:

-                        if all(all_can_forward):
-                            time.sleep(30)
-                            can_forward = True
+                        if all(all_can_forward):
+                            ready_sleep_s = int(os.getenv("TLLM_BENCHMARK_READY_SLEEP_SEC", "30"))
+                            if self.shutdown_event.wait(ready_sleep_s):
+                                break
+                            can_forward = True
                         else:
-                            if self.dist.rank == 0:
-                                logger.info(
-                                    f"sleep 10 seconds, num_fetched_requests: {self.executor_request_queue.num_fetch_requests}, scheduled_gen_batch: {len(scheduled_batch.generation_requests)}"
-                                )
-                            time.sleep(10)
+                            wait_sleep_s = int(os.getenv("TLLM_BENCHMARK_WAIT_SLEEP_SEC", "10"))
+                            if self.dist.rank == 0:
+                                logger.info(
+                                    f"sleep {wait_sleep_s} seconds, "
+                                    f"num_fetched_requests: {self.executor_request_queue.num_fetch_requests}, "
+                                    f"scheduled_gen_batch: {len(scheduled_batch.generation_requests)}, "
+                                    f"threshold: {self.benchmark_req_queues_size}"
+                                )
+                            if self.shutdown_event.wait(wait_sleep_s):
+                                break
                             continue
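The interruptible-wait pattern this diff relies on can be shown in isolation with a plain threading.Event (a sketch only — the executor's own shutdown_event is assumed to behave like this):

```python
import threading


def wait_or_shutdown(shutdown_event: threading.Event, seconds: float) -> bool:
    # Event.wait(timeout) returns True if the event was set before the
    # timeout elapsed (shutdown requested) and False on a normal timeout,
    # so the caller can break out of the loop promptly instead of being
    # stuck in an uninterruptible time.sleep().
    return shutdown_event.wait(timeout=seconds)


ev = threading.Event()
timed_out = wait_or_shutdown(ev, 0.01)    # no shutdown requested: False
ev.set()
interrupted = wait_or_shutdown(ev, 10.0)  # already set: returns True at once
```

This is why the diff checks `if self.shutdown_event.wait(...)` and breaks: a requested shutdown short-circuits the sleep rather than waiting out the full duration.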

1174-1183: Fix incorrect log metric and make wait interruptible in non-attention-DP path.

Logged num_fetched_requests uses len(scheduled_batch.generation_requests) instead of the fetched metric; also use shutdown_event.wait().

-                        if len(scheduled_batch.generation_requests
-                               ) < self.benchmark_req_queues_size:
-                            if self.dist.rank == 0:
-                                logger.info(
-                                    f"sleep 10 seconds, num_fetched_requests: {len(scheduled_batch.generation_requests)}, scheduled_gen_batch: {len(scheduled_batch.generation_requests)}"
-                                )
-                            time.sleep(10)
+                        if len(scheduled_batch.generation_requests) < self.benchmark_req_queues_size:
+                            wait_sleep_s = int(os.getenv("TLLM_BENCHMARK_WAIT_SLEEP_SEC", "10"))
+                            if self.dist.rank == 0:
+                                logger.info(
+                                    f"sleep {wait_sleep_s} seconds, "
+                                    f"num_fetched_requests: {self.executor_request_queue.num_fetch_requests}, "
+                                    f"scheduled_gen_batch: {len(scheduled_batch.generation_requests)}, "
+                                    f"threshold: {self.benchmark_req_queues_size}"
+                                )
+                            if self.shutdown_event.wait(wait_sleep_s):
+                                break
                             continue

1159-1172: Prefer progress-based readiness over fixed 30s delay (avoid over/under-waiting).

Sleeping a fixed 30s after consensus may be too long or too short depending on KV transfer size and link bandwidth. Consider polling transfer status (e.g., any(req.is_disagg_generation_transmission_in_progress)) with a bounded wait to proceed as soon as transfers complete.

I can draft a bounded poll loop using shutdown_event.wait(wait_sleep_s) and a max polls guard if you want.
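A bounded poll along those lines might look like the following sketch. The helper and its arguments are hypothetical: transfer_in_progress stands in for checking is_disagg_generation_transmission_in_progress across active requests, and shutdown_event for the executor's shutdown signal.

```python
import threading


def wait_for_transfers(transfer_in_progress, shutdown_event: threading.Event,
                       poll_interval_s: float = 1.0,
                       max_polls: int = 30) -> bool:
    """Proceed as soon as KV-cache transfers finish, with a bounded wait.

    Returns True when it is safe to enable forward, False if shutdown
    was requested while waiting.
    """
    for _ in range(max_polls):
        if not transfer_in_progress():
            return True                       # transfers done early: no stall
        if shutdown_event.wait(poll_interval_s):
            return False                      # shutdown during the wait
    return True                               # bound reached: proceed anyway
```

Compared to a fixed 30-second sleep, this proceeds as soon as the predicate clears, while the max_polls bound caps the worst-case wait.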

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9da4203 and 4346605.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (3)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
  • is_warmup (511-512)
  • is_warmup (515-518)
tensorrt_llm/_torch/distributed/communicator.py (3)
  • allgather (101-102)
  • allgather (336-337)
  • rank (29-30)
tensorrt_llm/mapping.py (2)
  • rank (328-329)
  • rank (332-339)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1155-1161: Double-counting risk in readiness predicate; please confirm intent.

local_can_forward = executor_request_queue.num_fetch_requests + len(scheduled_batch.generation_requests) >= threshold. If num_fetch_requests is already a global total, adding the local scheduled count may overcount. Non-attention-DP path only uses scheduled count, which is inconsistent.

Would you prefer to:

  • Compare only num_fetch_requests (global) to the threshold, or
  • Allgather the local scheduled count and sum across attention‑DP ranks?
    I can provide a patch once you confirm the intended semantics.
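The second option can be sketched with the collective simulated as a plain list (hypothetical helper — in the executor, the list would come from allgathering each rank's local scheduled count across attention-DP ranks):

```python
def global_ready(local_scheduled_counts, threshold: int) -> bool:
    # Sum the per-rank scheduled generation counts and compare the global
    # total to the threshold, so no rank mixes a global fetch counter with
    # its local scheduled count (the double-counting risk flagged above).
    return sum(local_scheduled_counts) >= threshold
```

With a threshold of 8, three ranks reporting [3, 3, 2] are ready, while [3, 3, 1] keeps the loop waiting.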

@tensorrt-cicd
Collaborator

PR_Github #19636 [ run ] triggered by Bot

@qiaoxj07 qiaoxj07 requested a review from bo-nv September 23, 2025 02:03
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
@qiaoxj07
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #19650 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19636 [ run ] completed with state ABORTED
LLM/main/L0_MergeRequest_PR #14771 (Blue Ocean) completed with status: ABORTED

@tensorrt-cicd
Collaborator

PR_Github #19650 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14785 completed with status: 'SUCCESS'

@qiaoxj07
Collaborator Author

/bot reuse-pipeline

@tensorrt-cicd
Collaborator

PR_Github #19732 [ reuse-pipeline ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19732 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #19650 for commit 55b9b26

@kaiyux kaiyux requested review from a team, QiJune, Tabrizian and chuangz0 and removed request for a team September 24, 2025 02:54
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
@qiaoxj07
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #19934 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19934 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15006 completed with status: 'FAILURE'

@qiaoxj07
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #20009 [ run ] triggered by Bot

@qiaoxj07
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #20040 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #20009 [ run ] completed with state ABORTED
LLM/main/L0_MergeRequest_PR #15070 (Blue Ocean) completed with status: ABORTED

@qiaoxj07
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #20074 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #20040 [ run ] completed with state ABORTED
LLM/main/L0_MergeRequest_PR #15096 (Blue Ocean) completed with status: ABORTED

@qiaoxj07
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #20096 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #20074 [ run ] completed with state ABORTED
LLM/main/L0_MergeRequest_PR #15124 (Blue Ocean) completed with status: ABORTED

@tensorrt-cicd
Collaborator

PR_Github #20096 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15142 completed with status: 'FAILURE'

@qiaoxj07
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #20132 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #20132 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15175 completed with status: 'SUCCESS'

@qiaoxj07 qiaoxj07 merged commit c8f98b3 into NVIDIA:main Sep 28, 2025
5 checks passed