[TRTLLM-7330][feat] Eagle3 cuda graph support for the first draft model inference by sunnyqgg · Pull Request #7363 · NVIDIA/TensorRT-LLM · GitHub

Conversation

@sunnyqgg
Collaborator

@sunnyqgg sunnyqgg commented Aug 29, 2025

Summary by CodeRabbit

  • New Features

    • First-draft support in speculative decoding with resource-manager awareness (Eagle3-compatible).
    • Requests can be marked as first-draft to affect drafting behavior and token handling.
    • CUDA graph flow now returns and uses a richer graph key to distinguish batch/draft scenarios.
  • Refactor

    • Preserves original draft lengths for correct token budgeting and warmup coverage.
    • Unified draft generation path and simplified forward flow (no special gating for CUDA graphs).

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
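
For example, an illustrative invocation combining the documented options (the stage name is a placeholder taken from the examples above):

/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"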

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

@sunnyqgg sunnyqgg requested review from a team as code owners August 29, 2025 07:12
@coderabbitai
Contributor

coderabbitai bot commented Aug 29, 2025

📝 Walkthrough

Adds first-draft awareness and Eagle3 resource-manager integration; extends CUDA-graph keying to support 2- and 3-tuple keys, updates CUDA-graph APIs to be key-based, and propagates draft/original draft-length handling through model engine, drafter, and speculative interfaces.

Changes

  • CUDA graph runner API / keying (tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py)
    Introduces multi-part graph keys Union[Tuple[int,int], Tuple[int,int,int]], CUDA_GRAPH_DUMMY_REQUEST_ID, get_graph_key(batch_size, spec_resource_manager). maybe_get_cuda_graph now accepts spec_resource_manager and returns (can_use, attn_metadata, spec_metadata, key). needs_capture/capture/replay and internal dicts updated to use the key.
  • Model engine: draft handling & CUDA-graph integration (tensorrt_llm/_torch/pyexecutor/model_engine.py)
    Preserves original_max_draft_len for token accounting, adjusts warmup and draft-length lists, introduces first_draft_requests handling and indices, passes spec_resource_manager into the speculative decision, sets Eagle3ResourceManager.is_first_draft during warmup, and switches the CUDA-graph flow to key-based needs_capture/capture/replay.
  • Speculative interface update (tensorrt_llm/_torch/speculative/interface.py)
    Adds a BaseResourceManager param, removes needs_kv_cache_recompute and extend_ctx, and changes the attention_need_spec_dec_mode signature/logic to consider spec_resource_manager and is_draft_model (EAGLE3/EAGLE3_ONE_MODEL cases).
  • Model drafter flow (tensorrt_llm/_torch/speculative/model_drafter.py)
    Removes the separate "no accepted tokens" path; unifies via the accepted-tokens branch with zero-padding to max_draft_tokens. Marks draft requests with py_is_first_draft, clears first-draft flags on others, removes CUDA-graph gating, and always forwards the draft model (uses no_cuda_graph).
  • LlmRequest first-draft flag (tensorrt_llm/_torch/pyexecutor/llm_request.py)
    Adds an optional is_first_draft argument stored as py_is_first_draft and propagated to child requests.
  • Executor creation tweak (tensorrt_llm/_torch/pyexecutor/py_executor_creator.py)
    Removes forced zeroing of draft_spec_config.max_draft_len when creating the draft/speculative executor.
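
For orientation, here is a minimal editorial sketch of the key-based capture/replay flow summarized above. It is not the cuda_graph_runner.py implementation: the names get_graph_key, needs_capture, capture, and replay follow the summary, but the signatures, the is_first_draft attribute access, and the toy capture logic are assumptions.

from typing import Any, Callable, Dict, Optional, Tuple, Union

# 2-tuple (batch_size, draft_len) normally; the 3-tuple adds an is_first_draft flag
# when an Eagle3-style resource manager distinguishes the first draft step.
GraphKey = Union[Tuple[int, int], Tuple[int, int, int]]

def get_graph_key(batch_size: int,
                  draft_len: int,
                  spec_resource_manager: Optional[object] = None) -> GraphKey:
    """Derive the CUDA-graph lookup key for a batch (sketch)."""
    is_first_draft = getattr(spec_resource_manager, "is_first_draft", None)
    if is_first_draft is None:
        return (batch_size, draft_len)
    return (batch_size, draft_len, int(is_first_draft))

class KeyedGraphRunner:
    """Toy runner: capture a forward callable once per key, then replay it."""

    def __init__(self) -> None:
        self._graphs: Dict[GraphKey, Callable[..., Any]] = {}
        self._outputs: Dict[GraphKey, Any] = {}

    def needs_capture(self, key: GraphKey) -> bool:
        return key not in self._graphs

    def capture(self, key: GraphKey, forward_fn: Callable[..., Any],
                inputs: Dict[str, Any]) -> Any:
        # A real implementation would record a torch.cuda.CUDAGraph here;
        # the sketch just remembers the callable and its first output buffers.
        self._graphs[key] = forward_fn
        self._outputs[key] = forward_fn(**inputs)
        return self._outputs[key]

    def replay(self, key: GraphKey, inputs: Dict[str, Any]) -> Any:
        # Replay reuses what was captured for the same (batch, draft, first-draft) shape.
        return self._graphs[key](**inputs)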

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Drafter as ModelDrafter
  participant Engine as ModelEngine
  participant RM as Eagle3ResourceManager
  participant Spec as SpecDecMode
  participant CG as CUDAGraphRunner

  Note over Drafter: Build batch\n(flag py_is_first_draft set on requests)
  Drafter->>Engine: forward(batch, new_tensors_device?)
  Engine->>RM: set is_first_draft based on draft_len
  Engine->>Spec: attention_need_spec_dec_mode(RM, is_draft_model)
  Spec-->>Engine: decision

  Engine->>CG: maybe_get_cuda_graph(batch, RM)
  CG-->>Engine: (can_use, attn_meta, spec_meta, key)

  alt can_use and CG.needs_capture(key)
    Engine->>CG: capture(key, forward_fn, inputs)
    CG-->>Engine: output_ref
  else can_use
    Engine->>CG: replay(key, inputs)
    CG-->>Engine: output_ref
  else not usable
    Engine->>Engine: run eager forward
  end
sequenceDiagram
  autonumber
  participant Drafter as ModelDrafter
  participant Engine as ModelEngine

  Note over Drafter: For non-context requests\npad input_tokens to max_draft_tokens
  Drafter->>Drafter: _create_accepted_tokens_request\n(set py_is_first_draft=True, pad tokens)
  Drafter->>Engine: _forward_draft_model(no_cuda_graph)
  Engine-->>Drafter: logits / draft outputs

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • yweng0828
  • mikeiovine
  • ziyixiong-nv
  • yuxianq


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🧹 Nitpick comments (10)
tensorrt_llm/_torch/pyexecutor/llm_request.py (2)

303-304: Plumb and document the new flag across creators; consider explicit docstring.

You added is_first_draft but it’s not documented and isn’t set in executor_request_to_llm_request (defaults False). If external creators should be able to set it, add a short param docstring and, if applicable, thread it from ExecutorRequest; otherwise, clarify it’s internal-only.

 class LlmRequest(...):
-    def __init__(..., is_first_draft: Optional[bool] = False, **kwargs):
+    def __init__(..., is_first_draft: Optional[bool] = False, **kwargs):
+        """...
+        Args:
+            ...
+            is_first_draft: Internal flag used by Eagle3 first-draft flow.
+        """

1-1: Missing NVIDIA copyright header.

Per repo guidelines, prepend the 2025 NVIDIA copyright header to this file.

tensorrt_llm/_torch/speculative/interface.py (1)

1-1: Missing NVIDIA copyright header.

Add the header as required.

tensorrt_llm/_torch/speculative/model_drafter.py (1)

1-1: Missing NVIDIA copyright header.

Add the header as required.

tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (5)

56-69: Graph key derivation: handle non-Eagle3 managers and keep bool out of semantics unless needed.

Logic is fine. Minor: add a space after comma, and prefer returning a 2-tuple unless the boolean truly changes shapes to reduce graph proliferation.

-    def get_graph_key(self, batch_size,spec_resource_manager: Optional[BaseResourceManager] = None):
+    def get_graph_key(
+        self, batch_size: int, spec_resource_manager: Optional[BaseResourceManager] = None
+    ):

93-108: maybe_get_cuda_graph: update docstring and fix long lines.

Docstring mentions 3-tuple but you return 4. Also address E501 by breaking long returns.

-        Returns a tuple containing:
-        - A boolean indicating if a graph can be used.
-        - The attn_metadata for the graph, if applicable.
-        - The spec_metadata for the graph, if applicable.
+        Returns:
+            (can_use: bool, attn_metadata, spec_metadata, key)
@@
-            return False, None, None,None
+            return False, None, None, None
@@
-                return False, None, None,None
+                return False, None, None, None
@@
-            return False, None, None,None
+            return False, None, None, None

147-150: needs_capture(key): accessor looks good; keep signature consistent.

Minor nit: remove trailing blank line.


183-188: Avoid print in library code; use logger.

Replace print with logger.debug to integrate with existing logging.

-        print(f"capture graph {key}")
+        from tensorrt_llm.logger import logger
+        logger.debug("capture graph %s", key)

1-1: Missing NVIDIA copyright header.

Add the header as required.

tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

1-1: Missing NVIDIA copyright header.

Add the header as required.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 091b67a and a47b775.

📒 Files selected for processing (6)
  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (8 hunks)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (12 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (0 hunks)
  • tensorrt_llm/_torch/speculative/interface.py (2 hunks)
  • tensorrt_llm/_torch/speculative/model_drafter.py (3 hunks)
💤 Files with no reviewable changes (1)
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,cc,cxx,cu,py,h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use spaces only; indent 4 spaces

Files:

  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tensorrt_llm/_torch/speculative/interface.py
  • tensorrt_llm/_torch/speculative/model_drafter.py
  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Indent with 4 spaces; do not use tabs
Preserve module namespaces in imports; prefer from package.subpackage import module then module.Symbol
Python file names use snake_case (e.g., some_file.py)
Class names use PascalCase
Function and method names use snake_case
Local variable names use snake_case; if starting with a number, prefix k (e.g., k_99th_percentile)
Global variables use G_ prefix and UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE
Avoid shadowing variables from an outer scope
Initialize all externally visible members of a class in init
For interfaces used outside a file, prefer docstrings; reserve comments for internal code or local interfaces
Use Google-style docstrings for classes and functions; document attributes/variables inline as shown
Avoid reflection when simple, explicit code suffices (e.g., prefer def make_complex(x,y) over locals()/dict tricks)
Catch the narrowest exceptions possible in try/except
For duck-typing try/except, keep try body minimal and use else for main logic

Files:

  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tensorrt_llm/_torch/speculative/interface.py
  • tensorrt_llm/_torch/speculative/model_drafter.py
  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tensorrt_llm/_torch/speculative/interface.py
  • tensorrt_llm/_torch/speculative/model_drafter.py
  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
🧬 Code graph analysis (4)
tensorrt_llm/_torch/speculative/interface.py (1)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
  • BaseResourceManager (58-81)
tensorrt_llm/_torch/speculative/model_drafter.py (3)
tensorrt_llm/runtime/generation.py (1)
  • max_draft_tokens (1283-1286)
cpp/include/tensorrt_llm/batch_manager/llmRequest.h (1)
  • LlmRequestState (45-201)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
  • forward (79-87)
  • forward (2155-2246)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (5)
tensorrt_llm/_torch/utils.py (1)
  • make_weak_ref (79-93)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (3)
  • ResourceManager (982-1017)
  • ResourceManagerType (46-51)
  • BaseResourceManager (58-81)
tensorrt_llm/_torch/pyexecutor/scheduler.py (3)
  • ScheduledRequests (18-39)
  • batch_size (35-36)
  • can_run_cuda_graph (31-32)
tensorrt_llm/_torch/speculative/eagle3.py (1)
  • Eagle3ResourceManager (18-77)
tensorrt_llm/_torch/attention_backend/interface.py (1)
  • create_cuda_graph_metadata (285-325)
tensorrt_llm/_torch/pyexecutor/model_engine.py (3)
tensorrt_llm/_torch/speculative/eagle3.py (2)
  • Eagle3ResourceManager (18-77)
  • forward (281-392)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (5)
  • spec_metadata (71-72)
  • maybe_get_cuda_graph (93-145)
  • needs_capture (147-149)
  • capture (151-203)
  • replay (205-232)
tensorrt_llm/_torch/speculative/interface.py (1)
  • attention_need_spec_dec_mode (95-99)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py

50-50: Undefined name Request

(F821)


58-58: Line too long (132 > 120)

(E501)


93-93: Line too long (122 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (15)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)

357-358: Ensure child requests carry the intended first-draft state.

py_is_first_draft is copied in create_child_request via the generic py_* cloning, which is fine. Verify downstream logic resets it when converting prefill→gen (it does in the drafter), and that no paths rely on the C++ side for this flag.

tensorrt_llm/_torch/speculative/interface.py (1)

10-10: Import is fine; keep this type strictly for typing to avoid cycles.

Using BaseResourceManager here is OK. If circular import ever appears, guard it with TYPE_CHECKING.

tensorrt_llm/_torch/speculative/model_drafter.py (3)

123-126: First-draft flag semantics look right; verify only first iteration is marked.

You set py_is_first_draft=True here and later flip it to False after the first forward. That matches the intended CUDA graph keying. Keep an eye on retries/recapture paths to ensure it’s not accidentally re-marked.


360-362: Correctly clearing first-draft markers.

Resetting py_is_first_draft to False before merging context→generation is correct and prevents stale keys in CUDA graph selection. Good.


232-238: Guard device lookup or pass explicit tensor device
Replace the ambiguous lookup with a safe getattr, or—if the engine expects the tensors’ device—expose and pass that field (e.g. previous_batch.tensors.device):

-        new_tensors_device = previous_batch.device if previous_batch else None
+        new_tensors_device = getattr(previous_batch, "device", None) if previous_batch else None

If draft_model_engine.forward requires the full SampleStateTensors, add an explicit previous_batch.tensors property on SampleState and use its .device.

tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (3)

44-49: Keyed graph maps: good generalization.

Supporting both (batch, draft_len) and (batch, draft_len, is_first) is sensible for Eagle3. Keep key construction consistent across ranks.


135-146: Use key[1] as max_draft_tokens: OK; ensure attn backend agrees with Eagle3 semantics.

Passing draft_len from key into create_cuda_graph_metadata matches the updated interface.


274-288: Padding dummy request: also register with SPEC_RESOURCE_MANAGER (done) and free on clear.

You add the dummy to spec manager; ensure it’s freed when clearing graphs (clear() currently resets structures but not freeing the kv/spec side dummy). Consider freeing or letting managers clean on next warmup.

tensorrt_llm/_torch/pyexecutor/model_engine.py (7)

396-406: Use of original_max_draft_len for buffer sizing: good change.

This decouples runtime mutations from allocation sizing.


778-780: Warmup coverage for draft engines: good.

Capturing graphs for original_max_draft_len in draft engines is necessary for first-draft replay.


797-804: Spec resource manager first-draft toggling: verify multi-rank consistency.

You flip Eagle3ResourceManager.is_first_draft during warmup. Ensure all ranks see the same flag before capture to avoid divergent keys.


1283-1294: Partitioning first_draft_requests: solid, but ensure py_is_first_draft is set only for those.

Upstream drafter sets py_is_first_draft. This partitioning aligns with CUDA-graph keying.


1661-1663: Accounting for first-draft requests in token totals: OK.

Including len(first_draft_requests) in num_generation_tokens is reasonable for scheduling/metrics.


2175-2179: Updated spec-dec flag wiring: OK with the None-safe fix in interface.py.

Once the None-safe guard lands, this callsite is correct.


2201-2243: Key-based CUDA graph integration: correct usage.

Adopts needs_capture/capture/replay with the new key; looks consistent with the runner.

@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute (PRs initiated from Community) label Aug 29, 2025
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

♻️ Duplicate comments (2)
tensorrt_llm/_torch/speculative/interface.py (1)

78-89: Fix None-safety and unreachable return; clarify EAGLE3 logic.

  • spec_resource_manager may be None and BaseResourceManager doesn’t declare is_first_draft; use getattr with a default.
  • The second return is dead code.
  • Make the parameter Optional and return a bool explicitly.
-    def attention_need_spec_dec_mode(self,
-                                     spec_resource_manager: BaseResourceManager,
-                                     is_draft_model: bool):
+    def attention_need_spec_dec_mode(
+            self,
+            spec_resource_manager: Optional[BaseResourceManager],
+            is_draft_model: bool) -> bool:
@@
-        return self.is_eagle3_one_model() or (
-            self.is_eagle3() and spec_resource_manager.is_first_draft) or (
-                self.is_eagle3() and not is_draft_model)
-        return self.is_eagle3_one_model() or (
-            self.is_eagle3() and spec_resource_manager.is_first_draft)
+        if self.is_eagle3_one_model():
+            return True
+        if not self.is_eagle3():
+            return False
+        is_first = getattr(spec_resource_manager, "is_first_draft", False)
+        return is_first or not is_draft_model
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

294-296: Don’t mutate shared spec_config; also guard None.

This can NPE when spec_config is None and leaks mutations across engines.

-        self.original_max_draft_len = spec_config.max_draft_len
-        if spec_config is not None and is_draft_model:
-            spec_config.max_draft_len = 0
-        self.spec_config = spec_config
+        self.original_max_draft_len = spec_config.max_draft_len if spec_config else 0
+        self.spec_config = copy.deepcopy(spec_config) if spec_config else None
+        if self.spec_config is not None and is_draft_model:
+            self.spec_config.max_draft_len = 0
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between a47b775 and 932927d.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (13 hunks)
  • tensorrt_llm/_torch/speculative/interface.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}: Use spaces only; no tabs; indent with 4 spaces
Prepend NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/speculative/interface.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Indent Python with 4 spaces; no tabs
Preserve module namespaces when importing: from package.subpackage import foo; then call foo.SomeClass() instead of importing the class directly
Python naming: files snake_case; classes PascalCase; functions/methods snake_case; locals snake_case (prefix k_ when starting with a number); globals UPPER_SNAKE_CASE with G_ prefix; constants UPPER_SNAKE_CASE
Avoid shadowing outer-scope variables; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; limit comments to function-internal or file-local interfaces
Use Google-style docstrings for classes and functions; document attributes/variables inline so Sphinx can render them
Avoid reflection when simpler alternatives exist; prefer explicit parameters and return dicts over locals()/dynamic tricks
In try/except, catch the narrowest exceptions possible; keep try bodies minimal and use else for the main logic when doing duck-typing checks

Files:

  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/speculative/interface.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (4)
tensorrt_llm/_torch/speculative/eagle3.py (1)
  • Eagle3ResourceManager (18-77)
tensorrt_llm/_torch/pyexecutor/llm_request.py (3)
  • append (79-98)
  • append (125-142)
  • get_draft_token_length (560-571)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (5)
  • spec_metadata (71-72)
  • maybe_get_cuda_graph (93-145)
  • needs_capture (147-149)
  • capture (151-203)
  • replay (205-232)
tensorrt_llm/_torch/speculative/interface.py (1)
  • attention_need_spec_dec_mode (78-88)
tensorrt_llm/_torch/speculative/interface.py (1)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
  • BaseResourceManager (58-81)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)

769-780: CUDA-graph warmup coverage looks good; confirm memory headroom.

Adding original_max_draft_len for draft engines increases capture shapes. Validate graph pool capacity on large drafts to avoid recapture/fallback.

Would you like a small script to scan WARMUP_STEPS, capture shapes, and peak pool usage from logs to ensure no recapture occurs?


1656-1656: Double-check token metrics to avoid double-counting
Ensure num_generation_tokens = len(generation_requests) + len(extend_requests) + sum(draft_lens) + len(first_draft_requests) does not overlap with downstream counts—verify that schedulers or your logging/metrics (e.g., in process_gen_iterlog.py) don’t also add draft_lens or accepted-draft tokens separately.

@sunnyqgg sunnyqgg changed the title from "[TRTLLM-7330][feat]Eagle3 cuda graph support for the 1st draft model inference" to "[TRTLLM-7330][feat] Eagle3 cuda graph support for the 1st draft model inference" on Sep 1, 2025
@sunnyqgg
Collaborator Author

sunnyqgg commented Sep 1, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #17132 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17132 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #12883 completed with status: 'FAILURE'

@sunnyqgg sunnyqgg force-pushed the qgai/eagle-cuda-graph branch from cf549d0 to 90b7e17 on September 1, 2025 13:17
@sunnyqgg sunnyqgg changed the title from "[TRTLLM-7330][feat] Eagle3 cuda graph support for the 1st draft model inference" to "[TRTLLM-7330][feat] Eagle3 cuda graph support for the first draft model inference" on Sep 1, 2025
@sunnyqgg sunnyqgg changed the title to "[TRTLLM-7330][feat] Eagle3 cuda graph support for the first draft model inference" on Sep 1, 2025
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (3)
tensorrt_llm/_torch/speculative/model_drafter.py (1)

119-126: Fix padding: handle Tensor/list, avoid negative pad, and don’t mutate upstream token container

input_tokens.extend(...) will fail when input_tokens is a Tensor and may compute a negative pad. Convert to a list, clamp pad to ≥0, and pass the copy to the request.

Apply this diff:

-        #pad input_tokens to max_draft_tokens
-        input_tokens.extend(
-            0 for _ in range(self.max_draft_tokens - num_accepted_tokens))
-        new_request = self._create_draft_request(request, input_tokens)
+        # Ensure list and pad to max_draft_tokens
+        if isinstance(input_tokens, torch.Tensor):
+            tokens = input_tokens.tolist()
+        else:
+            tokens = list(input_tokens)
+        pad = max(0, self.max_draft_tokens - num_accepted_tokens)
+        tokens.extend([0] * pad)
+        new_request = self._create_draft_request(request, tokens)
         new_request.state = LlmRequestState.GENERATION_IN_PROGRESS
-        new_request.py_num_accepted_draft_tokens = request.py_num_accepted_draft_tokens
+        new_request.py_num_accepted_draft_tokens = request.py_num_accepted_draft_tokens
         new_request.py_is_first_draft = True
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)

16-18: Add guarded import for LlmRequest used in annotations

Ensures the forward-ref resolves for type checkers without causing import cycles.

 if TYPE_CHECKING:
     from .model_engine import PyTorchModelEngine
+    from .llm_request import LlmRequest

54-55: Fix undefined type in annotation (Ruff F821): use LlmRequest in TYPE_CHECKING

Optional["Request"] is undefined at runtime and breaks lint/CI.

Apply this diff here:

-        self.padding_dummy_request: Optional["Request"] = None
+        self.padding_dummy_request: Optional["LlmRequest"] = None

And add the guarded import (see separate comment on Lines 16-18).

🧹 Nitpick comments (7)
tensorrt_llm/_torch/speculative/model_drafter.py (2)

115-118: Docstring/state mismatch for accepted-tokens path

Docstring says “chunked context request,” but state is set to GENERATION_IN_PROGRESS. Confirm intent and align doc/comment accordingly.


1-1: Add NVIDIA copyright header

Per repo guidelines, prepend the header.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (5)

27-30: Update class docstring to reflect new 2- or 3-tuple graph keys

Doc still says “(batch size, draft_len)”; keys can also be (batch_size, draft_len, is_first_draft).

-    multiple graphs, keyed by (batch size, draft_len).
+    multiple graphs, keyed by (batch_size, draft_len) or
+    (batch_size, draft_len, is_first_draft) for Eagle3 first-draft flow.

101-112: Update maybe_get_cuda_graph docstring to include returned key

The function now returns 4 values (including the key).

-        Returns a tuple containing:
+        Returns a tuple containing:
         - A boolean indicating if a graph can be used.
         - The attn_metadata for the graph, if applicable.
         - The spec_metadata for the graph, if applicable.
+        - The graph key used for lookup/capture/replay.

19-21: Correct misleading comment about dummy request ID

(1 << 64) - 1 isn’t a prime; it’s a sentinel. Update the comment.

-# A large prime number used for dummy request IDs to avoid collisions
+# Sentinel 64-bit ID used for CUDA-graph dummy requests to avoid collisions
 CUDA_GRAPH_DUMMY_REQUEST_ID = (1 << 64) - 1

198-198: Use logger instead of print

Prefer repo logger to avoid stdout noise.

-        print(f"capture graph {key}")
+        logger.debug(f"capture graph {key}")

Add the import near the other imports:

+from tensorrt_llm.logger import logger

1-1: Add NVIDIA copyright header

Per repo guidelines, prepend the header.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 932927d and 90b7e17.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (8 hunks)
  • tensorrt_llm/_torch/speculative/model_drafter.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}: Use spaces only; no tabs; indent with 4 spaces
Prepend NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
  • tensorrt_llm/_torch/speculative/model_drafter.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Indent Python with 4 spaces; no tabs
Preserve module namespaces when importing: from package.subpackage import foo; then call foo.SomeClass() instead of importing the class directly
Python naming: files snake_case; classes PascalCase; functions/methods snake_case; locals snake_case (prefix k_ when starting with a number); globals UPPER_SNAKE_CASE with G_ prefix; constants UPPER_SNAKE_CASE
Avoid shadowing outer-scope variables; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; limit comments to function-internal or file-local interfaces
Use Google-style docstrings for classes and functions; document attributes/variables inline so Sphinx can render them
Avoid reflection when simpler alternatives exist; prefer explicit parameters and return dicts over locals()/dynamic tricks
In try/except, catch the narrowest exceptions possible; keep try bodies minimal and use else for the main logic when doing duck-typing checks

Files:

  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
  • tensorrt_llm/_torch/speculative/model_drafter.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (7)
tensorrt_llm/_torch/expert_statistic.py (1)
  • ExpertStatistic (10-98)
tensorrt_llm/_torch/modules/multi_stream_utils.py (1)
  • with_multi_stream (26-32)
tensorrt_llm/_torch/speculative/eagle3.py (1)
  • Eagle3ResourceManager (18-77)
tensorrt_llm/_torch/utils.py (2)
  • make_weak_ref (79-93)
  • piecewise_cuda_graph (271-277)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (3)
  • BaseResourceManager (58-81)
  • ResourceManager (982-1017)
  • ResourceManagerType (46-51)
tensorrt_llm/_torch/pyexecutor/scheduler.py (3)
  • batch_size (35-36)
  • ScheduledRequests (18-39)
  • can_run_cuda_graph (31-32)
tensorrt_llm/_torch/speculative/interface.py (1)
  • create_cuda_graph_metadata (157-168)
tensorrt_llm/_torch/speculative/model_drafter.py (4)
tensorrt_llm/runtime/generation.py (1)
  • max_draft_tokens (1283-1286)
cpp/include/tensorrt_llm/batch_manager/llmRequest.h (1)
  • LlmRequestState (45-201)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
  • forward (79-87)
  • forward (2149-2241)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
  • forward (1442-1449)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py

53-53: Undefined name Request

(F821)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/_torch/speculative/model_drafter.py (1)

360-362: First-draft flag reset — confirm lifecycle

Resetting py_is_first_draft = False only on pre-existing generation requests seems right after the first forward. Please confirm no new “first draft” requests are appended later in this iteration that should remain True.

tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)

171-175: Verify token-per-request formula

token_per_request = key[1] + 1 assumes +1 for the next token; confirm this matches attn/spec metadata expectations for first-draft vs subsequent graphs.

Collaborator

@mikeiovine mikeiovine left a comment

Things seem a bit tightly coupled with eagle3 right now. I'm wondering if we can generalize a bit:

  1. At the very least, we can remove all isinstance(spec_resource_manager, Eagle3ResourceManager) calls. We can instead use the more general spec_mode.needs_kv_cache_recompute() function.

  2. It's not really ideal that we're adding yet another implicit request type in ModelEngine's prepare inputs. Now we have 5 paths: extend (generation w/ draft), generation, context, chunked context, first draft requests. Is there a way to clean this up? I had something more generic in mind where ModelEngine would just expose some limited support for running chunked context + CUDA graphs. This would have the advantage of automatically working with the new CDL stuff.

@mikeiovine mikeiovine requested a review from QiJune September 2, 2025 21:11
@sunnyqgg sunnyqgg force-pushed the qgai/eagle-cuda-graph branch from 90b7e17 to 376b696 on September 4, 2025 07:15
@sunnyqgg sunnyqgg requested a review from longlee0622 September 4, 2025 08:00
@sunnyqgg
Collaborator Author

sunnyqgg commented Sep 4, 2025

Things seem a bit tightly coupled with eagle3 right now. I'm wondering if we can generalize a bit:

  1. At the very least, we can remove all isinstance(spec_resource_manager, Eagle3ResourceManager) calls. We can instead use the more general spec_mode.needs_kv_cache_recompute() function.
  2. It's not really ideal that we're adding yet another implicit request type in ModelEngine's prepare inputs. Now we have 5 paths: extend (generation w/ draft), generation, context, chunked context, first draft requests. Is there a way to clean this up? I had something more generic in mind where ModelEngine would just expose some limited support for running chunked context + CUDA graphs. This would have the advantage of automatically working with the new CDL stuff.

Hi Mike,

  1. isinstance(spec_resource_manager, Eagle3ResourceManager) is used because spec_resource_manager.is_first_draft is needed later and is specific to Eagle3 with the older pipeline; we can optimize it with spec_mode.needs_kv_cache_recompute() once the older pipeline is deprecated. What do you think about this?
  2. First draft requests are generation requests. I tried to fuse them into extend_requests but ultimately gave up because it's clearer to keep them separate:
    context loop: general context and chunked context
    extend loop: target model verification
    generation loop: general generation, and draft model inference except the first step
    first draft requests loop: draft model inference for the first step, and draft model inference for all steps in the new ChainDrafter pipeline
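
For illustration only (this is not code from the PR), a short sketch of how generation-phase requests could be split into those loops; py_is_first_draft matches the flag added in this PR, while py_draft_tokens and the overall structure are assumptions.

from typing import Any, Iterable, List, Tuple

def partition_generation_requests(
        requests: Iterable[Any]) -> Tuple[List[Any], List[Any], List[Any]]:
    """Split generation-phase requests into the decode-side loops (sketch)."""
    extend_requests: List[Any] = []       # target-model verification of draft tokens
    first_draft_requests: List[Any] = []  # first draft-model step (Eagle3 / ChainDrafter)
    generation_requests: List[Any] = []   # plain decode and later draft steps
    for req in requests:
        if getattr(req, "py_is_first_draft", False):
            first_draft_requests.append(req)
        elif getattr(req, "py_draft_tokens", None):
            extend_requests.append(req)
        else:
            generation_requests.append(req)
    return extend_requests, first_draft_requests, generation_requests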

@sunnyqgg sunnyqgg force-pushed the qgai/eagle-cuda-graph branch from 31578df to 155bf00 on September 7, 2025 13:37
@sunnyqgg sunnyqgg force-pushed the qgai/eagle-cuda-graph branch from 1a2014e to cc66c07 on September 8, 2025 08:27
@mikeiovine
Collaborator

Thanks! We should just address the nitpicks before merging. Accepted to unblock

@mikeiovine mikeiovine dismissed yweng0828’s stale review September 24, 2025 19:00

Requested changes have been applied

Signed-off-by: qgai <qgai@nvidia.com>
@sunnyqgg sunnyqgg force-pushed the qgai/eagle-cuda-graph branch from f1d8c45 to a7f98d3 on September 25, 2025 02:16
@sunnyqgg
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #19867 [ run ] triggered by Bot

@sunnyqgg sunnyqgg force-pushed the qgai/eagle-cuda-graph branch from a7f98d3 to 010bdfb on September 25, 2025 02:35
@sunnyqgg
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #19871 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19867 [ run ] completed with state ABORTED
LLM/main/L0_MergeRequest_PR #14950 (Blue Ocean) completed with status: ABORTED

@tensorrt-cicd
Collaborator

PR_Github #19871 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14954 completed with status: 'FAILURE'

Signed-off-by: qgai <qgai@nvidia.com>
@sunnyqgg sunnyqgg force-pushed the qgai/eagle-cuda-graph branch from 010bdfb to 4b3fe62 on September 25, 2025 08:27
@sunnyqgg
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #19918 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19918 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14994 completed with status: 'FAILURE'

@sunnyqgg
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #19958 [ run ] triggered by Bot

@sunnyqgg
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #19961 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19958 [ run ] completed with state ABORTED
LLM/main/L0_MergeRequest_PR #15027 (Blue Ocean) completed with status: ABORTED

@mikeiovine mikeiovine removed the Community want to contribute (PRs initiated from Community) label Sep 25, 2025
@tensorrt-cicd
Collaborator

PR_Github #19961 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15028 completed with status: 'FAILURE'

@sunnyqgg
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #20002 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #20002 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15064 completed with status: 'SUCCESS'

@QiJune QiJune merged commit 2e5850c into NVIDIA:main Sep 26, 2025
5 checks passed