[TRTLLM-7330][feat] Eagle3 cuda graph support for the first draft model inference #7363
Conversation
📝 Walkthrough
Adds first-draft awareness and Eagle3 resource-manager integration; extends CUDA-graph keying to support 2- and 3-tuple keys, updates CUDA-graph APIs to be key-based, and propagates draft/original draft-length handling through the model engine, drafter, and speculative interfaces.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Drafter as ModelDrafter
    participant Engine as ModelEngine
    participant RM as Eagle3ResourceManager
    participant Spec as SpecDecMode
    participant CG as CUDAGraphRunner
    Note over Drafter: Build batch\n(flag py_is_first_draft set on requests)
    Drafter->>Engine: forward(batch, new_tensors_device?)
    Engine->>RM: set is_first_draft based on draft_len
    Engine->>Spec: attention_need_spec_dec_mode(RM, is_draft_model)
    Spec-->>Engine: decision
    Engine->>CG: maybe_get_cuda_graph(batch, RM)
    CG-->>Engine: (can_use, attn_meta, spec_meta, key)
    alt can_use and CG.needs_capture(key)
        Engine->>CG: capture(key, forward_fn, inputs)
        CG-->>Engine: output_ref
    else can_use
        Engine->>CG: replay(key, inputs)
        CG-->>Engine: output_ref
    else not usable
        Engine->>Engine: run eager forward
    end
```
```mermaid
sequenceDiagram
    autonumber
    participant Drafter as ModelDrafter
    participant Engine as ModelEngine
    Note over Drafter: For non-context requests\npad input_tokens to max_draft_tokens
    Drafter->>Drafter: _create_accepted_tokens_request\n(set py_is_first_draft=True, pad tokens)
    Drafter->>Engine: _forward_draft_model(no_cuda_graph)
    Engine-->>Drafter: logits / draft outputs
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 5
🧹 Nitpick comments (10)
tensorrt_llm/_torch/pyexecutor/llm_request.py (2)
303-304: Plumb and document the new flag across creators; consider an explicit docstring. You added is_first_draft but it's not documented and isn't set in executor_request_to_llm_request (defaults False). If external creators should be able to set it, add a short param docstring and, if applicable, thread it from ExecutorRequest; otherwise, clarify it's internal-only.
```diff
 class LlmRequest(...):
-    def __init__(..., is_first_draft: Optional[bool] = False, **kwargs):
+    def __init__(..., is_first_draft: Optional[bool] = False, **kwargs):
+        """...
+        Args:
+            ...
+            is_first_draft: Internal flag used by Eagle3 first-draft flow.
+        """
```
1-1: Missing NVIDIA copyright header. Per repo guidelines, prepend the 2025 NVIDIA copyright header to this file.
tensorrt_llm/_torch/speculative/interface.py (1)
1-1: Missing NVIDIA copyright header. Add the header as required.
tensorrt_llm/_torch/speculative/model_drafter.py (1)
1-1: Missing NVIDIA copyright header. Add the header as required.
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (5)
56-69: Graph key derivation: handle non-Eagle3 managers and keep the boolean out of the key unless needed. Logic is fine. Minor: add a space after the comma, and prefer returning a 2-tuple unless the boolean truly changes shapes, to reduce graph proliferation.
```diff
-    def get_graph_key(self, batch_size,spec_resource_manager: Optional[BaseResourceManager] = None):
+    def get_graph_key(
+        self, batch_size: int, spec_resource_manager: Optional[BaseResourceManager] = None
+    ):
```
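For illustration, here is a minimal standalone sketch of the 2-/3-tuple keying discussed in this review; the helper name make_graph_key and the way the Eagle3 flag is read are assumptions for the sketch, not the PR's exact code:

```python
from typing import Optional, Tuple, Union


def make_graph_key(
    batch_size: int,
    draft_len: int,
    is_draft_model: bool,
    spec_resource_manager: Optional[object] = None,
) -> Union[Tuple[int, int], Tuple[int, int, bool]]:
    """Sketch of the keying scheme: add is_first_draft only when it matters."""
    # Eagle3 draft engines get a 3-tuple so first-draft graphs are cached
    # separately from later draft iterations; everything else keeps the
    # original (batch_size, draft_len) 2-tuple to limit graph proliferation.
    is_first_draft = getattr(spec_resource_manager, "is_first_draft", None)
    if is_draft_model and is_first_draft is not None:
        return (batch_size, draft_len, bool(is_first_draft))
    return (batch_size, draft_len)
```

The 2-tuple fallback also matches the reviewer's point above: only widen the key when the boolean actually changes the captured shapes.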
93-108: maybe_get_cuda_graph: update docstring and fix long lines. Docstring mentions a 3-tuple but you return 4. Also address E501 by breaking long returns.
```diff
-        Returns a tuple containing:
-            - A boolean indicating if a graph can be used.
-            - The attn_metadata for the graph, if applicable.
-            - The spec_metadata for the graph, if applicable.
+        Returns:
+            (can_use: bool, attn_metadata, spec_metadata, key)
@@
-        return False, None, None,None
+        return False, None, None, None
@@
-        return False, None, None,None
+        return False, None, None, None
@@
-        return False, None, None,None
+        return False, None, None, None
```
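For context, the caller-side pattern implied by the 4-value return and the key-based needs_capture/capture/replay API (see the sequence diagram above) looks roughly like this; runner, batch, forward_fn, and inputs are placeholders, not the engine's exact names:

```python
def run_with_optional_cuda_graph(runner, batch, spec_resource_manager, forward_fn, inputs):
    """Sketch of key-based CUDA-graph dispatch; falls back to eager execution."""
    can_use, attn_metadata, spec_metadata, key = runner.maybe_get_cuda_graph(
        batch, spec_resource_manager)
    # (In the real engine, attn_metadata/spec_metadata would be wired into inputs.)
    if not can_use:
        return forward_fn(inputs)  # graph not applicable: run eagerly
    if runner.needs_capture(key):
        # First time this key is seen: capture the graph and return its output ref.
        return runner.capture(key, forward_fn, inputs)
    # Graph already captured for this key: replay it with the new inputs.
    return runner.replay(key, inputs)
```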
147-150: needs_capture(key): accessor looks good; keep signature consistent. Minor nit: remove trailing blank line.
183-188: Avoid print in library code; use logger. Replace print with logger.debug to integrate with existing logging.
```diff
-        print(f"capture graph {key}")
+        from tensorrt_llm.logger import logger
+        logger.debug("capture graph %s", key)
```
1-1: Missing NVIDIA copyright header. Add the header as required.
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
1-1: Missing NVIDIA copyright header. Add the header as required.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
- tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (8 hunks)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (12 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (0 hunks)
- tensorrt_llm/_torch/speculative/interface.py (2 hunks)
- tensorrt_llm/_torch/speculative/model_drafter.py (3 hunks)
💤 Files with no reviewable changes (1)
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,cc,cxx,cu,py,h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use spaces only; indent 4 spaces
Files:
- tensorrt_llm/_torch/pyexecutor/llm_request.py
- tensorrt_llm/_torch/speculative/interface.py
- tensorrt_llm/_torch/speculative/model_drafter.py
- tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
- tensorrt_llm/_torch/pyexecutor/model_engine.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Indent with 4 spaces; do not use tabs
Preserve module namespaces in imports; prefer from package.subpackage import module then module.Symbol
Python file names use snake_case (e.g., some_file.py)
Class names use PascalCase
Function and method names use snake_case
Local variable names use snake_case; if starting with a number, prefix k (e.g., k_99th_percentile)
Global variables use G_ prefix and UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE
Avoid shadowing variables from an outer scope
Initialize all externally visible members of a class in __init__
For interfaces used outside a file, prefer docstrings; reserve comments for internal code or local interfaces
Use Google-style docstrings for classes and functions; document attributes/variables inline as shown
Avoid reflection when simple, explicit code suffices (e.g., prefer def make_complex(x,y) over locals()/dict tricks)
Catch the narrowest exceptions possible in try/except
For duck-typing try/except, keep try body minimal and use else for main logic
Files:
- tensorrt_llm/_torch/pyexecutor/llm_request.py
- tensorrt_llm/_torch/speculative/interface.py
- tensorrt_llm/_torch/speculative/model_drafter.py
- tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
- tensorrt_llm/_torch/pyexecutor/model_engine.py
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
- tensorrt_llm/_torch/pyexecutor/llm_request.py
- tensorrt_llm/_torch/speculative/interface.py
- tensorrt_llm/_torch/speculative/model_drafter.py
- tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
- tensorrt_llm/_torch/pyexecutor/model_engine.py
🧬 Code graph analysis (4)
tensorrt_llm/_torch/speculative/interface.py (1)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
BaseResourceManager(58-81)
tensorrt_llm/_torch/speculative/model_drafter.py (3)
tensorrt_llm/runtime/generation.py (1)
- max_draft_tokens (1283-1286)
cpp/include/tensorrt_llm/batch_manager/llmRequest.h (1)
- LlmRequestState (45-201)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
- forward (79-87)
- forward (2155-2246)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (5)
tensorrt_llm/_torch/utils.py (1)
- make_weak_ref (79-93)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (3)
- ResourceManager (982-1017)
- ResourceManagerType (46-51)
- BaseResourceManager (58-81)
tensorrt_llm/_torch/pyexecutor/scheduler.py (3)
- ScheduledRequests (18-39)
- batch_size (35-36)
- can_run_cuda_graph (31-32)
tensorrt_llm/_torch/speculative/eagle3.py (1)
- Eagle3ResourceManager (18-77)
tensorrt_llm/_torch/attention_backend/interface.py (1)
- create_cuda_graph_metadata (285-325)
tensorrt_llm/_torch/pyexecutor/model_engine.py (3)
tensorrt_llm/_torch/speculative/eagle3.py (2)
- Eagle3ResourceManager (18-77)
- forward (281-392)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (5)
- spec_metadata (71-72)
- maybe_get_cuda_graph (93-145)
- needs_capture (147-149)
- capture (151-203)
- replay (205-232)
tensorrt_llm/_torch/speculative/interface.py (1)
- attention_need_spec_dec_mode (95-99)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
50-50: Undefined name Request
(F821)
58-58: Line too long (132 > 120)
(E501)
93-93: Line too long (122 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (15)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
357-358: Ensure child requests carry the intended first-draft state. py_is_first_draft is copied in create_child_request via the generic py_* cloning, which is fine. Verify downstream logic resets it when converting prefill→gen (it does in the drafter), and that no paths rely on the C++ side for this flag.
tensorrt_llm/_torch/speculative/interface.py (1)
10-10: Import is fine; keep this type strictly for typing to avoid cycles. Using BaseResourceManager here is OK. If a circular import ever appears, guard it with TYPE_CHECKING.
tensorrt_llm/_torch/speculative/model_drafter.py (3)
123-126: First-draft flag semantics look right; verify only the first iteration is marked. You set py_is_first_draft=True here and later flip it to False after the first forward. That matches the intended CUDA graph keying. Keep an eye on retries/recapture paths to ensure it's not accidentally re-marked.
360-362: Correctly clearing first-draft markers. Resetting py_is_first_draft to False before merging context→generation is correct and prevents stale keys in CUDA graph selection. Good.
232-238: Guard device lookup or pass explicit tensor device
Replace the ambiguous lookup with a safe getattr, or, if the engine expects the tensors' device, expose and pass that field (e.g. previous_batch.tensors.device):

```diff
-        new_tensors_device = previous_batch.device if previous_batch else None
+        new_tensors_device = getattr(previous_batch, "device", None) if previous_batch else None
```

If draft_model_engine.forward requires the full SampleStateTensors, add an explicit previous_batch.tensors property on SampleState and use its .device.

tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (3)
44-49: Keyed graph maps: good generalization. Supporting both (batch, draft_len) and (batch, draft_len, is_first) is sensible for Eagle3. Keep key construction consistent across ranks.
135-146: Use key[1] as max_draft_tokens: OK; ensure attn backend agrees with Eagle3 semantics. Passing draft_len from key into create_cuda_graph_metadata matches the updated interface.
274-288: Padding dummy request: also register with SPEC_RESOURCE_MANAGER (done) and free on clear. You add the dummy to the spec manager; ensure it's freed when clearing graphs (clear() currently resets structures but does not free the kv/spec-side dummy). Consider freeing it, or letting managers clean up on the next warmup.
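As one possible shape for that cleanup, a minimal sketch assuming the resource managers expose a free_resources(request) hook (that method name is an assumption here, not verified against the PR):

```python
def release_padding_dummy(runner, kv_cache_manager=None, spec_resource_manager=None):
    """Sketch: free the CUDA-graph padding dummy when graphs are cleared."""
    dummy = getattr(runner, "padding_dummy_request", None)
    if dummy is None:
        return
    for manager in (kv_cache_manager, spec_resource_manager):
        if manager is not None:
            manager.free_resources(dummy)  # assumed resource-manager API
    runner.padding_dummy_request = None
```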
tensorrt_llm/_torch/pyexecutor/model_engine.py (7)
396-406: Use of original_max_draft_len for buffer sizing: good change. This decouples runtime mutations from allocation sizing.
778-780: Warmup coverage for draft engines: good. Capturing graphs for original_max_draft_len in draft engines is necessary for first-draft replay.
797-804: Spec resource manager first-draft toggling: verify multi-rank consistency. You flip Eagle3ResourceManager.is_first_draft during warmup. Ensure all ranks see the same flag before capture to avoid divergent keys.
1283-1294: Partitioning first_draft_requests: solid, but ensure py_is_first_draft is set only for those. Upstream drafter sets py_is_first_draft. This partitioning aligns with CUDA-graph keying.
1661-1663: Accounting for first-draft requests in token totals: OK. Including len(first_draft_requests) in num_generation_tokens is reasonable for scheduling/metrics.
2175-2179: Updated spec-dec flag wiring: OK with the None-safe fix in interface.py. Once the None-safe guard lands, this callsite is correct.
2201-2243: Key-based CUDA graph integration: correct usage. Adopts needs_capture/capture/replay with the new key; looks consistent with the runner.
Actionable comments posted: 5
♻️ Duplicate comments (2)
tensorrt_llm/_torch/speculative/interface.py (1)
78-89: Fix None-safety and unreachable return; clarify EAGLE3 logic.
- spec_resource_manager may be None and BaseResourceManager doesn’t declare is_first_draft; use getattr with a default.
- The second return is dead code.
- Make the parameter Optional and return a bool explicitly.
```diff
-    def attention_need_spec_dec_mode(self,
-                                     spec_resource_manager: BaseResourceManager,
-                                     is_draft_model: bool):
+    def attention_need_spec_dec_mode(
+            self,
+            spec_resource_manager: Optional[BaseResourceManager],
+            is_draft_model: bool) -> bool:
@@
-        return self.is_eagle3_one_model() or (
-            self.is_eagle3() and spec_resource_manager.is_first_draft) or (
-                self.is_eagle3() and not is_draft_model)
-        return self.is_eagle3_one_model() or (
-            self.is_eagle3() and spec_resource_manager.is_first_draft)
+        if self.is_eagle3_one_model():
+            return True
+        if not self.is_eagle3():
+            return False
+        is_first = getattr(spec_resource_manager, "is_first_draft", False)
+        return is_first or not is_draft_model
```

tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
294-296: Don't mutate shared spec_config; also guard None. This can NPE when spec_config is None and leaks mutations across engines.
```diff
-        self.original_max_draft_len = spec_config.max_draft_len
-        if spec_config is not None and is_draft_model:
-            spec_config.max_draft_len = 0
-        self.spec_config = spec_config
+        self.original_max_draft_len = spec_config.max_draft_len if spec_config else 0
+        self.spec_config = copy.deepcopy(spec_config) if spec_config else None
+        if self.spec_config is not None and is_draft_model:
+            self.spec_config.max_draft_len = 0
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (13 hunks)
- tensorrt_llm/_torch/speculative/interface.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}: Use spaces only; no tabs; indent with 4 spaces
Prepend NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
- tensorrt_llm/_torch/pyexecutor/model_engine.py
- tensorrt_llm/_torch/speculative/interface.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Indent Python with 4 spaces; no tabs
Preserve module namespaces when importing: from package.subpackage import foo; then call foo.SomeClass() instead of importing the class directly
Python naming: files snake_case; classes PascalCase; functions/methods snake_case; locals snake_case (prefix k_ when starting with a number); globals UPPER_SNAKE_CASE with G_ prefix; constants UPPER_SNAKE_CASE
Avoid shadowing outer-scope variables; initialize all externally visible members in __init__
Prefer docstrings for interfaces used outside a file; limit comments to function-internal or file-local interfaces
Use Google-style docstrings for classes and functions; document attributes/variables inline so Sphinx can render them
Avoid reflection when simpler alternatives exist; prefer explicit parameters and return dicts over locals()/dynamic tricks
In try/except, catch the narrowest exceptions possible; keep try bodies minimal and use else for the main logic when doing duck-typing checks
Files:
- tensorrt_llm/_torch/pyexecutor/model_engine.py
- tensorrt_llm/_torch/speculative/interface.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (4)
tensorrt_llm/_torch/speculative/eagle3.py (1)
- Eagle3ResourceManager (18-77)
tensorrt_llm/_torch/pyexecutor/llm_request.py (3)
- append (79-98)
- append (125-142)
- get_draft_token_length (560-571)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (5)
- spec_metadata (71-72)
- maybe_get_cuda_graph (93-145)
- needs_capture (147-149)
- capture (151-203)
- replay (205-232)
tensorrt_llm/_torch/speculative/interface.py (1)
- attention_need_spec_dec_mode (78-88)
tensorrt_llm/_torch/speculative/interface.py (1)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
BaseResourceManager(58-81)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
769-780: CUDA-graph warmup coverage looks good; confirm memory headroom. Adding original_max_draft_len for draft engines increases capture shapes. Validate graph pool capacity on large drafts to avoid recapture/fallback.
Would you like a small script to scan WARMUP_STEPS, capture shapes, and peak pool usage from logs to ensure no recapture occurs?
1656-1656: Double-check token metrics to avoid double-counting
Ensure num_generation_tokens = len(generation_requests) + len(extend_requests) + sum(draft_lens) + len(first_draft_requests) does not overlap with downstream counts; verify that schedulers or your logging/metrics (e.g., in process_gen_iter / log.py) don't also add draft_lens or accepted-draft tokens separately.
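As a concrete example of the intended accounting: with 3 plain generation requests, 2 extend requests whose draft_lens are [4, 4], and 1 first-draft request, the formula yields 3 + 2 + 8 + 1 = 14; any downstream metric that also adds those 8 draft tokens (or the accepted subset) separately would double-count them.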
/bot run
PR_Github #17132 [ run ] triggered by Bot
PR_Github #17132 [ run ] completed with state
cf549d0 to 90b7e17
Actionable comments posted: 0
♻️ Duplicate comments (3)
tensorrt_llm/_torch/speculative/model_drafter.py (1)
119-126: Fix padding: handle Tensor/list, avoid negative pad, and don’t mutate upstream token container
input_tokens.extend(...) will fail when input_tokens is a Tensor and may compute a negative pad. Convert to a list, clamp the pad to ≥0, and pass the copy to the request. Apply this diff:
```diff
-        #pad input_tokens to max_draft_tokens
-        input_tokens.extend(
-            0 for _ in range(self.max_draft_tokens - num_accepted_tokens))
-        new_request = self._create_draft_request(request, input_tokens)
+        # Ensure list and pad to max_draft_tokens
+        if isinstance(input_tokens, torch.Tensor):
+            tokens = input_tokens.tolist()
+        else:
+            tokens = list(input_tokens)
+        pad = max(0, self.max_draft_tokens - num_accepted_tokens)
+        tokens.extend([0] * pad)
+        new_request = self._create_draft_request(request, tokens)
         new_request.state = LlmRequestState.GENERATION_IN_PROGRESS
-        new_request.py_num_accepted_draft_tokens = request.py_num_accepted_draft_tokens
+        new_request.py_num_accepted_draft_tokens = request.py_num_accepted_draft_tokens
         new_request.py_is_first_draft = True
```

tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
16-18: Add guarded import for LlmRequest used in annotations
Ensures the forward-ref resolves for type checkers without causing import cycles.
```diff
 if TYPE_CHECKING:
     from .model_engine import PyTorchModelEngine
+    from .llm_request import LlmRequest
```
54-55: Fix undefined type in annotation (Ruff F821): use LlmRequest in TYPE_CHECKING
Optional["Request"] is undefined at runtime and breaks lint/CI. Apply this diff here:
```diff
-        self.padding_dummy_request: Optional["Request"] = None
+        self.padding_dummy_request: Optional["LlmRequest"] = None
```

And add the guarded import (see separate comment on Lines 16-18).
🧹 Nitpick comments (7)
tensorrt_llm/_torch/speculative/model_drafter.py (2)
115-118: Docstring/state mismatch for accepted-tokens path
Docstring says "chunked context request," but state is set to GENERATION_IN_PROGRESS. Confirm intent and align doc/comment accordingly.
1-1: Add NVIDIA copyright header
Per repo guidelines, prepend the header.
```diff
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
```

tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (5)
27-30: Update class docstring to reflect new 2- or 3-tuple graph keys
Doc still says "(batch size, draft_len)"; keys can also be (batch_size, draft_len, is_first_draft).

```diff
-    multiple graphs, keyed by (batch size, draft_len).
+    multiple graphs, keyed by (batch_size, draft_len) or
+    (batch_size, draft_len, is_first_draft) for Eagle3 first-draft flow.
```
101-112: Update maybe_get_cuda_graph docstring to include the returned key
The function now returns 4 values (including the key).
```diff
-        Returns a tuple containing:
+        Returns a tuple containing:
             - A boolean indicating if a graph can be used.
             - The attn_metadata for the graph, if applicable.
             - The spec_metadata for the graph, if applicable.
+            - The graph key used for lookup/capture/replay.
```
19-21: Correct misleading comment about dummy request ID
(1 << 64) - 1 isn't a prime; it's a sentinel. Update the comment.

```diff
-# A large prime number used for dummy request IDs to avoid collisions
+# Sentinel 64-bit ID used for CUDA-graph dummy requests to avoid collisions
 CUDA_GRAPH_DUMMY_REQUEST_ID = (1 << 64) - 1
```
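(For the record: (1 << 64) - 1 is 2^64 - 1 = 18446744073709551615, the maximum unsigned 64-bit value, and it factors as 3 × 5 × 17 × 257 × 641 × 65537 × 6700417, so it is indeed not prime.)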
198-198: Use logger instead of print
Prefer the repo logger to avoid stdout noise.
```diff
-        print(f"capture graph {key}")
+        logger.debug(f"capture graph {key}")
```

Add the import near the other imports:

```diff
+from tensorrt_llm.logger import logger
```
1-1: Add NVIDIA copyright header
Per repo guidelines, prepend the header.
```diff
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (8 hunks)
- tensorrt_llm/_torch/speculative/model_drafter.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}: Use spaces only; no tabs; indent with 4 spaces
Prepend NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
- tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
- tensorrt_llm/_torch/speculative/model_drafter.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Indent Python with 4 spaces; no tabs
Preserve module namespaces when importing: from package.subpackage import foo; then call foo.SomeClass() instead of importing the class directly
Python naming: files snake_case; classes PascalCase; functions/methods snake_case; locals snake_case (prefix k_ when starting with a number); globals UPPER_SNAKE_CASE with G_ prefix; constants UPPER_SNAKE_CASE
Avoid shadowing outer-scope variables; initialize all externally visible members in __init__
Prefer docstrings for interfaces used outside a file; limit comments to function-internal or file-local interfaces
Use Google-style docstrings for classes and functions; document attributes/variables inline so Sphinx can render them
Avoid reflection when simpler alternatives exist; prefer explicit parameters and return dicts over locals()/dynamic tricks
In try/except, catch the narrowest exceptions possible; keep try bodies minimal and use else for the main logic when doing duck-typing checks
Files:
- tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
- tensorrt_llm/_torch/speculative/model_drafter.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (7)
tensorrt_llm/_torch/expert_statistic.py (1)
- ExpertStatistic (10-98)
tensorrt_llm/_torch/modules/multi_stream_utils.py (1)
- with_multi_stream (26-32)
tensorrt_llm/_torch/speculative/eagle3.py (1)
- Eagle3ResourceManager (18-77)
tensorrt_llm/_torch/utils.py (2)
- make_weak_ref (79-93)
- piecewise_cuda_graph (271-277)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (3)
- BaseResourceManager (58-81)
- ResourceManager (982-1017)
- ResourceManagerType (46-51)
tensorrt_llm/_torch/pyexecutor/scheduler.py (3)
- batch_size (35-36)
- ScheduledRequests (18-39)
- can_run_cuda_graph (31-32)
tensorrt_llm/_torch/speculative/interface.py (1)
- create_cuda_graph_metadata (157-168)
tensorrt_llm/_torch/speculative/model_drafter.py (4)
tensorrt_llm/runtime/generation.py (1)
- max_draft_tokens (1283-1286)
cpp/include/tensorrt_llm/batch_manager/llmRequest.h (1)
- LlmRequestState (45-201)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
- forward (79-87)
- forward (2149-2241)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
- forward (1442-1449)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
53-53: Undefined name Request
(F821)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/_torch/speculative/model_drafter.py (1)
360-362: First-draft flag reset: confirm lifecycle
Resetting py_is_first_draft = False only on pre-existing generation requests seems right after the first forward. Please confirm no new "first draft" requests are appended later in this iteration that should remain True.

tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)
171-175: Verify token-per-request formula
token_per_request = key[1] + 1 assumes +1 for the next token; confirm this matches attn/spec metadata expectations for first-draft vs subsequent graphs.
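For example, with key[1] = 3 (the draft length carried in the key), the captured graph budgets 3 draft tokens plus 1 newly sampled token per request, i.e. token_per_request = 4.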
Things seem a bit tightly coupled with eagle3 right now. I'm wondering if we can generalize a bit:
- At the very least, we can remove all isinstance(spec_resource_manager, Eagle3ResourceManager) calls. We can instead use the more general spec_mode.needs_kv_cache_recompute() function (see the sketch after this list).
- It's not really ideal that we're adding yet another implicit request type in ModelEngine's prepare inputs. Now we have 5 paths: extend (generation w/ draft), generation, context, chunked context, first draft requests. Is there a way to clean this up? I had something more generic in mind where ModelEngine would just expose some limited support for running chunked context + CUDA graphs. This would have the advantage of automatically working with the new CDL stuff.
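To make the first point concrete, a rough sketch of the suggested direction; needs_kv_cache_recompute comes from the comment above, while the helper name and surrounding wiring are placeholders rather than the actual engine code:

```python
def uses_first_draft_graph_key(spec_mode, spec_resource_manager) -> bool:
    """Sketch: decide first-draft handling without isinstance(Eagle3ResourceManager)."""
    # Ask the speculative-decoding mode for the capability instead of checking
    # for a specific resource-manager class.
    if spec_mode is None or not spec_mode.needs_kv_cache_recompute():
        return False
    # Any manager that exposes an is_first_draft flag participates.
    return bool(getattr(spec_resource_manager, "is_first_draft", False))
```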
90b7e17 to 376b696
Hi Mike,
31578df to 155bf00
1a2014e to cc66c07
Thanks! We should just address the nitpicks before merging. Accepted to unblock.
Requested changes have been applied
Signed-off-by: qgai <qgai@nvidia.com>
f1d8c45 to a7f98d3
/bot run
PR_Github #19867 [ run ] triggered by Bot
a7f98d3 to 010bdfb
/bot run
PR_Github #19871 [ run ] triggered by Bot
PR_Github #19867 [ run ] completed with state
PR_Github #19871 [ run ] completed with state
010bdfb to 4b3fe62
/bot run --disable-fail-fast
PR_Github #19918 [ run ] triggered by Bot
PR_Github #19918 [ run ] completed with state
/bot run
PR_Github #19958 [ run ] triggered by Bot
/bot run --disable-fail-fast
PR_Github #19961 [ run ] triggered by Bot
PR_Github #19958 [ run ] completed with state
PR_Github #19961 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #20002 [ run ] triggered by Bot
PR_Github #20002 [ run ] completed with state
Summary by CodeRabbit
New Features
Refactor
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user-friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.
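For example, combining options documented below, a comment like /bot run --disable-fail-fast --stage-list "A10-PyTorch-1" launches a pipeline that runs only that stage and does not stop at the first failure.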
- --reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- --disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- --disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- --skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- --stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- --gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- --test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- --only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- --disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- --add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- --post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- --detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- --debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see
docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.