[https://nvbugs/5427801][fix] Torch compile support for Llama4 and Ea… by liji-nv · Pull Request #6978 · NVIDIA/TensorRT-LLM · GitHub

Conversation

@liji-nv
Collaborator

@liji-nv liji-nv commented Aug 18, 2025

…… (#6858)

This is a Cherry-pick from main branch.

Summary by CodeRabbit

  • New Features

    • Optional optimized attention execution path for compatible backends to improve performance.
  • Refactor

    • Centralized attention/rotary processing and removed per-branch FP8/NVFP4 output-scale wiring.
  • Bug Fixes

    • Mirror draft model layer attributes into target configuration when present to keep specs consistent.
  • Tests

    • Use a shared Torch compile configuration for FP8 accuracy tests and re-enable a previously skipped FP8 test.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
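
For example, combining only the flags documented above, a typical invocation might look like:

/bot run --stage-list "A10-PyTorch-1" --disable-fail-fast

which would run only that stage and keep the pipeline going past individual failures.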

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since skipping tests without careful validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since reusing results without careful validation can break the top of tree.

@liji-nv liji-nv requested a review from a team as a code owner August 18, 2025 03:46
@coderabbitai
Contributor

coderabbitai bot commented Aug 18, 2025

📝 Walkthrough

Walkthrough

Centralized attention execution by adding Attention.forward_impl and removing per-call FP8/NVFP4 out-scale wiring in LLaMA; merged draft model attn/MLA layer attrs into model_config for speculative decoding; adjusted Torch-compile test config usage and removed a global skip for an FP8 Eagle3 test.

Changes

Cohort / File(s) Summary of Changes
Attention core path refactor
tensorrt_llm/_torch/modules/attention.py
Added Attention.forward_impl to centralize attention logic, extract mrope_config, choose custom inplace op vs _attn_impl, and wrap tuple outputs as Fp4QuantizedTensor. Updated attn_custom_op_inplace call to pass enable_attn_nvfp4_output=False by name. forward now delegates to forward_impl.
LLaMA attention integration
tensorrt_llm/_torch/models/modeling_llama.py
Removed manual computation/propagation of FP8/NVFP4 out_scale/out_scale_sf. Replaced self.attn.forward(...) with self.attn.forward_impl(q, k, v, attn_metadata, attention_mask, None, None, mrope_config) and preserved downstream o_proj handling (wrapping tuple outputs as Fp4QuantizedTensor where applicable).
Speculative decoding config sync
tensorrt_llm/_torch/models/modeling_speculative.py
In SpecDecOneEngineForCausalLM.__init__, if draft_config exists, copy draft_config.extra_attrs entries (asserting keys are attn_layers or mla_layers and present) into model_config.extra_attrs via .update(). No signature changes.
Integration tests — Torch compile config
tests/integration/defs/accuracy/test_llm_api_pytorch.py
For two test_fp8_eagle3 cases, create a torch_compile_config variable (TorchCompileConfig(enable_fullgraph=True, enable_piecewise_cuda_graph=True, max_num_streams=3) when enabled) and pass it in pytorch_config instead of constructing inline.
Test waivers
tests/integration/test_lists/waives.txt
Removed the SKIP entry for accuracy/test_llm_api_pytorch.py::TestLlama4MaverickInstruct::test_fp8_eagle3[tp8-torch_compile=True], making the test eligible to run.
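
For orientation, here is a minimal sketch of the refactor summarized in the table above, written against the forward_impl signature quoted later in this thread. The eligibility flag, the custom-op argument list, and the pre-allocation helper are paraphrased from the review comments below, so treat everything beyond the names already mentioned as assumptions rather than the actual TensorRT-LLM implementation.

# Sketch only: assumed shape of the centralized attention path, not the real code.
def forward_impl(self, q, k, v, attn_metadata, attention_mask,
                 attention_window_size, attention_mask_data, mrope_config):
    # Pull the mrope fields out of the optional config dict.
    mrope_rotary_cos_sin = None
    mrope_position_deltas = None
    if mrope_config is not None:
        mrope_rotary_cos_sin = mrope_config.get("mrope_rotary_cos_sin")
        mrope_position_deltas = mrope_config.get("mrope_position_deltas")

    if self.use_custom_inplace_op:  # hypothetical flag; the gating is discussed below
        output = self.create_output(q)  # pre-allocate, then mutate in place
        attn_custom_op_inplace(q, k, v, output, attn_metadata, attention_mask,
                               mrope_rotary_cos_sin, mrope_position_deltas,
                               enable_attn_nvfp4_output=False)
    else:
        output = self._attn_impl(q, k, v, attn_metadata, attention_mask,
                                 attention_window_size, attention_mask_data,
                                 mrope_rotary_cos_sin, mrope_position_deltas)

    # NVFP4 path returns (values, scale factors); wrap so o_proj sees one object.
    if isinstance(output, tuple):
        output = Fp4QuantizedTensor(output[0], output[1])
    return output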

Sequence Diagram(s)

sequenceDiagram
    participant LlamaModel as modeling_llama
    participant Attn as Attention
    participant CustomOp as attn_custom_op_inplace
    participant Impl as _attn_impl

    LlamaModel->>Attn: forward_impl(q, k, v, attn_metadata, attention_mask, None, None, mrope_config)
    alt Use custom inplace op
        Attn->>CustomOp: run(..., mrope params, enable_attn_nvfp4_output=False)
        CustomOp-->>Attn: attn_output (tensor or tuple)
    else
        Attn->>Impl: _attn_impl(..., mrope params)
        Impl-->>Attn: attn_output (tensor or tuple)
    end
    Attn-->>LlamaModel: attn_output (wrapped as Fp4QuantizedTensor if tuple)
    LlamaModel->>LlamaModel: o_proj(attn_output)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

Speculative Decoding, CI

Suggested reviewers

  • hlu1
  • yuxianq
  • byshiue
  • yechank-nvidia


📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 6eac92e and 6938be2.

📒 Files selected for processing (1)
  • tests/integration/test_lists/waives.txt (0 hunks)
💤 Files with no reviewable changes (1)
  • tests/integration/test_lists/waives.txt
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@liji-nv
Collaborator Author

liji-nv commented Aug 18, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15578 [ run ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🔭 Outside diff range comments (4)
tensorrt_llm/_torch/models/modeling_speculative.py (2)

330-332: Raise the correct exception type

Raising NotImplemented (a sentinel) will error out with a TypeError. Use NotImplementedError.

Apply this diff:

-        raise NotImplemented(
+        raise NotImplementedError(
             f"get_draft_model does not support speculative decoding mode {spec_dec_mode}."
         )

1-1: Missing SPDX copyright header

Per repository guidelines, prepend the NVIDIA SPDX header.

Apply this patch at the top of the file:

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
tensorrt_llm/_torch/modules/attention.py (1)

1-1: Missing SPDX copyright header

Per repository guidelines, prepend the NVIDIA SPDX header.

Apply this patch at the top of the file:

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
tensorrt_llm/_torch/models/modeling_llama.py (1)

1-1: Missing SPDX copyright header

Per repository guidelines, prepend the NVIDIA SPDX header.

Apply this patch at the top of the file:

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
🧹 Nitpick comments (2)
tensorrt_llm/_torch/modules/attention.py (1)

401-415: Return semantics are consistent; consider documenting Fp4QuantizedTensor path

forward_impl returns either a Tensor or an Fp4QuantizedTensor (when output_sf is available). This is fine; add a short docstring or comment to set expectations for callers.

tensorrt_llm/_torch/models/modeling_llama.py (1)

179-181: Redundant tuple handling after forward_impl

forward_impl returns a Tensor (or Fp4QuantizedTensor). The tuple case won’t occur; remove for clarity.

Apply this minimal cleanup:

-        if isinstance(attn_output, tuple):
-            attn_output = Fp4QuantizedTensor(attn_output[0], attn_output[1])
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 96bda14 and 026df41.

📒 Files selected for processing (5)
  • tensorrt_llm/_torch/models/modeling_llama.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_speculative.py (1 hunks)
  • tensorrt_llm/_torch/modules/attention.py (3 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunks)
  • tests/integration/test_lists/waives.txt (0 hunks)
💤 Files with no reviewable changes (1)
  • tests/integration/test_lists/waives.txt
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/modules/attention.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/modules/attention.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
🔇 Additional comments (3)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)

553-556: TorchCompileConfig wiring looks correct

The torch_compile_config is constructed and threaded into pytorch_config as expected. This aligns with the new torch.compile-aware attention path.

Also applies to: 561-561
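
For reference, a hedged sketch of what that wiring likely looks like, using the parameter names from the walkthrough; the shape of pytorch_config and the torch_compile flag are assumptions about the test code, not the exact implementation.

# Assumed shape of the test configuration change (parameter names from the PR summary).
torch_compile_config = TorchCompileConfig(
    enable_fullgraph=True,
    enable_piecewise_cuda_graph=True,
    max_num_streams=3,
) if torch_compile else None

pytorch_config = dict(
    torch_compile_config=torch_compile_config,
    # other backend options left unchanged by this PR
)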

tensorrt_llm/_torch/modules/attention.py (1)

95-96: Good: explicit named argument for enable_attn_nvfp4_output

Switching to a named argument improves clarity and guards against call-site drift.

tensorrt_llm/_torch/models/modeling_llama.py (1)

169-178: No unexpected attention_sinks kwargs found on forward_impl calls

I ran the provided ripgrep checks and confirmed that the only call site passing attention_sinks to forward_impl is in
tensorrt_llm/_torch/models/modeling_llama.py:169–178.
All other uses of forward_impl (including named-arg calls in flashinfer.py) only pass the expected parameters.
No further action needed here.

@tensorrt-cicd
Collaborator

PR_Github #15578 [ run ] completed with state SUCCESS
/LLM/release-1.0/L0_MergeRequest_PR pipeline #171 completed with status: 'FAILURE'

@liji-nv liji-nv force-pushed the cherry-pick-18ccd05 branch from e8c599b to c7bf0fa on August 18, 2025 06:10
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
tensorrt_llm/_torch/models/modeling_llama.py (2)

178-180: Remove dead tuple handling after forward_impl

forward_impl now returns either a Tensor or an Fp4QuantizedTensor (wrapping happens inside), so this tuple case is obsolete. Removing it reduces noise and avoids masking future mismatches.

-        if isinstance(attn_output, tuple):
-            attn_output = Fp4QuantizedTensor(attn_output[0], attn_output[1])

491-498: Llama4DecoderLayer does not pass attention_mask (unlike LlamaDecoderLayer) — verify encoder/prefill use

LlamaDecoderLayer sets attention_mask to FULL when not generation and passes it into self_attn. Llama4DecoderLayer doesn’t set or pass attention_mask, so it will always use the default CAUSAL. If Llama4Model can be used in encoder/prefill-only scenarios (is_generation=False), this will compute a causal mask incorrectly.

Proposed alignment with LlamaDecoderLayer:

@@
 class Llama4DecoderLayer(DecoderLayer):
@@
         self.moe_allreduce = MoEAllReduce(self.mapping)
 
+        # Match LlamaDecoderLayer behavior: use FULL mask when not generating (encoder/prefill).
+        self.attention_mask = PredefinedAttentionMask.CAUSAL
+        if not model_config.is_generation:
+            self.attention_mask = PredefinedAttentionMask.FULL
@@
         # Self Attention
         hidden_states = self.self_attn(
             position_ids=position_ids,
             hidden_states=hidden_states,
             attn_metadata=attn_metadata,
+            attention_mask=self.attention_mask,
             all_reduce_params=AllReduceParams(
                 enable_allreduce=not self.disable_attn_allreduce),
             **kwargs,
         )

If Llama4 is never used in prefill-only mode, feel free to ignore; otherwise, this avoids subtle accuracy regressions.

Also applies to: 655-659

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 026df41 and c7bf0fa.

📒 Files selected for processing (5)
  • tensorrt_llm/_torch/models/modeling_llama.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_speculative.py (1 hunks)
  • tensorrt_llm/_torch/modules/attention.py (3 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunks)
  • tests/integration/test_lists/waives.txt (0 hunks)
💤 Files with no reviewable changes (1)
  • tests/integration/test_lists/waives.txt
🚧 Files skipped from review as they are similar to previous changes (3)
  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tensorrt_llm/_torch/modules/attention.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/models/modeling_llama.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/models/modeling_llama.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/_torch/models/modeling_llama.py (2)

169-176: Fixed: removed unsupported kwarg and switched to forward_impl — LGTM

Dropping the extraneous attention_sinks kwarg and delegating to forward_impl aligns with the new centralized attention path and should resolve the TypeError seen earlier. Passing None placeholders for output scales is consistent with removing per-call scale wiring.


169-176: forward_impl requires explicit None placeholders
The forward_impl signature in tensorrt_llm/_torch/modules/attention.py declares its last three parameters without defaults:

def forward_impl(
    self,
    q: torch.Tensor,
    k: Optional[torch.Tensor],
    v: Optional[torch.Tensor],
    attn_metadata: AttentionMetadata,
    attention_mask: AttentionMask,
    attention_window_size: Optional[int],
    attention_mask_data: Optional[torch.Tensor],
    mrope_config: Optional[dict],
):

Because neither attention_window_size nor attention_mask_data have default values, the two None placeholders are required. The current invocation is correct and must remain as-is.

Likely an incorrect or invalid review comment.

@liji-nv liji-nv force-pushed the cherry-pick-18ccd05 branch 2 times, most recently from 8585a28 to 374d859 on August 18, 2025 06:26
@liji-nv
Collaborator Author

liji-nv commented Aug 18, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15602 [ run ] triggered by Bot

@liji-nv liji-nv force-pushed the cherry-pick-18ccd05 branch from 374d859 to 9032282 on August 18, 2025 07:16
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
tensorrt_llm/_torch/models/modeling_speculative.py (1)

365-370: Make merge of extra_attrs robust; avoid hard assertions and accidental overwrites

Asserting on unexpected keys or missing registries can abort initialization in benign situations. Using dict.update() can clobber existing indices if they collide. Prefer filtering to known registries, ensure the target registry exists, and perform a non-clobbering merge.

Apply this diff:

-            if draft_config is not None:
-                for key, value in draft_config.extra_attrs.items():
-                    assert key in ('attn_layers', 'mla_layers')
-                    assert key in model_config.extra_attrs
-                    model_config.extra_attrs[key].update(value)
+            if draft_config is not None and getattr(draft_config, "extra_attrs", None):
+                for key, value in draft_config.extra_attrs.items():
+                    # Only merge layer registries we rely on
+                    if key not in ("attn_layers", "mla_layers") or not value:
+                        continue
+                    target = model_config.extra_attrs.setdefault(key, {})
+                    # Merge without clobbering existing indices (draft indices should be disjoint)
+                    for idx, ref in value.items():
+                        if idx in target:
+                            # Keep existing target ref; draft indices are expected to be offset/disjoint.
+                            continue
+                        target[idx] = ref
tensorrt_llm/_torch/modules/attention.py (1)

375-385: Harden forward_impl signature to accept ignored kwargs for forward-compatibility

Callers may pass benign extra kwargs (e.g., new flags). Accept and ignore unknown kwargs to avoid breaking changes.

Apply this diff:

-    def forward_impl(
+    def forward_impl(
         self,
         q: torch.Tensor,
         k: Optional[torch.Tensor],
         v: Optional[torch.Tensor],
         attn_metadata: AttentionMetadata,
         attention_mask: AttentionMask,
         attention_window_size: Optional[int],
         attention_mask_data: Optional[torch.Tensor],
-        mrope_config: Optional[dict],
-    ):
+        mrope_config: Optional[dict],
+        **kwargs,
+    ):
🧹 Nitpick comments (1)
tensorrt_llm/_torch/modules/attention.py (1)

473-481: Propagate extra kwargs from forward() to forward_impl

If forward() receives benign extras, pass them through so forward_impl can ignore them. This reduces the risk of subtle breakages when call sites evolve.

Apply this diff:

-        output = self.forward_impl(q,
-                                   k,
-                                   v,
-                                   attn_metadata,
-                                   attention_mask,
-                                   attention_window_size,
-                                   attention_mask_data,
-                                   mrope_config=mrope_config)
+        output = self.forward_impl(q,
+                                   k,
+                                   v,
+                                   attn_metadata,
+                                   attention_mask,
+                                   attention_window_size,
+                                   attention_mask_data,
+                                   mrope_config=mrope_config,
+                                   **kwargs)
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between c7bf0fa and 9032282.

📒 Files selected for processing (5)
  • tensorrt_llm/_torch/models/modeling_llama.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_speculative.py (1 hunks)
  • tensorrt_llm/_torch/modules/attention.py (3 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunks)
  • tests/integration/test_lists/waives.txt (0 hunks)
💤 Files with no reviewable changes (1)
  • tests/integration/test_lists/waives.txt
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tensorrt_llm/_torch/models/modeling_llama.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tensorrt_llm/_torch/modules/attention.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tensorrt_llm/_torch/modules/attention.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/modules/attention.py (1)

95-96: Disabling NVFP4 output on the torch.compile custom-op path — LGTM

Setting enable_attn_nvfp4_output=False here is correct; it avoids an unsupported path under torch.compile for TRTLLM and keeps quantization handling centralized.

@tensorrt-cicd
Collaborator

PR_Github #15602 [ run ] completed with state SUCCESS
/LLM/release-1.0/L0_MergeRequest_PR pipeline #174 completed with status: 'FAILURE'

NVIDIA#6858)

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
@liji-nv liji-nv force-pushed the cherry-pick-18ccd05 branch from 9032282 to 6eac92e on August 18, 2025 13:10
@liji-nv
Collaborator Author

liji-nv commented Aug 18, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15614 [ run ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🔭 Outside diff range comments (1)
tensorrt_llm/_torch/modules/attention.py (1)

1-1: Prepend NVIDIA 2025 copyright header (repo policy).

This Python source file is missing the required NVIDIA copyright header.

Apply this diff at the top of the file:

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 import math
♻️ Duplicate comments (1)
tensorrt_llm/_torch/modules/attention.py (1)

375-385: Harden forward_impl signature for forward-compatibility and add return typing/docstring.

Accepting and ignoring extra kwargs prevents runtime breaks when upstreams pass benign extras (e.g., attention_sinks). Also add a concise docstring and an explicit return type.

Apply this diff:

-    def forward_impl(
+    def forward_impl(
         self,
         q: torch.Tensor,
         k: Optional[torch.Tensor],
         v: Optional[torch.Tensor],
         attn_metadata: AttentionMetadata,
         attention_mask: AttentionMask,
         attention_window_size: Optional[int],
         attention_mask_data: Optional[torch.Tensor],
         mrope_config: Optional[dict],
-    ):
+        **_ignored_kwargs,
+    ) -> Union[torch.Tensor, Fp4QuantizedTensor]:
+        """
+        Centralized attention execution used by compiled TRTLLM/FLASHINFER backends.
+        Accepts and ignores extra keyword arguments for forward-compatibility.
+        Args:
+            q, k, v: QKV tensors (fused or unfused depending on backend).
+            attn_metadata: Attention metadata.
+            attention_mask: Mask spec (e.g., PredefinedAttentionMask.CAUSAL).
+            attention_window_size: Optional sliding-window limit.
+            attention_mask_data: Optional mask payload for runtime backends.
+            mrope_config: Optional dict with mrope_rotary_cos_sin and/or mrope_position_deltas.
+        Returns:
+            Tensor or Fp4QuantizedTensor depending on quantization mode.
+        """

Additionally update imports to include Any only if you decide to annotate _ignored_kwargs values; otherwise it can remain untyped:

# near the top of the file
from typing import Optional, Union, cast  # current
# change to:
from typing import Optional, Union, cast, Any
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 9032282 and 6eac92e.

📒 Files selected for processing (5)
  • tensorrt_llm/_torch/models/modeling_llama.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_speculative.py (1 hunks)
  • tensorrt_llm/_torch/modules/attention.py (3 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunks)
  • tests/integration/test_lists/waives.txt (0 hunks)
💤 Files with no reviewable changes (1)
  • tests/integration/test_lists/waives.txt
🚧 Files skipped from review as they are similar to previous changes (3)
  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tensorrt_llm/_torch/models/modeling_llama.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/modules/attention.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/modules/attention.py
🔇 Additional comments (5)
tensorrt_llm/_torch/modules/attention.py (5)

70-97: Good: In custom-op path, explicitly disable NVFP4 attention output.

Passing enable_attn_nvfp4_output=False avoids NVFP4 output under torch.compile, which is consistent with the comment and avoids unsupported runtime paths.


394-401: Custom in-place op gating under torch.compile and supported backends looks correct.

The condition ensures we only route through the custom op when:

  • the layer is registered (so weakrefs/metadata can be resolved),
  • backend is TRTLLM or FLASHINFER,
  • and torch is compiling.

This minimizes surprises in eager runs.
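
A minimal standalone sketch of that gating predicate, under the assumption that the three conditions above map onto simple flags; the attribute and function names are illustrative, and torch.compiler.is_compiling() is the public PyTorch API for the last check (available in recent PyTorch releases).

import torch

def should_use_custom_inplace_op(layer_is_registered: bool, attn_backend: str) -> bool:
    """Sketch of the eligibility check described above; names are illustrative."""
    return (
        layer_is_registered                       # weakrefs/metadata can be resolved
        and attn_backend in ("TRTLLM", "FLASHINFER")
        and torch.compiler.is_compiling()         # only route here under torch.compile
    )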


401-415: Allocate output once and invoke custom op in-place: efficient and clean.

Using create_output() to pre-allocate and mutating in-place avoids extra allocations/copies. Nice.


416-424: NVFP4 tuple handling preserved in non-custom path.

Wrapping output into Fp4QuantizedTensor when output_sf is present maintains the previous contract and allows o_proj to consume quantized output seamlessly.


473-480: Forward now cleanly delegates to forward_impl.

Centralization reduces duplication and keeps backend routing logic in one place.

@tensorrt-cicd
Collaborator

PR_Github #15614 [ run ] completed with state FAILURE
/LLM/release-1.0/L0_MergeRequest_PR pipeline #176 completed with status: 'FAILURE'

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
@liji-nv
Collaborator Author

liji-nv commented Aug 19, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15690 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15690 [ run ] completed with state SUCCESS
/LLM/release-1.0/L0_MergeRequest_PR pipeline #197 completed with status: 'FAILURE'

@liji-nv
Collaborator Author

liji-nv commented Aug 19, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15748 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15748 [ run ] completed with state SUCCESS
/LLM/release-1.0/L0_MergeRequest_PR pipeline #211 completed with status: 'FAILURE'

@liji-nv
Collaborator Author

liji-nv commented Aug 20, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15854 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15854 [ run ] completed with state SUCCESS
/LLM/release-1.0/L0_MergeRequest_PR pipeline #224 completed with status: 'SUCCESS'

@liji-nv liji-nv merged commit 69846c6 into NVIDIA:release/1.0 Aug 20, 2025
5 checks passed
yuanjingx87 pushed a commit that referenced this pull request Aug 20, 2025
#6978)

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Sep 5, 2025
NVIDIA#6978)

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Sep 5, 2025
NVIDIA#6978)

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Sep 6, 2025
NVIDIA#6978)

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Sep 6, 2025
NVIDIA#6978)

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Sep 7, 2025
NVIDIA#6978)

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
