[https://nvbugs/5427801][fix] Torch compile support for Llama4 and Ea… by liji-nv · Pull Request #6978 · NVIDIA/TensorRT-LLM · GitHub

Conversation

@liji-nv
Collaborator

@liji-nv liji-nv commented Aug 18, 2025

…… (#6858)

This is a Cherry-pick from main branch.

Summary by CodeRabbit

  • New Features

    • Optional optimized attention execution path for compatible backends to improve performance.
  • Refactor

    • Centralized attention/rotary processing and removed per-branch FP8/NVFP4 output-scale wiring.
  • Bug Fixes

    • Mirror draft model layer attributes into target configuration when present to keep specs consistent.
  • Tests

    • Use a shared Torch compile configuration for FP8 accuracy tests and re-enable a previously skipped FP8 test.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
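
For example, combining only the flags documented above, a typical invocation might look like:

/bot run --stage-list "A10-PyTorch-1" --disable-fail-fast

which would run only that stage and keep the pipeline going past individual failures.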

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since skipping tests without careful validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since reusing results without careful validation can break the top of tree.

@liji-nv liji-nv requested a review from a team as a code owner August 18, 2025 03:46
@coderabbitai
Contributor

coderabbitai bot commented Aug 18, 2025

📝 Walkthrough

Walkthrough

Centralized attention execution by adding Attention.forward_impl and removing per-call FP8/NVFP4 out-scale wiring in LLaMA; merged draft model attn/MLA layer attrs into model_config for speculative decoding; adjusted Torch-compile test config usage and removed a global skip for an FP8 Eagle3 test.

Changes

Cohort / File(s) Summary of Changes
Attention core path refactor
tensorrt_llm/_torch/modules/attention.py
Added Attention.forward_impl to centralize attention logic, extract mrope_config, choose custom inplace op vs _attn_impl, and wrap tuple outputs as Fp4QuantizedTensor. Updated attn_custom_op_inplace call to pass enable_attn_nvfp4_output=False by name. forward now delegates to forward_impl.
LLaMA attention integration
tensorrt_llm/_torch/models/modeling_llama.py
Removed manual computation/propagation of FP8/NVFP4 out_scale/out_scale_sf. Replaced self.attn.forward(...) with self.attn.forward_impl(q, k, v, attn_metadata, attention_mask, None, None, mrope_config) and preserved downstream o_proj handling (wrapping tuple outputs as Fp4QuantizedTensor where applicable).
Speculative decoding config sync
tensorrt_llm/_torch/models/modeling_speculative.py
In SpecDecOneEngineForCausalLM.__init__, if draft_config exists, copy draft_config.extra_attrs entries (asserting keys are attn_layers or mla_layers and present) into model_config.extra_attrs via .update(). No signature changes.
Integration tests — Torch compile config
tests/integration/defs/accuracy/test_llm_api_pytorch.py
For two test_fp8_eagle3 cases, create a torch_compile_config variable (TorchCompileConfig(enable_fullgraph=True, enable_piecewise_cuda_graph=True, max_num_streams=3) when enabled) and pass it in pytorch_config instead of constructing inline.
Test waivers
tests/integration/test_lists/waives.txt
Removed the SKIP entry for accuracy/test_llm_api_pytorch.py::TestLlama4MaverickInstruct::test_fp8_eagle3[tp8-torch_compile=True], making the test eligible to run.
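
For orientation, here is a minimal sketch of the refactor summarized in the table above, written against the forward_impl signature quoted later in this thread. The eligibility flag, the custom-op argument list, and the pre-allocation helper are paraphrased from the review comments below, so treat everything beyond the names already mentioned as assumptions rather than the actual TensorRT-LLM implementation.

# Sketch only: assumed shape of the centralized attention path, not the real code.
def forward_impl(self, q, k, v, attn_metadata, attention_mask,
                 attention_window_size, attention_mask_data, mrope_config):
    # Pull the mrope fields out of the optional config dict.
    mrope_rotary_cos_sin = None
    mrope_position_deltas = None
    if mrope_config is not None:
        mrope_rotary_cos_sin = mrope_config.get("mrope_rotary_cos_sin")
        mrope_position_deltas = mrope_config.get("mrope_position_deltas")

    if self.use_custom_inplace_op:  # hypothetical flag; the gating is discussed below
        output = self.create_output(q)  # pre-allocate, then mutate in place
        attn_custom_op_inplace(q, k, v, output, attn_metadata, attention_mask,
                               mrope_rotary_cos_sin, mrope_position_deltas,
                               enable_attn_nvfp4_output=False)
    else:
        output = self._attn_impl(q, k, v, attn_metadata, attention_mask,
                                 attention_window_size, attention_mask_data,
                                 mrope_rotary_cos_sin, mrope_position_deltas)

    # NVFP4 path returns (values, scale factors); wrap so o_proj sees one object.
    if isinstance(output, tuple):
        output = Fp4QuantizedTensor(output[0], output[1])
    return output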

Sequence Diagram(s)

sequenceDiagram
    participant LlamaModel as modeling_llama
    participant Attn as Attention
    participant CustomOp as attn_custom_op_inplace
    participant Impl as _attn_impl

    LlamaModel->>Attn: forward_impl(q, k, v, attn_metadata, attention_mask, None, None, mrope_config)
    alt Use custom inplace op
        Attn->>CustomOp: run(..., mrope params, enable_attn_nvfp4_output=False)
        CustomOp-->>Attn: attn_output (tensor or tuple)
    else
        Attn->>Impl: _attn_impl(..., mrope params)
        Impl-->>Attn: attn_output (tensor or tuple)
    end
    Attn-->>LlamaModel: attn_output (wrapped as Fp4QuantizedTensor if tuple)
    LlamaModel->>LlamaModel: o_proj(attn_output)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

Speculative Decoding, CI

Suggested reviewers

  • hlu1
  • yuxianq
  • byshiue
  • yechank-nvidia


📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 6eac92e and 6938be2.

📒 Files selected for processing (1)
  • tests/integration/test_lists/waives.txt (0 hunks)
💤 Files with no reviewable changes (1)
  • tests/integration/test_lists/waives.txt
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@liji-nv
Collaborator Author

liji-nv commented Aug 18, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15578 [ run ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🔭 Outside diff range comments (4)
tensorrt_llm/_torch/models/modeling_speculative.py (2)

330-332: Raise the correct exception type

Raising NotImplemented (a sentinel) will error out with a TypeError. Use NotImplementedError.

Apply this diff:

-        raise NotImplemented(
+        raise NotImplementedError(
             f"get_draft_model does not support speculative decoding mode {spec_dec_mode}."
         )

1-1: Missing SPDX copyright header

Per repository guidelines, prepend the NVIDIA SPDX header.

Apply this patch at the top of the file:

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
tensorrt_llm/_torch/modules/attention.py (1)

1-1: Missing SPDX copyright header

Per repository guidelines, prepend the NVIDIA SPDX header.

Apply this patch at the top of the file:

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
tensorrt_llm/_torch/models/modeling_llama.py (1)

1-1: Missing SPDX copyright header

Per repository guidelines, prepend the NVIDIA SPDX header.

Apply this patch at the top of the file:

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
🧹 Nitpick comments (2)
tensorrt_llm/_torch/modules/attention.py (1)

401-415: Return semantics are consistent; consider documenting Fp4QuantizedTensor path

forward_impl returns either a Tensor or an Fp4QuantizedTensor (when output_sf is available). This is fine; add a short docstring or comment to set expectations for callers.

tensorrt_llm/_torch/models/modeling_llama.py (1)

179-181: Redundant tuple handling after forward_impl

forward_impl returns a Tensor (or Fp4QuantizedTensor). The tuple case won’t occur; remove for clarity.

Apply this minimal cleanup:

-        if isinstance(attn_output, tuple):
-            attn_output = Fp4QuantizedTensor(attn_output[0], attn_output[1])
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 96bda14 and 026df41.

📒 Files selected for processing (5)
  • tensorrt_llm/_torch/models/modeling_llama.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_speculative.py (1 hunks)
  • tensorrt_llm/_torch/modules/attention.py (3 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunks)
  • tests/integration/test_lists/waives.txt (0 hunks)
💤 Files with no reviewable changes (1)
  • tests/integration/test_lists/waives.txt
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/modules/attention.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/modules/attention.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
🔇 Additional comments (3)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)

553-556: TorchCompileConfig wiring looks correct

The torch_compile_config is constructed and threaded into pytorch_config as expected. This aligns with the new torch.compile-aware attention path.

Also applies to: 561-561
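
For reference, a hedged sketch of what that wiring likely looks like, using the parameter names from the walkthrough; the shape of pytorch_config and the torch_compile flag are assumptions about the test code, not the exact implementation.

# Assumed shape of the test configuration change (parameter names from the PR summary).
torch_compile_config = TorchCompileConfig(
    enable_fullgraph=True,
    enable_piecewise_cuda_graph=True,
    max_num_streams=3,
) if torch_compile else None

pytorch_config = dict(
    torch_compile_config=torch_compile_config,
    # other backend options left unchanged by this PR
)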

tensorrt_llm/_torch/modules/attention.py (1)

95-96: Good: explicit named argument for enable_attn_nvfp4_output

Switching to a named argument improves clarity and guards against call-site drift.

tensorrt_llm/_torch/models/modeling_llama.py (1)

169-178: No unexpected attention_sinks kwargs found on forward_impl calls

I ran the provided ripgrep checks and confirmed that the only call site passing attention_sinks to forward_impl is in
tensorrt_llm/_torch/models/modeling_llama.py:169–178.
All other uses of forward_impl (including named-arg calls in flashinfer.py) only pass the expected parameters.
No further action needed here.

@tensorrt-cicd
Collaborator

PR_Github #15578 [ run ] completed with state SUCCESS
/LLM/release-1.0/L0_MergeRequest_PR pipeline #171 completed with status: 'FAILURE'

@liji-nv liji-nv force-pushed the cherry-pick-18ccd05 branch from e8c599b to c7bf0fa on August 18, 2025 06:10
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
tensorrt_llm/_torch/models/modeling_llama.py (2)

178-180: Remove dead tuple handling after forward_impl

forward_impl now returns either a Tensor or an Fp4QuantizedTensor (wrapping happens inside), so this tuple case is obsolete. Removing it reduces noise and avoids masking future mismatches.

-        if isinstance(attn_output, tuple):
-            attn_output = Fp4QuantizedTensor(attn_output[0], attn_output[1])

491-498: Llama4DecoderLayer does not pass attention_mask (unlike LlamaDecoderLayer) — verify encoder/prefill use

LlamaDecoderLayer sets attention_mask to FULL when not generation and passes it into self_attn. Llama4DecoderLayer doesn’t set or pass attention_mask, so it will always use the default CAUSAL. If Llama4Model can be used in encoder/prefill-only scenarios (is_generation=False), this will compute a causal mask incorrectly.

Proposed alignment with LlamaDecoderLayer:

@@
 class Llama4DecoderLayer(DecoderLayer):
@@
         self.moe_allreduce = MoEAllReduce(self.mapping)
 
+        # Match LlamaDecoderLayer behavior: use FULL mask when not generating (encoder/prefill).
+        self.attention_mask = PredefinedAttentionMask.CAUSAL
+        if not model_config.is_generation:
+            self.attention_mask = PredefinedAttentionMask.FULL
@@
         # Self Attention
         hidden_states = self.self_attn(
             position_ids=position_ids,
             hidden_states=hidden_states,
             attn_metadata=attn_metadata,
+            attention_mask=self.attention_mask,
             all_reduce_params=AllReduceParams(
                 enable_allreduce=not self.disable_attn_allreduce),
             **kwargs,
         )

If Llama4 is never used in prefill-only mode, feel free to ignore; otherwise, this avoids subtle accuracy regressions.

Also applies to: 655-659

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 026df41 and c7bf0fa.

📒 Files selected for processing (5)
  • tensorrt_llm/_torch/models/modeling_llama.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_speculative.py (1 hunks)
  • tensorrt_llm/_torch/modules/attention.py (3 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunks)
  • tests/integration/test_lists/waives.txt (0 hunks)
💤 Files with no reviewable changes (1)
  • tests/integration/test_lists/waives.txt
🚧 Files skipped from review as they are similar to previous changes (3)
  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tensorrt_llm/_torch/modules/attention.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/models/modeling_llama.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/models/modeling_llama.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/_torch/models/modeling_llama.py (2)

169-176: Fixed: removed unsupported kwarg and switched to forward_impl — LGTM

Dropping the extraneous attention_sinks kwarg and delegating to forward_impl aligns with the new centralized attention path and should resolve the TypeError seen earlier. Passing None placeholders for output scales is consistent with removing per-call scale wiring.


169-176: forward_impl requires explicit None placeholders
The forward_impl signature in tensorrt_llm/_torch/modules/attention.py declares its last three parameters without defaults:

def forward_impl(
    self,
    q: torch.Tensor,
    k: Optional[torch.Tensor],
    v: Optional[torch.Tensor],
    attn_metadata: AttentionMetadata,
    attention_mask: AttentionMask,
    attention_window_size: Optional[int],
    attention_mask_data: Optional[torch.Tensor],
    mrope_config: Optional[dict],
):

Because neither attention_window_size nor attention_mask_data have default values, the two None placeholders are required. The current invocation is correct and must remain as-is.

Likely an incorrect or invalid review comment.

@liji-nv liji-nv force-pushed the cherry-pick-18ccd05 branch 2 times, most recently from 8585a28 to 374d859 on August 18, 2025 06:26
@liji-nv
Collaborator Author

liji-nv commented Aug 18, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15602 [ run ] triggered by Bot

@liji-nv liji-nv force-pushed the cherry-pick-18ccd05 branch from 374d859 to 9032282 on August 18, 2025 07:16
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
tensorrt_llm/_torch/models/modeling_speculative.py (1)

365-370: Make merge of extra_attrs robust; avoid hard assertions and accidental overwrites

Asserting on unexpected keys or missing registries can abort initialization in benign situations. Using dict.update() can clobber existing indices if they collide. Prefer filtering to known registries, ensure the target registry exists, and perform a non-clobbering merge.

Apply this diff:

-            if draft_config is not None:
-                for key, value in draft_config.extra_attrs.items():
-                    assert key in ('attn_layers', 'mla_layers')
-                    assert key in model_config.extra_attrs
-                    model_config.extra_attrs[key].update(value)
+            if draft_config is not None and getattr(draft_config, "extra_attrs", None):
+                for key, value in draft_config.extra_attrs.items():
+                    # Only merge layer registries we rely on
+                    if key not in ("attn_layers", "mla_layers") or not value:
+                        continue
+                    target = model_config.extra_attrs.setdefault(key, {})
+                    # Merge without clobbering existing indices (draft indices should be disjoint)
+                    for idx, ref in value.items():
+                        if idx in target:
+                            # Keep existing target ref; draft indices are expected to be offset/disjoint.
+                            continue
+                        target[idx] = ref
tensorrt_llm/_torch/modules/attention.py (1)

375-385: Harden forward_impl signature to accept ignored kwargs for forward-compatibility

Callers may pass benign extra kwargs (e.g., new flags). Accept and ignore unknown kwargs to avoid breaking changes.

Apply this diff:

-    def forward_impl(
+    def forward_impl(
         self,
         q: torch.Tensor,
         k: Optional[torch.Tensor],
         v: Optional[torch.Tensor],
         attn_metadata: AttentionMetadata,
         attention_mask: AttentionMask,
         attention_window_size: Optional[int],
         attention_mask_data: Optional[torch.Tensor],
-        mrope_config: Optional[dict],
-    ):
+        mrope_config: Optional[dict],
+        **kwargs,
+    ):
🧹 Nitpick comments (1)
tensorrt_llm/_torch/modules/attention.py (1)

473-481: Propagate extra kwargs from forward() to forward_impl

If forward() receives benign extras, pass them through so forward_impl can ignore them. This reduces the risk of subtle breakages when call sites evolve.

Apply this diff:

-        output = self.forward_impl(q,
-                                   k,
-                                   v,
-                                   attn_metadata,
-                                   attention_mask,
-                                   attention_window_size,
-                                   attention_mask_data,
-                                   mrope_config=mrope_config)
+        output = self.forward_impl(q,
+                                   k,
+                                   v,
+                                   attn_metadata,
+                                   attention_mask,
+                                   attention_window_size,
+                                   attention_mask_data,
+                                   mrope_config=mrope_config,
+                                   **kwargs)
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between c7bf0fa and 9032282.

📒 Files selected for processing (5)
  • tensorrt_llm/_torch/models/modeling_llama.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_speculative.py (1 hunks)
  • tensorrt_llm/_torch/modules/attention.py (3 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunks)
  • tests/integration/test_lists/waives.txt (0 hunks)
💤 Files with no reviewable changes (1)
  • tests/integration/test_lists/waives.txt
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tensorrt_llm/_torch/models/modeling_llama.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tensorrt_llm/_torch/modules/attention.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tensorrt_llm/_torch/modules/attention.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/modules/attention.py (1)

95-96: Disabling NVFP4 output on the torch.compile custom-op path — LGTM

Setting enable_attn_nvfp4_output=False here is correct; it avoids an unsupported path under torch.compile for TRTLLM and keeps quantization handling centralized.

@tensorrt-cicd
Collaborator

PR_Github #15602 [ run ] completed with state SUCCESS
/LLM/release-1.0/L0_MergeRequest_PR pipeline #174 completed with status: 'FAILURE'

NVIDIA#6858)

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
@liji-nv liji-nv force-pushed the cherry-pick-18ccd05 branch from 9032282 to 6eac92e on August 18, 2025 13:10
@liji-nv
Collaborator Author

liji-nv commented Aug 18, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15614 [ run ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🔭 Outside diff range comments (1)
tensorrt_llm/_torch/modules/attention.py (1)

1-1: Prepend NVIDIA 2025 copyright header (repo policy).

This Python source file is missing the required NVIDIA copyright header.

Apply this diff at the top of the file:

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 import math
♻️ Duplicate comments (1)
tensorrt_llm/_torch/modules/attention.py (1)

375-385: Harden forward_impl signature for forward-compatibility and add return typing/docstring.

Accepting and ignoring extra kwargs prevents runtime breaks when upstreams pass benign extras (e.g., attention_sinks). Also add a concise docstring and an explicit return type.

Apply this diff:

-    def forward_impl(
+    def forward_impl(
         self,
         q: torch.Tensor,
         k: Optional[torch.Tensor],
         v: Optional[torch.Tensor],
         attn_metadata: AttentionMetadata,
         attention_mask: AttentionMask,
         attention_window_size: Optional[int],
         attention_mask_data: Optional[torch.Tensor],
         mrope_config: Optional[dict],
-    ):
+        **_ignored_kwargs,
+    ) -> Union[torch.Tensor, Fp4QuantizedTensor]:
+        """
+        Centralized attention execution used by compiled TRTLLM/FLASHINFER backends.
+        Accepts and ignores extra keyword arguments for forward-compatibility.
+        Args:
+            q, k, v: QKV tensors (fused or unfused depending on backend).
+            attn_metadata: Attention metadata.
+            attention_mask: Mask spec (e.g., PredefinedAttentionMask.CAUSAL).
+            attention_window_size: Optional sliding-window limit.
+            attention_mask_data: Optional mask payload for runtime backends.
+            mrope_config: Optional dict with mrope_rotary_cos_sin and/or mrope_position_deltas.
+        Returns:
+            Tensor or Fp4QuantizedTensor depending on quantization mode.
+        """

Additionally update imports to include Any only if you decide to annotate _ignored_kwargs values; otherwise it can remain untyped:

# near the top of the file
from typing import Optional, Union, cast  # current
# change to:
from typing import Optional, Union, cast, Any
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 9032282 and 6eac92e.

📒 Files selected for processing (5)
  • tensorrt_llm/_torch/models/modeling_llama.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_speculative.py (1 hunks)
  • tensorrt_llm/_torch/modules/attention.py (3 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunks)
  • tests/integration/test_lists/waives.txt (0 hunks)
💤 Files with no reviewable changes (1)
  • tests/integration/test_lists/waives.txt
🚧 Files skipped from review as they are similar to previous changes (3)
  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tensorrt_llm/_torch/models/modeling_llama.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/modules/attention.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/modules/attention.py
🔇 Additional comments (5)
tensorrt_llm/_torch/modules/attention.py (5)

70-97: Good: In custom-op path, explicitly disable NVFP4 attention output.

Passing enable_attn_nvfp4_output=False avoids NVFP4 output under torch.compile, which is consistent with the comment and avoids unsupported runtime paths.


394-401: Custom in-place op gating under torch.compile and supported backends looks correct.

The condition ensures we only route through the custom op when:

  • the layer is registered (so weakrefs/metadata can be resolved),
  • backend is TRTLLM or FLASHINFER,
  • and torch is compiling.

This minimizes surprises in eager runs.
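
A minimal standalone sketch of that gating predicate, under the assumption that the three conditions above map onto simple flags; the attribute and function names are illustrative, and torch.compiler.is_compiling() is the public PyTorch API for the last check (available in recent PyTorch releases).

import torch

def should_use_custom_inplace_op(layer_is_registered: bool, attn_backend: str) -> bool:
    """Sketch of the eligibility check described above; names are illustrative."""
    return (
        layer_is_registered                       # weakrefs/metadata can be resolved
        and attn_backend in ("TRTLLM", "FLASHINFER")
        and torch.compiler.is_compiling()         # only route here under torch.compile
    )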


401-415: Allocate output once and invoke custom op in-place: efficient and clean.

Using create_output() to pre-allocate and mutating in-place avoids extra allocations/copies. Nice.


416-424: NVFP4 tuple handling preserved in non-custom path.

Wrapping output into Fp4QuantizedTensor when output_sf is present maintains the previous contract and allows o_proj to consume quantized output seamlessly.


473-480: Forward now cleanly delegates to forward_impl.

Centralization reduces duplication and keeps backend routing logic in one place.

@tensorrt-cicd
Collaborator

PR_Github #15614 [ run ] completed with state FAILURE
/LLM/release-1.0/L0_MergeRequest_PR pipeline #176 completed with status: 'FAILURE'

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
@liji-nv
Collaborator Author

liji-nv commented Aug 19, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15690 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15690 [ run ] completed with state SUCCESS
/LLM/release-1.0/L0_MergeRequest_PR pipeline #197 completed with status: 'FAILURE'

@liji-nv
Collaborator Author

liji-nv commented Aug 19, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15748 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15748 [ run ] completed with state SUCCESS
/LLM/release-1.0/L0_MergeRequest_PR pipeline #211 completed with status: 'FAILURE'

@liji-nv
Collaborator Author

liji-nv commented Aug 20, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15854 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15854 [ run ] completed with state SUCCESS
/LLM/release-1.0/L0_MergeRequest_PR pipeline #224 completed with status: 'SUCCESS'

@liji-nv liji-nv merged commit 69846c6 into NVIDIA:release/1.0 Aug 20, 2025
5 checks passed
yuanjingx87 pushed a commit that referenced this pull request Aug 20, 2025
#6978)

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Sep 5, 2025
NVIDIA#6978)

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Sep 5, 2025
NVIDIA#6978)

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Sep 6, 2025
NVIDIA#6978)

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Sep 6, 2025
NVIDIA#6978)

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Sep 7, 2025
NVIDIA#6978)

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
