[#6186][feat] Introduce QKNormRoPEAttention module by Funatiq · Pull Request #6830 · NVIDIA/TensorRT-LLM · GitHub

Conversation

@Funatiq
Collaborator

@Funatiq Funatiq commented Aug 12, 2025

Summary by CodeRabbit

  • New Features

    • Qwen3 models now use a redesigned self-attention that integrates Q/K normalization with rotary positional embeddings for improved stability and output quality, including MoE variants.
  • Performance Improvements

    • Faster inference via a fused attention execution path and parallelized normalization, reducing latency and compute overhead.
  • Compatibility

    • No user-facing API changes; existing Qwen3 workflows remain supported.

Description

  • Added QKNormRoPEAttention class to apply QK normalization and RoPE (a minimal, self-contained sketch of this flow follows after this list).
  • Replaced Qwen3Attention with QKNormRoPEAttention in Qwen3DecoderLayer.
  • Removed unused imports and code related to the previous attention implementation.
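
To make the flow concrete, here is a minimal, self-contained sketch of the non-fused path (per-head RMS normalization of Q and K, followed by rotary position embedding). This is illustrative only: the tensor layout, epsilon, weights, and rotary variant used by the actual QKNormRoPEAttention module are assumptions here, not code from this PR.

import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Per-head RMS normalization over the last (head_dim) axis.
    variance = x.float().pow(2).mean(dim=-1, keepdim=True)
    return (x.float() * torch.rsqrt(variance + eps)).to(x.dtype) * weight

def apply_rope(x: torch.Tensor, position_ids: torch.Tensor, theta: float = 10000.0) -> torch.Tensor:
    # "Rotate-half" RoPE on x of shape (batch, seq, heads, head_dim).
    head_dim = x.shape[-1]
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    angles = position_ids[..., None].float() * inv_freq   # (batch, seq, head_dim // 2)
    cos = angles.cos()[..., None, :]                       # broadcast over the heads axis
    sin = angles.sin()[..., None, :]
    x1, x2 = x[..., : head_dim // 2], x[..., head_dim // 2:]
    return torch.cat((x1 * cos - x2 * sin, x2 * cos + x1 * sin), dim=-1).to(x.dtype)

# QK norm + RoPE: normalize Q and K per head, then rotate both.
batch, seq, heads, head_dim = 2, 4, 8, 64
q = torch.randn(batch, seq, heads, head_dim)
k = torch.randn(batch, seq, heads, head_dim)
pos = torch.arange(seq).expand(batch, seq)
q = apply_rope(rms_norm(q, torch.ones(head_dim)), pos)
k = apply_rope(rms_norm(k, torch.ones(head_dim)), pos)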

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
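
Example invocations combining the flags documented above (stage and GPU names are illustrative; see the CI docs referenced above for the full list):

/bot run
/bot run --disable-fail-fast --gpu-type "A30, H100_PCIe"
/bot run --stage-list "A10-PyTorch-1" --debug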

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

@coderabbitai
Contributor

coderabbitai bot commented Aug 12, 2025

Note

Currently processing new changes in this PR. This may take a few minutes, please wait...

📥 Commits

Reviewing files that changed from the base of the PR and between cebaaa1 and 8660abb.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/models/modeling_exaone4.py (4 hunks)
  • tensorrt_llm/_torch/models/modeling_gemma3.py (3 hunks)
  • tensorrt_llm/_torch/models/modeling_qwen3.py (2 hunks)
  • tensorrt_llm/_torch/modules/qk_norm_attention.py (1 hunks)
 ___________________________________________________________
< Finding disturbances in the code before the Jedi Masters. >
 -----------------------------------------------------------
  \
   \   (\__/)
       (•ㅅ•)
       /   づ
📝 Walkthrough


Replaces Qwen3Attention with a new QKNormRoPEAttention and updates base and MoE decoder layers to instantiate it; introduces fused and non-fused flows that combine Q/K RMS-normalization with RoPE. Other model components (embeddings, MLP, RMSNorm) remain functionally unchanged.

Changes

  • Qwen3 model refactor (tensorrt_llm/_torch/models/modeling_qwen3.py): Removed Qwen3Attention; added Qwen3DecoderLayer that constructs QKNormRoPEAttention; eliminated Q/K-norm and RoPE scaffolding; forward flow preserved aside from attention swap.
  • New attention module, QK-Norm + RoPE (tensorrt_llm/_torch/modules/qk_norm_attention.py): Added QKNormRoPEAttention (subclass of Attention) that initializes RoPE params from pretrained config and implements fused (torch.ops.trtllm.fused_qk_norm_rope) and non-fused flows; adds apply_qk_norm, apply_qk_norm_rope, and overrides apply_rope.
  • MoE model update (tensorrt_llm/_torch/models/modeling_qwen3_moe.py): Replaced Qwen3Attention with QKNormRoPEAttention in Qwen3MoEDecoderLayer; updated imports and initialization arguments accordingly.

Sequence Diagram(s)

sequenceDiagram
    participant Layer as Qwen3DecoderLayer
    participant Attn as QKNormRoPEAttention
    participant Fused as torch.ops.trtllm.fused_qk_norm_rope
    participant Norm as RMSNorm(Q)/RMSNorm(K)
    participant Base as Attention.apply_rope

    Layer->>Attn: forward(qkv or q,k,v, position_ids, ...)
    alt fuse_qk_norm_rope == True
        Attn->>Fused: fused_qk_norm_rope(qkv, heads, head_dim, eps, norm_w, rope_params, position_ids)
        Fused-->>Attn: qkv_with_norm_and_rope
    else Non-fused
        Attn->>Norm: apply_qk_norm(q, k) (parallel)
        Norm-->>Attn: q_norm, k_norm
        Attn->>Base: apply_rope(q_norm, k_norm, v, position_ids)
        Base-->>Attn: q/k/v_with_rope
    end
    Attn-->>Layer: attention outputs
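To complement the diagram, here is a hedged sketch of the apply_rope override and its dispatch between the two paths. The class, method bodies, and attribute names below are simplified stand-ins based on the review comments in this thread, not the PR's exact implementation.

class Attention:
    # Placeholder base class; the real one provides the standard RoPE path.
    def apply_rope(self, q, k, v, position_ids):
        raise NotImplementedError

class QKNormRoPEAttentionSketch(Attention):
    # Sketch of the control flow only; the real tensor work is elided.

    def __init__(self, fuse_qk_norm_rope: bool = True):
        self.fuse_qk_norm_rope = fuse_qk_norm_rope

    def apply_qk_norm(self, q, k):
        # Real module: per-head RMSNorm on Q and K, run in parallel on an aux CUDA stream.
        return q, k

    def apply_qk_norm_rope(self, qkv, position_ids):
        # Real module: torch.ops.trtllm.fused_qk_norm_rope updates qkv in place.
        return qkv, None, None

    def apply_rope(self, q, k, v, position_ids):
        if self.fuse_qk_norm_rope:
            # Fused path: q still holds the concatenated QKV projection; k and v are None.
            assert k is None and v is None
            return self.apply_qk_norm_rope(q, position_ids)
        # Non-fused path: normalize Q/K first, then let the base Attention apply RoPE.
        q, k = self.apply_qk_norm(q, k)
        return super().apply_rope(q, k, v, position_ids)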

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~35 minutes

Suggested labels

Community want to contribute

Suggested reviewers

  • shaharmor98
  • nv-yilinf
  • byshiue


@Funatiq
Collaborator Author

Funatiq commented Aug 12, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #14981 [ run ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🔭 Outside diff range comments (1)
tensorrt_llm/_torch/modules/qk_norm_attention.py (1)

1-112: Please add comprehensive unit tests for the QKNormRoPEAttention module.

No existing tests cover this new module—let’s ensure it’s thoroughly exercised:

• Create a new test file (e.g. tests/unittest/_torch/test_qk_norm_attention.py)
• Cover both execution paths:
– fuse_qk_norm_rope=True (fused QK-norm+RoPE)
– fuse_qk_norm_rope=False (separate QK-norm then RoPE)
• Test a variety of input shapes and configurations:
– Single sequence vs. batch sequences
– Varying num_heads and num_key_value_heads
• Edge cases:
– Empty tensors (zero-length sequence)
– Invalid or out-of-range position_ids (should error or handle gracefully)
• Verify attention internals:
– Q and K outputs are correctly normalized (e.g. compare against standalone RMSNorm)
– RoPE is applied correctly (match baseline implementation)
• Parallel execution path via maybe_execute_in_parallel (ensure both streams complete)

Adding these tests will catch regressions in both fused and non-fused code paths and validate the QK normalization logic.
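
Along those lines, a minimal, hypothetical pytest stub (file location, helper names, and construction details are assumptions, not existing code) that exercises both paths and skips the fused one when the custom op is not built:

# tests/unittest/_torch/test_qk_norm_attention.py (hypothetical location)
import pytest
import torch

def fused_op_available() -> bool:
    # Same availability check suggested elsewhere in this review.
    return hasattr(torch.ops, "trtllm") and hasattr(torch.ops.trtllm, "fused_qk_norm_rope")

@pytest.mark.parametrize("fuse_qk_norm_rope", [False, True])
@pytest.mark.parametrize("seq_len,num_heads,num_kv_heads", [(1, 8, 8), (16, 8, 2)])
def test_qk_norm_rope_paths(fuse_qk_norm_rope, seq_len, num_heads, num_kv_heads):
    if not torch.cuda.is_available():
        pytest.skip("QKNormRoPEAttention uses CUDA streams; skipping on CPU-only hosts")
    if fuse_qk_norm_rope and not fused_op_available():
        pytest.skip("fused_qk_norm_rope custom op not available in this build")
    # 1. Build a small QKNormRoPEAttention with (num_heads, num_kv_heads) and seq_len tokens.
    # 2. Run identical inputs through the fused and non-fused paths.
    # 3. Compare Q/K against a standalone RMSNorm + reference RoPE within tolerance.
    ...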

🧹 Nitpick comments (2)
tensorrt_llm/_torch/modules/qk_norm_attention.py (2)

65-66: Consider adding CUDA availability check.

The code directly creates CUDA streams and events without checking if CUDA is available. This could cause issues in CPU-only environments or during testing.

Consider adding a CUDA availability check:

-        self.aux_stream = torch.cuda.Stream()
-        self.ln_events = [torch.cuda.Event(), torch.cuda.Event()]
+        if torch.cuda.is_available():
+            self.aux_stream = torch.cuda.Stream()
+            self.ln_events = [torch.cuda.Event(), torch.cuda.Event()]
+        else:
+            self.aux_stream = None
+            self.ln_events = None

And update the apply_qk_norm method to handle the None case appropriately.


109-111: Improve assertion error message.

The assertion provides a clear constraint but could be more informative about why this requirement exists.

-        assert k is None and v is None, "The input should be a concatenated qkv tensor to apply_qk_norm_rope"
+        assert k is None and v is None, (
+            "When fuse_qk_norm_rope=True, the input must be a concatenated qkv tensor "
+            "(k and v should be None) for the fused operator to work correctly"
+        )
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a060e12 and befbf0f.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/models/modeling_qwen3.py (2 hunks)
  • tensorrt_llm/_torch/modules/qk_norm_attention.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/modules/qk_norm_attention.py
  • tensorrt_llm/_torch/models/modeling_qwen3.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/modules/qk_norm_attention.py
  • tensorrt_llm/_torch/models/modeling_qwen3.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/qk_norm_attention.py

50-50: Line too long (170 > 120)

(E501)

🔇 Additional comments (2)
tensorrt_llm/_torch/models/modeling_qwen3.py (2)

13-13: LGTM! Clean integration of the new attention module.

The import change correctly replaces the previous attention implementation with the new QKNormRoPEAttention module.


30-33: LGTM! Proper instantiation of QKNormRoPEAttention.

The replacement of Qwen3Attention with QKNormRoPEAttention is cleanly implemented, maintaining the same interface with model_config and layer_idx parameters.

@tensorrt-cicd
Collaborator

PR_Github #14981 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #11311 completed with status: 'FAILURE'

@Funatiq
Collaborator Author

Funatiq commented Aug 12, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #14985 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #14985 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11315 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@Funatiq
Collaborator Author

Funatiq commented Aug 13, 2025

/bot run

@Funatiq
Collaborator Author

Funatiq commented Aug 13, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15115 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15117 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15115 [ run ] completed with state ABORTED

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🔭 Outside diff range comments (2)
tensorrt_llm/_torch/models/modeling_qwen3_moe.py (1)

1-1: Add NVIDIA copyright header (2025) at file top.

Required by coding guidelines for all source files.

Apply this diff:

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
 import os
tensorrt_llm/_torch/models/modeling_qwen3.py (1)

1-1: Add NVIDIA copyright header (2025) at file top.

Required by coding guidelines.

Apply this diff:

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
 from typing import Optional, Tuple
♻️ Duplicate comments (2)
tensorrt_llm/_torch/modules/qk_norm_attention.py (2)

66-73: Use configured RMSNorm epsilon instead of hardcoded 1e-6.

Keep normalization consistent with model config by using the configured eps.

Apply this diff:

-        self.q_norm = RMSNorm(hidden_size=self.head_dim,
-                              eps=1e-6,
-                              dtype=pretrained_config.torch_dtype,
-                              has_weights=True)
-        self.k_norm = RMSNorm(hidden_size=self.head_dim,
-                              eps=1e-6,
-                              dtype=pretrained_config.torch_dtype,
-                              has_weights=True)
+        self.q_norm = RMSNorm(hidden_size=self.head_dim,
+                              eps=pretrained_config.rms_norm_eps,
+                              dtype=pretrained_config.torch_dtype,
+                              has_weights=True)
+        self.k_norm = RMSNorm(hidden_size=self.head_dim,
+                              eps=pretrained_config.rms_norm_eps,
+                              dtype=pretrained_config.torch_dtype,
+                              has_weights=True)

97-104: Wrap fused operator call with availability check and clear error handling.

Improve robustness and debuggability if the custom op is missing or fails.

Apply this diff:

-    def apply_qk_norm_rope(self, qkv, position_ids):
-        torch.ops.trtllm.fused_qk_norm_rope(
-            qkv, self.num_heads, self.num_key_value_heads,
-            self.num_key_value_heads, self.head_dim,
-            self.q_norm.variance_epsilon, self.q_norm.weight,
-            self.k_norm.weight, self.pos_embd_params.rope.theta,
-            self.pos_embd_params.is_neox, position_ids.view(-1))
-        return qkv, None, None
+    def apply_qk_norm_rope(self, qkv, position_ids):
+        op = getattr(torch.ops.trtllm, "fused_qk_norm_rope", None)
+        if op is None:
+            raise RuntimeError(
+                "torch.ops.trtllm.fused_qk_norm_rope is not available in this build."
+            )
+        try:
+            op(qkv,
+               self.num_heads,
+               self.num_key_value_heads,
+               self.num_key_value_heads,  # verify this argument; see follow-up comment
+               self.head_dim,
+               self.q_norm.variance_epsilon,
+               self.q_norm.weight,
+               self.k_norm.weight,
+               self.pos_embd_params.rope.theta,
+               self.pos_embd_params.is_neox,
+               position_ids.view(-1))
+        except Exception as e:
+            raise RuntimeError(f"fused_qk_norm_rope failed: {e}") from e
+        return qkv, None, None
🧹 Nitpick comments (1)
tensorrt_llm/_torch/modules/qk_norm_attention.py (1)

77-86: Add short docstrings for internal helpers.

Concise docstrings aid maintainability and clarify expected shapes/semantics.

Apply this diff:

     def apply_qk_norm(self, q, k):
+        """Apply RMSNorm to Q and K in parallel using an auxiliary CUDA stream.
+
+        Args:
+            q (torch.Tensor): Query tensor shaped (..., q_size, head_dim) flattened internally.
+            k (torch.Tensor): Key tensor shaped (..., kv_size, head_dim) flattened internally.
+        Returns:
+            Tuple[torch.Tensor, torch.Tensor]: Normalized Q and K with original leading dims restored.
+        """
@@
     def apply_qk_norm_rope(self, qkv, position_ids):
+        """Apply fused Q/K RMSNorm and RoPE via custom TRTLLM operator.
+
+        Args:
+            qkv (torch.Tensor): Concatenated QKV projection tensor before splitting.
+            position_ids (torch.Tensor): Positions shaped (batch*seq,) or broadcastable.
+        Returns:
+            Tuple[torch.Tensor, None, None]: Modified qkv and placeholders to match base interface.
+        """

Also applies to: 97-104

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0c51a59 and 0443a6d.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/models/modeling_qwen3.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_qwen3_moe.py (2 hunks)
  • tensorrt_llm/_torch/modules/qk_norm_attention.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/models/modeling_qwen3.py
  • tensorrt_llm/_torch/models/modeling_qwen3_moe.py
  • tensorrt_llm/_torch/modules/qk_norm_attention.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/models/modeling_qwen3.py
  • tensorrt_llm/_torch/models/modeling_qwen3_moe.py
  • tensorrt_llm/_torch/modules/qk_norm_attention.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/_torch/modules/qk_norm_attention.py (1)

98-103: Please confirm the fused_qk_norm_rope operator signature
We couldn’t locate the C++/CUDA registration in this repo, yet the Python calls pass num_key_value_heads twice (3rd and 4th args). Verify in the native plugin that these parameters match the operator’s expected signature to avoid subtle bugs.

• tensorrt_llm/_torch/modules/qk_norm_attention.py: line 98
• tensorrt_llm/_torch/models/modeling_exaone4.py: line 125
• tests/unittest/_torch/thop/test_fused_qk_norm_rope.py: line 146

tensorrt_llm/_torch/models/modeling_qwen3_moe.py (1)

169-179: Switch to QKNormRoPEAttention looks correct and consistent.

Initialization matches the new module’s constructor (hidden_size, heads, kv_heads, positions, bias, dtype, dense_bias, config). No pos_embd_params is passed here, which aligns with QKNormRoPEAttention computing it from config.

@Funatiq
Collaborator Author

Funatiq commented Aug 13, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15137 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15117 [ run ] completed with state ABORTED
/LLM/main/L0_MergeRequest_PR pipeline #11413 completed with status: 'FAILURE'

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🔭 Outside diff range comments (1)
tensorrt_llm/_torch/models/modeling_qwen3.py (1)

56-66: Fix forward() return annotation to match actual return type

forward() currently annotates a single torch.Tensor but returns a tuple (hidden_states, residual). Update the annotation to prevent type-checking confusion and reader errors.

Apply this diff:

-    ) -> torch.Tensor:
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
♻️ Duplicate comments (4)
tensorrt_llm/_torch/models/modeling_qwen3.py (2)

22-29: Good fix: init no longer has an invalid return annotation

Previous feedback about init returning a Tuple is addressed; the signature now correctly returns None implicitly.


30-40: Correct removal of unsupported pos_embd_params when constructing QKNormRoPEAttention

The constructor no longer passes pos_embd_params, aligning with QKNormRoPEAttention’s API. This avoids a TypeError at instantiation.

tensorrt_llm/_torch/modules/qk_norm_attention.py (2)

81-88: Use model-configured RMSNorm epsilon instead of hardcoded 1e-6

Match other layers’ eps and avoid silent divergence across components.

Apply this diff:

-        self.q_norm = RMSNorm(hidden_size=self.head_dim,
-                              eps=1e-6,
+        self.q_norm = RMSNorm(hidden_size=self.head_dim,
+                              eps=pretrained_config.rms_norm_eps,
                               dtype=pretrained_config.torch_dtype,
                               has_weights=True)
-        self.k_norm = RMSNorm(hidden_size=self.head_dim,
-                              eps=1e-6,
+        self.k_norm = RMSNorm(hidden_size=self.head_dim,
+                              eps=pretrained_config.rms_norm_eps,
                               dtype=pretrained_config.torch_dtype,
                               has_weights=True)

112-120: Wrap fused operator call with availability check and clear error handling

Improve debuggability when the custom op is missing or fails.

Apply this diff:

-        torch.ops.trtllm.fused_qk_norm_rope(
-            qkv, self.num_heads, self.num_key_value_heads,
-            self.num_key_value_heads, self.head_dim,
-            self.q_norm.variance_epsilon, self.q_norm.weight,
-            self.k_norm.weight, self.pos_embd_params.rope.theta,
-            self.pos_embd_params.is_neox, position_ids.view(-1))
+        if not hasattr(torch.ops, "trtllm") or not hasattr(torch.ops.trtllm, "fused_qk_norm_rope"):
+            raise RuntimeError("torch.ops.trtllm.fused_qk_norm_rope is not available. "
+                               "Ensure the TRT-LLM extension is built and loaded.")
+        try:
+            torch.ops.trtllm.fused_qk_norm_rope(
+                qkv, self.num_heads, self.num_key_value_heads,
+                self.num_key_value_heads, self.head_dim,
+                self.q_norm.variance_epsilon, self.q_norm.weight,
+                self.k_norm.weight, self.pos_embd_params.rope.theta,
+                self.pos_embd_params.is_neox, position_ids.view(-1))
+        except (RuntimeError, AttributeError) as e:
+            raise RuntimeError(f"Failed to apply fused_qk_norm_rope: {e}") from e

If helpful, I can add a lightweight unit test stub that conditionally skips when the fused op is unavailable and validates the non-fused path equivalence on small inputs.

🧹 Nitpick comments (2)
tensorrt_llm/_torch/models/modeling_qwen3.py (1)

130-134: Minor readability: simplify the XOR check for inputs selection

The XOR condition is correct but less common in codebases. A more idiomatic form is equality on None-ness.

Apply this diff:

-        if (input_ids is None) ^ (inputs_embeds is not None):
+        if (input_ids is None) == (inputs_embeds is None):
             raise ValueError(
                 "You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one"
             )
tensorrt_llm/_torch/modules/qk_norm_attention.py (1)

28-33: Docstring could specify expected tensor shapes and execution paths

Consider briefly documenting the expected input shapes for q, k, v in both fused and non-fused modes and how position_ids should be shaped. This reduces ambiguity for future maintainers.

I can add a short Google-style docstring with shapes and examples if you’d like.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0443a6d and cebaaa1.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/models/modeling_qwen3.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_qwen3_moe.py (2 hunks)
  • tensorrt_llm/_torch/modules/qk_norm_attention.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tensorrt_llm/_torch/models/modeling_qwen3_moe.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/models/modeling_qwen3.py
  • tensorrt_llm/_torch/modules/qk_norm_attention.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/models/modeling_qwen3.py
  • tensorrt_llm/_torch/modules/qk_norm_attention.py
🧠 Learnings (2)
📚 Learning: 2025-08-12T10:28:57.288Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-12T10:28:57.288Z
Learning: Applies to **/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py} : Prepend NVIDIA copyright header (current year) to all source files

Applied to files:

  • tensorrt_llm/_torch/modules/qk_norm_attention.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tensorrt_llm/_torch/modules/qk_norm_attention.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (4)
tensorrt_llm/_torch/models/modeling_qwen3.py (1)

141-149: Confirm non-Optional position_ids expectation

Qwen3DecoderLayer.forward expects position_ids: torch.IntTensor (non-Optional), but Qwen3Model.forward accepts Optional and forwards it. Ensure position_ids is always provided at runtime; otherwise, type checkers and future refactors might miss a None case.

Would you like me to scan the repository for call sites that invoke Qwen3Model.forward without position_ids (e.g., decoder-only paths that rely on attn_metadata) and report them?

tensorrt_llm/_torch/modules/qk_norm_attention.py (3)

1-2: Header compliance: SPDX header present and correct year

Meets the repository’s requirement to prepend NVIDIA headers.


72-75: Correct rope_fusion gating relative to fuse_qk_norm_rope

The rope_fusion=not self.fuse_qk_norm_rope setting matches the comment and prevents duplicate RoPE application when the fused QK-norm+RoPE path is used.


112-118: No action required: fused_qk_norm_rope arguments are correct

I’ve confirmed in cpp/tensorrt_llm/thop/fusedQKNormRopeOp.cpp that the operator signature is:

fused_qk_norm_rope(
  Tensor& qkv,
  int64 num_heads_q,
  int64 num_heads_k,
  int64 num_heads_v,
  int64 head_dim,
  float eps,
  Tensor q_weight,
  Tensor k_weight,
  float base,
  bool is_neox,
  Tensor position_ids
)

Passing self.num_heads, self.num_key_value_heads, self.num_key_value_heads for q/k/v aligns exactly with that signature—and the pytest in tests/unittest/_torch/thop/test_fused_qk_norm_rope.py uses the same ordering. No changes needed.

@tensorrt-cicd
Collaborator

PR_Github #15137 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11430 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@Funatiq Funatiq marked this pull request as ready for review August 13, 2025 20:15
@Funatiq Funatiq requested review from a team as code owners August 13, 2025 20:15
@Funatiq Funatiq requested review from 2ez4bz, QiJune and byshiue August 13, 2025 20:15
@tensorrt-cicd
Collaborator

PR_Github #15814 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11887 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@Funatiq
Collaborator Author

Funatiq commented Aug 28, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16878 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16878 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12675 completed with status: 'FAILURE'

@Funatiq
Collaborator Author

Funatiq commented Aug 28, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16894 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16894 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12691 completed with status: 'FAILURE'

@Funatiq
Collaborator Author

Funatiq commented Aug 29, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16974 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16974 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #12742 completed with status: 'FAILURE'

…hanism in Qwen3 model

- Added QKNormRoPEAttention class to apply QK normalization and RoPE.
- Replaced Qwen3Attention with QKNormRoPEAttention in Qwen3DecoderLayer and Qwen3MoEDecoderLayer.
- Removed unused imports and code related to the previous attention implementation.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
- Updated QKNormRoPEAttention constructor to accept parameters directly instead of a model_config object.
- Modified Qwen3DecoderLayer and Qwen3MoEDecoderLayer to pass specific configuration values to QKNormRoPEAttention.
- Enhanced clarity and flexibility in attention layer configuration.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
- Included SPDX license headers for compliance and clarity.
- Ensured proper licensing information is present in the QKNormAttention implementation.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
- Enhanced QKNormRoPEAttention to accept positional embedding parameters.
- Inherit QKNormRoPEAttention in Qwen3Attention, specialize it for Qwen3 model.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
- Updated Gemma3Attention class to extend QKNormRoPEAttention, enhancing its functionality.
- Removed redundant QK normalization methods and integrated them into the new parent class.
- Adjusted initialization parameters to align with the new attention mechanism.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
- Updated Exaone4Attention class to extend QKNormRoPEAttention, streamlining the attention mechanism.
- Removed redundant QK normalization methods and integrated functionality into the parent class.
- Adjusted initialization parameters for improved clarity and flexibility in attention layer configuration.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
- Enhanced the logic to check for yarn usage by verifying both rope_scaling and rope_type in the configuration.
- Improved code clarity by updating comments to reflect the new checks.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
@Funatiq
Collaborator Author

Funatiq commented Sep 3, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #17529 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17529 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13178 completed with status: 'SUCCESS'

@Funatiq Funatiq merged commit a95d961 into NVIDIA:main Sep 5, 2025
5 checks passed
@Funatiq Funatiq deleted the dev/feat/qk_norm branch September 5, 2025 12:04
Wong4j pushed a commit to Wong4j/TensorRT-LLM that referenced this pull request Sep 20, 2025
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
