[#6186][feat] Introduce QKNormRoPEAttention module #6830
Conversation
📝 Walkthrough

Replaces Qwen3Attention with a new QKNormRoPEAttention and updates base and MoE decoder layers to instantiate it; introduces fused and non-fused flows that combine Q/K RMS-normalization with RoPE. Other model components (embeddings, MLP, RMSNorm) remain functionally unchanged.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Layer as Qwen3DecoderLayer
    participant Attn as QKNormRoPEAttention
    participant Fused as torch.ops.trtllm.fused_qk_norm_rope
    participant Norm as RMSNorm(Q)/RMSNorm(K)
    participant Base as Attention.apply_rope
    Layer->>Attn: forward(qkv or q,k,v, position_ids, ...)
    alt fuse_qk_norm_rope == True
        Attn->>Fused: fused_qk_norm_rope(qkv, heads, head_dim, eps, norm_w, rope_params, position_ids)
        Fused-->>Attn: qkv_with_norm_and_rope
    else Non-fused
        Attn->>Norm: apply_qk_norm(q, k) (parallel)
        Norm-->>Attn: q_norm, k_norm
        Attn->>Base: apply_rope(q_norm, k_norm, v, position_ids)
        Base-->>Attn: q/k/v_with_rope
    end
    Attn-->>Layer: attention outputs
```
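To make the control flow above concrete, here is a minimal sketch of the dispatch between the two paths. It is not the PR's actual implementation: the attribute names q_size/kv_size and the exact split layout are assumptions for illustration; only fuse_qk_norm_rope, apply_qk_norm_rope, apply_qk_norm and apply_rope come from the diagram.

```python
import torch


def qk_norm_rope_forward(attn, qkv: torch.Tensor, position_ids: torch.Tensor):
    """Illustrative dispatch between the fused and non-fused paths shown above."""
    if attn.fuse_qk_norm_rope:
        # Fused path: the custom op normalizes Q/K and applies RoPE on the
        # concatenated QKV tensor in one kernel.
        return attn.apply_qk_norm_rope(qkv, position_ids)
    # Non-fused path: split QKV (q_size/kv_size are assumed attributes),
    # RMS-normalize Q and K, then reuse the base class RoPE.
    q, k, v = qkv.split([attn.q_size, attn.kv_size, attn.kv_size], dim=-1)
    q, k = attn.apply_qk_norm(q, k)
    return attn.apply_rope(q, k, v, position_ids)
```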
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~35 minutes

Suggested labels

Suggested reviewers
/bot run

PR_Github #14981 [ run ] triggered by Bot
Actionable comments posted: 3
🔭 Outside diff range comments (1)
tensorrt_llm/_torch/modules/qk_norm_attention.py (1)
1-112: Please add comprehensive unit tests for the QKNormRoPEAttention module.

No existing tests cover this new module, so it should be thoroughly exercised:
• Create a new test file (e.g. tests/unittest/_torch/test_qk_norm_attention.py)
• Cover both execution paths:
– fuse_qk_norm_rope=True (fused QK-norm+RoPE)
– fuse_qk_norm_rope=False (separate QK-norm then RoPE)
• Test a variety of input shapes and configurations:
– Single sequence vs. batch sequences
– Varying num_heads and num_key_value_heads
• Edge cases:
– Empty tensors (zero-length sequence)
– Invalid or out-of-range position_ids (should error or handle gracefully)
• Verify attention internals:
– Q and K outputs are correctly normalized (e.g. compare against standalone RMSNorm)
– RoPE is applied correctly (match baseline implementation)
• Parallel execution path via maybe_execute_in_parallel (ensure both streams complete)

Adding these tests will catch regressions in both fused and non-fused code paths and validate the QK normalization logic; a sketch of such a test file is shown below.
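As a starting point, a hedged sketch of such a test file; the file path and tolerances are suggestions, and the attention module would still need to be constructed with the final API:

```python
# tests/unittest/_torch/test_qk_norm_attention.py -- illustrative sketch only.
import pytest
import torch


def rms_norm_ref(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Reference RMSNorm used to check the module's Q/K normalization."""
    variance = x.float().pow(2).mean(-1, keepdim=True)
    return (x.float() * torch.rsqrt(variance + eps)).to(x.dtype) * weight


@pytest.mark.parametrize("fuse_qk_norm_rope", [False, True])
@pytest.mark.parametrize("seq_len", [1, 16])
def test_qk_norm_rope_attention(fuse_qk_norm_rope: bool, seq_len: int):
    if fuse_qk_norm_rope and not torch.cuda.is_available():
        pytest.skip("fused path requires the CUDA custom op")
    torch.manual_seed(0)
    # Construct QKNormRoPEAttention with a small config here, run forward on
    # random inputs of shape (seq_len, hidden_size), and compare the
    # normalized Q/K against rms_norm_ref and a baseline RoPE implementation.
    ...
```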
🧹 Nitpick comments (2)
tensorrt_llm/_torch/modules/qk_norm_attention.py (2)
65-66: Consider adding a CUDA availability check.

The code directly creates CUDA streams and events without checking whether CUDA is available, which could cause issues in CPU-only environments or during testing. Consider guarding the stream and event creation:
```diff
-        self.aux_stream = torch.cuda.Stream()
-        self.ln_events = [torch.cuda.Event(), torch.cuda.Event()]
+        if torch.cuda.is_available():
+            self.aux_stream = torch.cuda.Stream()
+            self.ln_events = [torch.cuda.Event(), torch.cuda.Event()]
+        else:
+            self.aux_stream = None
+            self.ln_events = None
```

And update the apply_qk_norm method to handle the None case appropriately, for example as sketched below.
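One possible shape for that fallback (a sketch only; whether maybe_execute_in_parallel accepts a None stream, and its exact signature, are assumptions that would need checking against the module):

```python
def apply_qk_norm(self, q, k):
    def q_l2norm():
        return self.q_norm(q.reshape(-1, self.head_dim)).reshape(q.shape)

    def k_l2norm():
        return self.k_norm(k.reshape(-1, self.head_dim)).reshape(k.shape)

    if self.aux_stream is None:
        # CPU-only or test environments: run the two norms sequentially.
        return q_l2norm(), k_l2norm()
    # CUDA available: overlap the two norms on the main and auxiliary streams.
    return maybe_execute_in_parallel(q_l2norm, k_l2norm, self.ln_events[0],
                                     self.ln_events[1], self.aux_stream)
```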
109-111: Improve the assertion error message.

The assertion states the constraint clearly but could be more informative about why the requirement exists.
```diff
-        assert k is None and v is None, "The input should be a concatenated qkv tensor to apply_qk_norm_rope"
+        assert k is None and v is None, (
+            "When fuse_qk_norm_rope=True, the input must be a concatenated qkv tensor "
+            "(k and v should be None) for the fused operator to work correctly"
+        )
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- tensorrt_llm/_torch/models/modeling_qwen3.py (2 hunks)
- tensorrt_llm/_torch/modules/qk_norm_attention.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
- tensorrt_llm/_torch/modules/qk_norm_attention.py
- tensorrt_llm/_torch/models/modeling_qwen3.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
- tensorrt_llm/_torch/modules/qk_norm_attention.py
- tensorrt_llm/_torch/models/modeling_qwen3.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/qk_norm_attention.py
50-50: Line too long (170 > 120)
(E501)
🔇 Additional comments (2)
tensorrt_llm/_torch/models/modeling_qwen3.py (2)
13-13: LGTM! Clean integration of the new attention module.

The import change correctly replaces the previous attention implementation with the new QKNormRoPEAttention module.
30-33: LGTM! Proper instantiation of QKNormRoPEAttention.

The replacement of Qwen3Attention with QKNormRoPEAttention is cleanly implemented, maintaining the same interface with model_config and layer_idx parameters.
befbf0f to 0c51a59 Compare
PR_Github #14981 [ run ] completed with state

/bot run

PR_Github #14985 [ run ] triggered by Bot

PR_Github #14985 [ run ] completed with state

/bot run
94841cd to 0443a6d Compare
/bot run

PR_Github #15115 [ run ] triggered by Bot

PR_Github #15117 [ run ] triggered by Bot

PR_Github #15115 [ run ] completed with state
Actionable comments posted: 3
🔭 Outside diff range comments (2)
tensorrt_llm/_torch/models/modeling_qwen3_moe.py (1)
1-1: Add NVIDIA copyright header (2025) at file top.

Required by the coding guidelines for all source files.
Apply this diff:
```diff
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
 import os
```

tensorrt_llm/_torch/models/modeling_qwen3.py (1)
1-1: Add NVIDIA copyright header (2025) at file top.

Required by the coding guidelines.
Apply this diff:
```diff
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
 from typing import Optional, Tuple
```
♻️ Duplicate comments (2)
tensorrt_llm/_torch/modules/qk_norm_attention.py (2)
66-73: Use the configured RMSNorm epsilon instead of hardcoded 1e-6.

Keep normalization consistent with the model config by using the configured eps.
Apply this diff:
```diff
-        self.q_norm = RMSNorm(hidden_size=self.head_dim,
-                              eps=1e-6,
-                              dtype=pretrained_config.torch_dtype,
-                              has_weights=True)
-        self.k_norm = RMSNorm(hidden_size=self.head_dim,
-                              eps=1e-6,
-                              dtype=pretrained_config.torch_dtype,
-                              has_weights=True)
+        self.q_norm = RMSNorm(hidden_size=self.head_dim,
+                              eps=pretrained_config.rms_norm_eps,
+                              dtype=pretrained_config.torch_dtype,
+                              has_weights=True)
+        self.k_norm = RMSNorm(hidden_size=self.head_dim,
+                              eps=pretrained_config.rms_norm_eps,
+                              dtype=pretrained_config.torch_dtype,
+                              has_weights=True)
```
97-104: Wrap the fused operator call with an availability check and clear error handling.

Improve robustness and debuggability if the custom op is missing or fails.
Apply this diff:
```diff
-    def apply_qk_norm_rope(self, qkv, position_ids):
-        torch.ops.trtllm.fused_qk_norm_rope(
-            qkv, self.num_heads, self.num_key_value_heads,
-            self.num_key_value_heads, self.head_dim,
-            self.q_norm.variance_epsilon, self.q_norm.weight,
-            self.k_norm.weight, self.pos_embd_params.rope.theta,
-            self.pos_embd_params.is_neox, position_ids.view(-1))
-        return qkv, None, None
+    def apply_qk_norm_rope(self, qkv, position_ids):
+        op = getattr(torch.ops.trtllm, "fused_qk_norm_rope", None)
+        if op is None:
+            raise RuntimeError(
+                "torch.ops.trtllm.fused_qk_norm_rope is not available in this build."
+            )
+        try:
+            op(qkv,
+               self.num_heads,
+               self.num_key_value_heads,
+               self.num_key_value_heads,  # verify this argument; see follow-up comment
+               self.head_dim,
+               self.q_norm.variance_epsilon,
+               self.q_norm.weight,
+               self.k_norm.weight,
+               self.pos_embd_params.rope.theta,
+               self.pos_embd_params.is_neox,
+               position_ids.view(-1))
+        except Exception as e:
+            raise RuntimeError(f"fused_qk_norm_rope failed: {e}") from e
+        return qkv, None, None
```
🧹 Nitpick comments (1)
tensorrt_llm/_torch/modules/qk_norm_attention.py (1)
77-86: Add short docstrings for internal helpers.

Concise docstrings aid maintainability and clarify expected shapes/semantics.
Apply this diff:
```diff
     def apply_qk_norm(self, q, k):
+        """Apply RMSNorm to Q and K in parallel using an auxiliary CUDA stream.
+
+        Args:
+            q (torch.Tensor): Query tensor shaped (..., q_size, head_dim) flattened internally.
+            k (torch.Tensor): Key tensor shaped (..., kv_size, head_dim) flattened internally.
+        Returns:
+            Tuple[torch.Tensor, torch.Tensor]: Normalized Q and K with original leading dims restored.
+        """
@@
     def apply_qk_norm_rope(self, qkv, position_ids):
+        """Apply fused Q/K RMSNorm and RoPE via custom TRTLLM operator.
+
+        Args:
+            qkv (torch.Tensor): Concatenated QKV projection tensor before splitting.
+            position_ids (torch.Tensor): Positions shaped (batch*seq,) or broadcastable.
+        Returns:
+            Tuple[torch.Tensor, None, None]: Modified qkv and placeholders to match base interface.
+        """
```

Also applies to: 97-104
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- tensorrt_llm/_torch/models/modeling_qwen3.py (1 hunks)
- tensorrt_llm/_torch/models/modeling_qwen3_moe.py (2 hunks)
- tensorrt_llm/_torch/modules/qk_norm_attention.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
- tensorrt_llm/_torch/models/modeling_qwen3.py
- tensorrt_llm/_torch/models/modeling_qwen3_moe.py
- tensorrt_llm/_torch/modules/qk_norm_attention.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
- tensorrt_llm/_torch/models/modeling_qwen3.py
- tensorrt_llm/_torch/models/modeling_qwen3_moe.py
- tensorrt_llm/_torch/modules/qk_norm_attention.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/_torch/modules/qk_norm_attention.py (1)
98-103: Please confirm the fused_qk_norm_rope operator signature.

We couldn't locate the C++/CUDA registration in this repo, yet the Python calls pass num_key_value_heads twice (3rd and 4th args). Verify in the native plugin that these parameters match the operator's expected signature to avoid subtle bugs.

• tensorrt_llm/_torch/modules/qk_norm_attention.py: line 98
• tensorrt_llm/_torch/models/modeling_exaone4.py: line 125
• tests/unittest/_torch/thop/test_fused_qk_norm_rope.py: line 146

tensorrt_llm/_torch/models/modeling_qwen3_moe.py (1)
169-179: Switch to QKNormRoPEAttention looks correct and consistent.

Initialization matches the new module's constructor (hidden_size, heads, kv_heads, positions, bias, dtype, dense_bias, config). No pos_embd_params is passed here, which aligns with QKNormRoPEAttention computing it from config.
0443a6d to cebaaa1 Compare
/bot run

PR_Github #15137 [ run ] triggered by Bot

PR_Github #15117 [ run ] completed with state
Actionable comments posted: 4
🔭 Outside diff range comments (1)
tensorrt_llm/_torch/models/modeling_qwen3.py (1)
56-66: Fix the forward() return annotation to match the actual return type.

forward() currently annotates a single torch.Tensor but returns a tuple (hidden_states, residual). Update the annotation to prevent type-checking confusion and reader errors.
Apply this diff:
```diff
-    ) -> torch.Tensor:
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
```
♻️ Duplicate comments (4)
tensorrt_llm/_torch/models/modeling_qwen3.py (2)
22-29: Good fix: __init__ no longer has an invalid return annotation.

Previous feedback about __init__ returning a Tuple is addressed; the signature now correctly returns None implicitly.
30-40: Correct removal of unsupported pos_embd_params when constructing QKNormRoPEAttention.

The constructor no longer passes pos_embd_params, aligning with QKNormRoPEAttention's API. This avoids a TypeError at instantiation.
tensorrt_llm/_torch/modules/qk_norm_attention.py (2)
81-88: Use the model-configured RMSNorm epsilon instead of hardcoded 1e-6.

Match other layers' eps and avoid silent divergence across components.
Apply this diff:
```diff
-        self.q_norm = RMSNorm(hidden_size=self.head_dim,
-                              eps=1e-6,
+        self.q_norm = RMSNorm(hidden_size=self.head_dim,
+                              eps=pretrained_config.rms_norm_eps,
                               dtype=pretrained_config.torch_dtype,
                               has_weights=True)
-        self.k_norm = RMSNorm(hidden_size=self.head_dim,
-                              eps=1e-6,
+        self.k_norm = RMSNorm(hidden_size=self.head_dim,
+                              eps=pretrained_config.rms_norm_eps,
                               dtype=pretrained_config.torch_dtype,
                               has_weights=True)
```
112-120: Wrap the fused operator call with an availability check and clear error handling.

Improve debuggability when the custom op is missing or fails.
Apply this diff:
```diff
-        torch.ops.trtllm.fused_qk_norm_rope(
-            qkv, self.num_heads, self.num_key_value_heads,
-            self.num_key_value_heads, self.head_dim,
-            self.q_norm.variance_epsilon, self.q_norm.weight,
-            self.k_norm.weight, self.pos_embd_params.rope.theta,
-            self.pos_embd_params.is_neox, position_ids.view(-1))
+        if not hasattr(torch.ops, "trtllm") or not hasattr(torch.ops.trtllm, "fused_qk_norm_rope"):
+            raise RuntimeError("torch.ops.trtllm.fused_qk_norm_rope is not available. "
+                               "Ensure the TRT-LLM extension is built and loaded.")
+        try:
+            torch.ops.trtllm.fused_qk_norm_rope(
+                qkv, self.num_heads, self.num_key_value_heads,
+                self.num_key_value_heads, self.head_dim,
+                self.q_norm.variance_epsilon, self.q_norm.weight,
+                self.k_norm.weight, self.pos_embd_params.rope.theta,
+                self.pos_embd_params.is_neox, position_ids.view(-1))
+        except (RuntimeError, AttributeError) as e:
+            raise RuntimeError(f"Failed to apply fused_qk_norm_rope: {e}") from e
```

If helpful, I can add a lightweight unit test stub that conditionally skips when the fused op is unavailable and validates the non-fused path equivalence on small inputs, along the lines of the sketch below.
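A hedged sketch of what such a stub could look like (the availability probe mirrors the guard suggested in the diff above; the equivalence check itself is left schematic):

```python
import pytest
import torch


def fused_op_available() -> bool:
    """True if the custom fused_qk_norm_rope op is registered in this build."""
    return hasattr(torch.ops, "trtllm") and hasattr(torch.ops.trtllm, "fused_qk_norm_rope")


@pytest.mark.skipif(not fused_op_available(), reason="fused_qk_norm_rope not available")
def test_fused_and_non_fused_paths_match():
    torch.manual_seed(0)
    # Build two QKNormRoPEAttention instances that differ only in
    # fuse_qk_norm_rope, feed both the same small random QKV and position_ids,
    # and assert the outputs agree, e.g.:
    # torch.testing.assert_close(out_fused, out_unfused, rtol=1e-3, atol=1e-3)
    ...
```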
🧹 Nitpick comments (2)
tensorrt_llm/_torch/models/modeling_qwen3.py (1)
130-134: Minor readability: simplify the XOR check for inputs selection.

The XOR condition is correct but less common in codebases; since (input_ids is None) ^ (inputs_embeds is not None) is true exactly when the two None-checks agree, a more idiomatic form is equality on None-ness.
Apply this diff:
```diff
-        if (input_ids is None) ^ (inputs_embeds is not None):
+        if (input_ids is None) == (inputs_embeds is None):
             raise ValueError(
                 "You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one"
             )
```

tensorrt_llm/_torch/modules/qk_norm_attention.py (1)
28-33: Docstring could specify expected tensor shapes and execution paths.

Consider briefly documenting the expected input shapes for q, k, v in both fused and non-fused modes and how position_ids should be shaped. This reduces ambiguity for future maintainers.
I can add a short Google-style docstring with shapes and examples if you’d like.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- tensorrt_llm/_torch/models/modeling_qwen3.py (1 hunks)
- tensorrt_llm/_torch/models/modeling_qwen3_moe.py (2 hunks)
- tensorrt_llm/_torch/modules/qk_norm_attention.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/_torch/models/modeling_qwen3_moe.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
- tensorrt_llm/_torch/models/modeling_qwen3.py
- tensorrt_llm/_torch/modules/qk_norm_attention.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
- tensorrt_llm/_torch/models/modeling_qwen3.py
- tensorrt_llm/_torch/modules/qk_norm_attention.py
🧠 Learnings (2)
📚 Learning: 2025-08-12T10:28:57.288Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-12T10:28:57.288Z
Learning: Applies to **/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py} : Prepend NVIDIA copyright header (current year) to all source files
Applied to files:
tensorrt_llm/_torch/modules/qk_norm_attention.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tensorrt_llm/_torch/modules/qk_norm_attention.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (4)
tensorrt_llm/_torch/models/modeling_qwen3.py (1)
141-149: Confirm the non-Optional position_ids expectation.

Qwen3DecoderLayer.forward expects position_ids: torch.IntTensor (non-Optional), but Qwen3Model.forward accepts Optional and forwards it. Ensure position_ids is always provided at runtime; otherwise, type checkers and future refactors might miss a None case.
Would you like me to scan the repository for call sites that invoke Qwen3Model.forward without position_ids (e.g., decoder-only paths that rely on attn_metadata) and report them?
tensorrt_llm/_torch/modules/qk_norm_attention.py (3)
1-2: Header compliance: SPDX header present and correct year.

Meets the repository's requirement to prepend NVIDIA headers.
72-75: Correct rope_fusion gating relative to fuse_qk_norm_rope.

The rope_fusion=not self.fuse_qk_norm_rope setting matches the comment and prevents duplicate RoPE application when the fused QK-norm+RoPE path is used.
112-118: No action required: fused_qk_norm_rope arguments are correct.

I've confirmed in cpp/tensorrt_llm/thop/fusedQKNormRopeOp.cpp that the operator signature is:
```
fused_qk_norm_rope(
    Tensor& qkv,
    int64 num_heads_q,
    int64 num_heads_k,
    int64 num_heads_v,
    int64 head_dim,
    float eps,
    Tensor q_weight,
    Tensor k_weight,
    float base,
    bool is_neox,
    Tensor position_ids
)
```
self.num_heads,self.num_key_value_heads,self.num_key_value_headsfor q/k/v aligns exactly with that signature—and the pytest in tests/unittest/_torch/thop/test_fused_qk_norm_rope.py uses the same ordering. No changes needed.
PR_Github #15137 [ run ] completed with state

PR_Github #15814 [ run ] completed with state
c0ea121 to 18c9e8d Compare
/bot run

PR_Github #16878 [ run ] triggered by Bot

PR_Github #16878 [ run ] completed with state
18c9e8d to ca55c24 Compare
/bot run

PR_Github #16894 [ run ] triggered by Bot

PR_Github #16894 [ run ] completed with state
ca55c24 to 556a0ef Compare
/bot run

PR_Github #16974 [ run ] triggered by Bot

PR_Github #16974 [ run ] completed with state
…hanism in Qwen3 model
- Added QKNormRoPEAttention class to apply QK normalization and RoPE.
- Replaced Qwen3Attention with QKNormRoPEAttention in Qwen3DecoderLayer and Qwen3MoEDecoderLayer.
- Removed unused imports and code related to the previous attention implementation.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

- Updated QKNormRoPEAttention constructor to accept parameters directly instead of a model_config object.
- Modified Qwen3DecoderLayer and Qwen3MoEDecoderLayer to pass specific configuration values to QKNormRoPEAttention.
- Enhanced clarity and flexibility in attention layer configuration.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

- Included SPDX license headers for compliance and clarity.
- Ensured proper licensing information is present in the QKNormAttention implementation.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

- Enhanced QKNormRoPEAttention to accept positional embedding parameters.
- Inherit QKNormRoPEAttention in Qwen3Attention, specialize it for Qwen3 model.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

- Updated Gemma3Attention class to extend QKNormRoPEAttention, enhancing its functionality.
- Removed redundant QK normalization methods and integrated them into the new parent class.
- Adjusted initialization parameters to align with the new attention mechanism.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

- Updated Exaone4Attention class to extend QKNormRoPEAttention, streamlining the attention mechanism.
- Removed redundant QK normalization methods and integrated functionality into the parent class.
- Adjusted initialization parameters for improved clarity and flexibility in attention layer configuration.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

- Enhanced the logic to check for yarn usage by verifying both rope_scaling and rope_type in the configuration.
- Improved code clarity by updating comments to reflect the new checks.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
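The last commit above tightens the yarn detection; a hypothetical sketch of such a check (the attribute and key names follow common Hugging Face config conventions and are assumptions, not the PR's exact code):

```python
def uses_yarn(pretrained_config) -> bool:
    """Heuristic yarn detection from a Hugging Face-style config (illustrative only)."""
    rope_scaling = getattr(pretrained_config, "rope_scaling", None)
    if not rope_scaling:
        return False
    rope_type = rope_scaling.get("rope_type", rope_scaling.get("type"))
    return rope_type == "yarn"
```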
556a0ef to a5ec935 Compare
/bot run

PR_Github #17529 [ run ] triggered by Bot

PR_Github #17529 [ run ] completed with state
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Summary by CodeRabbit
New Features
Performance Improvements
Compatibility
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.
- --reuse-test (optional) pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- --disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- --disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.
- --skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- --stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- --gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- --test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- --only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- --disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- --add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- --post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- --detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- --debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see
docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill
kill

Kill all running builds associated with the pull request.
skip
skip --comment COMMENT

Skip testing for the latest commit on the pull request.
--comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline
reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.