Qwen3-Next support #10233
Conversation
Summary of Changes
Hello @yizhang2077, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces comprehensive support for the Qwen3 Next model, featuring a sophisticated hybrid attention architecture. The core innovation lies in seamlessly integrating both traditional full attention and efficient linear attention (based on State Space Models) within the same model. This integration is backed by highly optimized CUDA kernels and a tailored memory management system, aiming to deliver significant performance improvements. Additionally, the changes lay the groundwork for speculative decoding, further enhancing inference speed for this new model.
Highlights
- Qwen3 Next Model Integration: Full integration of the Qwen3 Next model architecture, featuring a novel hybrid attention mechanism that combines traditional full attention layers with new linear attention layers.
- Mamba/SSM Kernel Implementation: Deep integration of State Space Model (SSM) concepts, including highly optimized CUDA kernels for causal 1D convolutions and selective scan, enhancing efficiency for linear attention layers.
- Optimized Memory Management: Introduction of specialized memory pools (HybridReqToTokenPool, MambaPool, HybridLinearKVPool) designed to efficiently manage the distinct memory requirements of the hybrid attention architecture (see the sketch after this list).
- Performance Enhancements: Leveraging Triton kernels for various attention sub-operations (e.g., gated delta rule, scaled dot KKT, cumulative sum, L2 normalization, fused recurrent, fused sigmoid gating recurrent) and dual-stream processing for MoE layers to boost performance.
- Speculative Decoding Support: Extension of speculative decoding capabilities to the Qwen3 Next model, enabling faster inference through the integration of a dedicated MTP (Multi-Token Prediction) variant.
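To make the memory-management highlight more concrete, the following is a minimal, hypothetical sketch of a hybrid per-layer state layout; the class name, fields, and shapes are assumptions for illustration and do not mirror the PR's actual MambaPool / HybridLinearKVPool / HybridReqToTokenPool implementations. The point it shows: full-attention layers need a KV cache that grows with the number of cached tokens, while linear (SSM) attention layers only keep a fixed-size recurrent state per request.

```python
# Hypothetical sketch only: names and shapes are illustrative, not the PR's pools.
from dataclasses import dataclass, field

import torch


@dataclass
class HybridStateSketch:
    layer_types: list           # "full_attention" or "linear_attention" per layer
    num_kv_heads: int = 2
    head_dim: int = 128
    ssm_state_size: int = 16
    kv_cache: dict = field(default_factory=dict)     # layer_id -> [max_tokens, heads, head_dim]
    mamba_state: dict = field(default_factory=dict)  # layer_id -> [batch, heads, head_dim, state]

    def allocate(self, batch_size: int, max_tokens: int) -> None:
        for layer_id, kind in enumerate(self.layer_types):
            if kind == "full_attention":
                # Full-attention layers: KV cache scales with the cached token count.
                self.kv_cache[layer_id] = torch.zeros(
                    max_tokens, self.num_kv_heads, self.head_dim
                )
            else:
                # Linear (SSM) attention layers: constant-size recurrent state
                # per request, independent of sequence length.
                self.mamba_state[layer_id] = torch.zeros(
                    batch_size, self.num_kv_heads, self.head_dim, self.ssm_state_size
                )


# Example: a 4-layer model where every 4th layer is full attention.
pool = HybridStateSketch(["linear_attention"] * 3 + ["full_attention"])
pool.allocate(batch_size=2, max_tokens=4096)
```

Because the recurrent state does not grow with sequence length, linear-attention layers can skip paged KV allocation entirely, which is what a dedicated pool for the hybrid design is meant to enable.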
Code Review
This pull request introduces support for the Qwen3-Next model, a hybrid architecture featuring both standard and linear attention layers. The changes are extensive, including a new model configuration, model implementation, custom CUDA kernels for linear attention, and updates to memory management and scheduling to accommodate the hybrid design. My review has identified a critical issue in the model configuration that would cause a runtime error, along with a couple of medium-severity issues regarding code cleanup and potentially dead code. Overall, the core logic for the new model support seems to be in place, but some refinements are needed before merging.
```python
layer_type_list = []

for l in range(self.num_hidden_layers):
    if (l + 1) % self.full_attention_interval == 0:
```
The attribute full_attention_interval is used here but is not defined in the __init__ method of Qwen3NextConfig. This will cause an AttributeError at runtime. Please add full_attention_interval as a parameter to the __init__ method and set it as an instance attribute.
For example:
```python
# In Qwen3NextConfig.__init__
def __init__(
    self,
    ...
    mlp_only_layers=[],
    layer_types=None,
    full_attention_interval=4,  # Add with a sensible default
    **kwargs,
):
    super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs)
    ...
    self.mlp_only_layers = mlp_only_layers
    self.full_attention_interval = full_attention_interval  # Add this line
```
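To make the suggested parameter concrete, here is a small standalone illustration (a hypothetical helper, not code from this PR) of how the per-layer attention types fall out of full_attention_interval, mirroring the loop shown in the diff above:

```python
# Hypothetical helper (not part of the PR): derive per-layer attention types
# from full_attention_interval, mirroring the loop shown in the diff above.
def build_layer_types(num_hidden_layers: int, full_attention_interval: int) -> list:
    layer_types = []
    for l in range(num_hidden_layers):
        if (l + 1) % full_attention_interval == 0:
            layer_types.append("full_attention")
        else:
            layer_types.append("linear_attention")
    return layer_types


# With the suggested default of 4, every 4th layer uses full attention:
print(build_layer_types(8, 4))
# ['linear_attention', 'linear_attention', 'linear_attention', 'full_attention',
#  'linear_attention', 'linear_attention', 'linear_attention', 'full_attention']
```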
```python
def update_mamba_state_after_mtp_verify(self, accepted_length, model):
    request_number = accepted_length.shape[0]
    # QQ: step = spec num_draft token num
    num_draft_tokens = (
```
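The snippet above is only an excerpt. The idea behind it can be sketched as follows; the shapes, names, and indexing are assumptions for illustration, not the PR's implementation: during MTP verification each request carries one recurrent state per draft position, and only the state at that request's accepted length should be kept as the committed state.

```python
# Hypothetical sketch (not the PR's code): keep only the recurrent state that
# corresponds to each request's accepted draft length after MTP verification.
import torch


def keep_accepted_states(
    draft_states: torch.Tensor,     # [num_requests, num_draft_tokens, state_dim]
    accepted_length: torch.Tensor,  # [num_requests], accepted tokens per request
) -> torch.Tensor:
    request_number = accepted_length.shape[0]
    # Index of the last accepted draft position for each request.
    last_accepted = (accepted_length - 1).clamp(min=0)
    rows = torch.arange(request_number, device=draft_states.device)
    return draft_states[rows, last_accepted]  # [num_requests, state_dim]
```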
ref #10311
Does qwen3-next support PD disaggregation in sglang?
@zhyncs @yizhang2077 please review this PR, where we added MoE tuning files (since we found this is needed in practice). The performance of
Motivation
- ref #10306
- Support qwen3-next / qwen3-next-mtp
Modifications

- MambaPool / HybridReqToTokenPool to allocate the mamba cache
- HybridLinearKVPool to avoid KV cache allocation in linear attention layers
- MambaStateUpdateCudaGraphRunner to accelerate updating the mamba/conv state in the verify stage (see the sketch after this list)
Accuracy Tests
Benchmarking and Profiling
Basic model
MTP
Checklist