Qwen3-Next support #10233
Conversation
Summary of Changes
Hello @yizhang2077, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces comprehensive support for the Qwen3 Next model, featuring a sophisticated hybrid attention architecture. The core innovation lies in seamlessly integrating both traditional full attention and efficient linear attention (based on State Space Models) within the same model. This integration is backed by highly optimized CUDA kernels and a tailored memory management system, aiming to deliver significant performance improvements. Additionally, the changes lay the groundwork for speculative decoding, further enhancing inference speed for this new model.
Highlights
- Qwen3 Next Model Integration: Full integration of the Qwen3 Next model architecture, featuring a novel hybrid attention mechanism that combines traditional full attention layers with new linear attention layers.
- Mamba/SSM Kernel Implementation: Deep integration of State Space Model (SSM) concepts, including highly optimized CUDA kernels for causal 1D convolutions and selective scan, enhancing efficiency for linear attention layers.
- Optimized Memory Management: Introduction of specialized memory pools (HybridReqToTokenPool, MambaPool, HybridLinearKVPool) designed to efficiently manage the distinct memory requirements of the hybrid attention architecture (see the sketch after this list).
- Performance Enhancements: Leveraging Triton kernels for various attention sub-operations (e.g., gated delta rule, scaled dot KKT, cumulative sum, L2 normalization, fused recurrent, fused sigmoid gating recurrent) and dual-stream processing for MoE layers to boost performance.
- Speculative Decoding Support: Extension of speculative decoding capabilities to the Qwen3 Next model, enabling faster inference through the integration of a dedicated MTP (Multi-Token Prediction) variant.
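To make the memory-management highlight more concrete, the following is a minimal, hypothetical sketch of a hybrid per-layer state layout; the class name, fields, and shapes are assumptions for illustration and do not mirror the PR's actual MambaPool / HybridLinearKVPool / HybridReqToTokenPool implementations. The point it shows: full-attention layers need a KV cache that grows with the number of cached tokens, while linear (SSM) attention layers only keep a fixed-size recurrent state per request.

```python
# Hypothetical sketch only: names and shapes are illustrative, not the PR's pools.
from dataclasses import dataclass, field

import torch


@dataclass
class HybridStateSketch:
    layer_types: list           # "full_attention" or "linear_attention" per layer
    num_kv_heads: int = 2
    head_dim: int = 128
    ssm_state_size: int = 16
    kv_cache: dict = field(default_factory=dict)     # layer_id -> [max_tokens, heads, head_dim]
    mamba_state: dict = field(default_factory=dict)  # layer_id -> [batch, heads, head_dim, state]

    def allocate(self, batch_size: int, max_tokens: int) -> None:
        for layer_id, kind in enumerate(self.layer_types):
            if kind == "full_attention":
                # Full-attention layers: KV cache scales with the cached token count.
                self.kv_cache[layer_id] = torch.zeros(
                    max_tokens, self.num_kv_heads, self.head_dim
                )
            else:
                # Linear (SSM) attention layers: constant-size recurrent state
                # per request, independent of sequence length.
                self.mamba_state[layer_id] = torch.zeros(
                    batch_size, self.num_kv_heads, self.head_dim, self.ssm_state_size
                )


# Example: a 4-layer model where every 4th layer is full attention.
pool = HybridStateSketch(["linear_attention"] * 3 + ["full_attention"])
pool.allocate(batch_size=2, max_tokens=4096)
```

Because the recurrent state does not grow with sequence length, linear-attention layers can skip paged KV allocation entirely, which is what a dedicated pool for the hybrid design is meant to enable.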
Code Review
This pull request introduces support for the Qwen3-Next model, a hybrid architecture featuring both standard and linear attention layers. The changes are extensive, including a new model configuration, model implementation, custom CUDA kernels for linear attention, and updates to memory management and scheduling to accommodate the hybrid design. My review has identified a critical issue in the model configuration that would cause a runtime error, along with a couple of medium-severity issues regarding code cleanup and potentially dead code. Overall, the core logic for the new model support seems to be in place, but some refinements are needed before merging.
```python
layer_type_list = []

for l in range(self.num_hidden_layers):
    if (l + 1) % self.full_attention_interval == 0:
```
The attribute full_attention_interval is used here but is not defined in the __init__ method of Qwen3NextConfig. This will cause an AttributeError at runtime. Please add full_attention_interval as a parameter to the __init__ method and set it as an instance attribute.
For example:
```python
# In Qwen3NextConfig.__init__
def __init__(
    self,
    ...
    mlp_only_layers=[],
    layer_types=None,
    full_attention_interval=4,  # Add with a sensible default
    **kwargs,
):
    super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs)
    ...
    self.mlp_only_layers = mlp_only_layers
    self.full_attention_interval = full_attention_interval  # Add this line
```
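To make the suggested parameter concrete, here is a small standalone illustration (a hypothetical helper, not code from this PR) of how the per-layer attention types fall out of full_attention_interval, mirroring the loop shown in the diff above:

```python
# Hypothetical helper (not part of the PR): derive per-layer attention types
# from full_attention_interval, mirroring the loop shown in the diff above.
def build_layer_types(num_hidden_layers: int, full_attention_interval: int) -> list:
    layer_types = []
    for l in range(num_hidden_layers):
        if (l + 1) % full_attention_interval == 0:
            layer_types.append("full_attention")
        else:
            layer_types.append("linear_attention")
    return layer_types


# With the suggested default of 4, every 4th layer uses full attention:
print(build_layer_types(8, 4))
# ['linear_attention', 'linear_attention', 'linear_attention', 'full_attention',
#  'linear_attention', 'linear_attention', 'linear_attention', 'full_attention']
```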
```python
def update_mamba_state_after_mtp_verify(self, accepted_length, model):
    request_number = accepted_length.shape[0]
    # QQ: step = spec num_draft token num
    num_draft_tokens = (
```
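The snippet above is only an excerpt. The idea behind it can be sketched as follows; the shapes, names, and indexing are assumptions for illustration, not the PR's implementation: during MTP verification each request carries one recurrent state per draft position, and only the state at that request's accepted length should be kept as the committed state.

```python
# Hypothetical sketch (not the PR's code): keep only the recurrent state that
# corresponds to each request's accepted draft length after MTP verification.
import torch


def keep_accepted_states(
    draft_states: torch.Tensor,     # [num_requests, num_draft_tokens, state_dim]
    accepted_length: torch.Tensor,  # [num_requests], accepted tokens per request
) -> torch.Tensor:
    request_number = accepted_length.shape[0]
    # Index of the last accepted draft position for each request.
    last_accepted = (accepted_length - 1).clamp(min=0)
    rows = torch.arange(request_number, device=draft_states.device)
    return draft_states[rows, last_accepted]  # [num_requests, state_dim]
```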
ref #10311
Does qwen3-next support PD disaggregation in sglang?
@zhyncs @yizhang2077 please review this PR, where we added MoE tuning files (since we found this is needed in practice). The performance of
Motivation
- ref #10306
- Support qwen3-next / qwen3-next-mtp
Modifications

- MambaPool / HybridReqToTokenPool to allocate the mamba cache
- HybridLinearKVPool to avoid KV cache allocation in linear attention layers
- MambaStateUpdateCudaGraphRunner to accelerate updating the mamba/conv state in the verify stage (see the sketch after this list)
Accuracy Tests
Benchmarking and Profiling
Basic model
MTP
Checklist