Qwen3-Next support by yizhang2077 · Pull Request #10233 · sgl-project/sglang · GitHub

Conversation


@yizhang2077 (Collaborator) commented Sep 9, 2025

Motivation

ref #10306
Support Qwen3-Next and Qwen3-Next-MTP.

Modifications

  1. Add MambaPool / HybridReqToTokenPool to allocate the Mamba cache (see the sketch after this list)
  2. Add HybridLinearKVPool to avoid KV cache allocation for linear-attention layers
  3. Add a hybrid linear attention backend
  4. Support the basic Qwen3-Next model
  5. Support Qwen3-Next MTP; use MambaStateUpdateCudaGraphRunner to accelerate Mamba/conv state updates in the verify stage
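
For readers skimming the review, here is a minimal sketch of how these pools fit together. It is not the actual memory_pool.py implementation; the class names mirror the PR, but all signatures, shapes, and fields here are assumptions:

import torch

class MambaPool:
    """Fixed-size pool of per-request Mamba state slots. Each slot holds
    a conv state and an SSM state; free slots are tracked by index so
    allocation is O(1). Shapes are illustrative."""

    def __init__(self, size, conv_shape, ssm_shape, device="cuda"):
        self.conv_state = torch.zeros((size, *conv_shape), device=device)
        self.ssm_state = torch.zeros((size, *ssm_shape), device=device)
        self.free_slots = list(range(size))

    def alloc(self, need):
        assert len(self.free_slots) >= need, "Mamba pool exhausted"
        slots, self.free_slots = self.free_slots[:need], self.free_slots[need:]
        return torch.tensor(slots, dtype=torch.int64)

    def free(self, slots):
        self.free_slots.extend(slots.tolist())
        self.conv_state[slots] = 0
        self.ssm_state[slots] = 0

class HybridReqToTokenPool:
    """Req-to-token pool that also hands each request one Mamba state
    slot: full-attention layers keep using per-token KV indices, while
    linear-attention layers index the MambaPool by request."""

    def __init__(self, mamba_pool):
        self.mamba_pool = mamba_pool
        self.req_to_mamba_slot = {}

    def alloc_req(self, req_id):
        self.req_to_mamba_slot[req_id] = self.mamba_pool.alloc(1)

    def free_req(self, req_id):
        self.mamba_pool.free(self.req_to_mamba_slot.pop(req_id))

Keying the Mamba state by request rather than by token is what lets the linear-attention layers skip per-token KV allocation entirely, which is the point of HybridLinearKVPool in item 2.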

Accuracy Tests

python3 benchmark/gsm8k/bench_sglang.py  --num-questions 1000
Accuracy: 0.945
Invalid: 0.000
Latency: 113.899 s
Output throughput: 1470.560 token/s

Benchmarking and Profiling

Basic model

TP4 H100
python3 -m sglang.launch_server --model Qwen/Qwen3-Next-80B-A3B-Instruct --tp 4
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|    |   max_concurrency |   input_throughput |   output_throughput |   mean_ttft_ms |   median_ttft_ms |   p99_ttft_ms |   mean_tpot_ms |   median_tpot_ms |   p99_tpot_ms |   per_user_throughput |
+====+===================+====================+=====================+================+==================+===============+================+==================+===============+=======================+
|  0 |             1.000 |            183.788 |             183.860 |        113.893 |          112.008 |       129.718 |          5.330 |            5.329 |         5.331 |               183.860 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  1 |             4.000 |            590.151 |             589.918 |        197.684 |          205.081 |       380.452 |          6.580 |            6.559 |         6.773 |               147.480 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  2 |            16.000 |           1616.684 |            1617.382 |        232.908 |          202.873 |       522.421 |          9.630 |            9.698 |        10.238 |               101.086 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  3 |            32.000 |           2441.358 |            2443.195 |        252.984 |          206.874 |       602.096 |         12.814 |           12.977 |        14.023 |                76.350 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  4 |            64.000 |           3654.720 |            3653.131 |        342.450 |          283.910 |      1065.410 |         17.126 |           17.214 |        18.953 |                57.080 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  5 |           128.000 |           5639.235 |            5638.653 |        512.799 |          351.112 |      2012.287 |         22.148 |           22.275 |        24.715 |                44.052 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  6 |           256.000 |           3430.466 |            3430.712 |        738.410 |          384.126 |      3864.820 |         73.815 |           74.116 |        77.129 |                13.401 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
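
(The PR does not include the command that produced this table; the columns suggest a sweep over concurrency levels with SGLang's serving benchmark. A hedged example of one such run, where all flags are assumptions:

python3 -m sglang.bench_serving --backend sglang --num-prompts 512 --max-concurrency 16
)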

MTP

H100

# without mtp
python3 -m sglang.launch_server --model Qwen/Qwen3-Next-80B-A3B-Instruct --tp 4
python3 -m sglang.test.send_one
acc_length=1.00
speed=180.20 token/s

# with mtp
python3 -m sglang.launch_server --model Qwen/Qwen3-Next-80B-A3B-Instruct --tp 4 --speculative-num-steps 3  --speculative-eagle-topk 1  --speculative-num-draft-tokens 4 --speculative-algo NEXTN
python3 -m sglang.test.send_one
acc_length=3.32
speed=304.56 token/s
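
For context on these numbers: acc_length=3.32 means each verify step accepts 3.32 tokens on average (out of the 4 draft tokens per step configured above), giving an end-to-end decode speedup of roughly 304.56 / 180.20 ≈ 1.69x. The realized speedup is lower than the raw acceptance length because each step also pays the cost of drafting and verification.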


@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Summary of Changes

Hello @yizhang2077, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive support for the Qwen3 Next model, featuring a sophisticated hybrid attention architecture. The core innovation lies in seamlessly integrating both traditional full attention and efficient linear attention (based on State Space Models) within the same model. This integration is backed by highly optimized CUDA kernels and a tailored memory management system, aiming to deliver significant performance improvements. Additionally, the changes lay the groundwork for speculative decoding, further enhancing inference speed for this new model.

Highlights

  • Qwen3 Next Model Integration: Full integration of the Qwen3 Next model architecture, featuring a novel hybrid attention mechanism that combines traditional full attention layers with new linear attention layers.
  • Mamba/SSM Kernel Implementation: Deep integration of State Space Model (SSM) concepts, including highly optimized CUDA kernels for causal 1D convolutions and selective scan, enhancing efficiency for linear attention layers.
  • Optimized Memory Management: Introduction of specialized memory pools (HybridReqToTokenPool, MambaPool, HybridLinearKVPool) designed to efficiently manage the distinct memory requirements of the hybrid attention architecture.
  • Performance Enhancements: Leveraging Triton kernels for various attention sub-operations (e.g., gated delta rule, scaled dot KKT, cumulative sum, L2 normalization, fused recurrent, fused sigmoid gating recurrent) and dual-stream processing for MoE layers to boost performance.
  • Speculative Decoding Support: Extension of speculative decoding capabilities to the Qwen3 Next model, enabling faster inference through the integration of a dedicated MTP (Multi-Token Prediction) variant.
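
Of these, the gated delta rule is the core recurrence behind the linear-attention layers. Below is a naive per-token reference for one common formulation of it; this is not the fused Triton kernel in the PR, and the gating and shape conventions are assumptions:

import torch

def gated_delta_rule_reference(q, k, v, beta, g):
    """Naive O(T * d^2) reference for a gated delta-rule recurrence.
    q, k, v: [T, d] for a single head; beta, g: [T] gates in (0, 1).
    S is a [d, d] fast-weight state; the fused kernels compute the
    same recurrence in parallel chunks."""
    T, d = q.shape
    S = torch.zeros(d, d, dtype=q.dtype, device=q.device)
    out = torch.empty_like(v)
    for t in range(T):
        S = g[t] * S                                      # gated decay of the state
        pred = S.T @ k[t]                                 # what the state predicts for key k_t
        S = S + beta[t] * torch.outer(k[t], v[t] - pred)  # rank-1 delta-rule correction
        out[t] = S.T @ q[t]                               # read out with the query
    return out

For example, out = gated_delta_rule_reference(torch.randn(8, 16), torch.randn(8, 16), torch.randn(8, 16), torch.rand(8), torch.rand(8)) runs the recurrence over an 8-token sequence with head dimension 16.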

@yizhang2077 yizhang2077 changed the title [WIP] Qwen3 Next [WIP] Qwen3-Next support Sep 9, 2025
@yizhang2077 yizhang2077 marked this pull request as draft September 9, 2025 16:24
@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces support for the Qwen3-Next model, a hybrid architecture featuring both standard and linear attention layers. The changes are extensive, including a new model configuration, model implementation, custom CUDA kernels for linear attention, and updates to memory management and scheduling to accommodate the hybrid design. My review has identified a critical issue in the model configuration that would cause a runtime error, along with a couple of medium-severity issues regarding code cleanup and potentially dead code. Overall, the core logic for the new model support seems to be in place, but some refinements are needed before merging.

layer_type_list = []

for l in range(self.num_hidden_layers):
    if (l + 1) % self.full_attention_interval == 0:

critical

The attribute full_attention_interval is used here but is not defined in the __init__ method of Qwen3NextConfig. This will cause an AttributeError at runtime. Please add full_attention_interval as a parameter to the __init__ method and set it as an instance attribute.

For example:

# In Qwen3NextConfig.__init__
...
        mlp_only_layers=[],
        layer_types=None,
        full_attention_interval=4,  # Add with a sensible default
        **kwargs,
    ):
        super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs)
        ...
        self.mlp_only_layers = mlp_only_layers
        self.full_attention_interval = full_attention_interval # Add this line

def update_mamba_state_after_mtp_verify(self, accepted_length, model):
    request_number = accepted_length.shape[0]
    # QQ: step = spec num_draft token num
    num_draft_tokens = (

medium

This comment appears to be a temporary developer note. Please remove it for better code clarity.

@@ -0,0 +1,182 @@
import logging

medium

This newly added file appears to be unused in the pull request. The logic for Mamba cache management seems to be implemented in MambaPool and HybridReqToTokenPool within memory_pool.py. If this file is indeed dead code, please consider removing it to improve maintainability.

@yizhang2077 yizhang2077 changed the title [WIP] Qwen3-Next support Qwen3-Next support Sep 11, 2025
@yizhang2077 yizhang2077 marked this pull request as ready for review September 11, 2025 07:12
@zhyncs (Member) commented Sep 11, 2025

ref #10311

@zhyncs zhyncs merged commit 30c6e1f into main Sep 11, 2025
3 of 51 checks passed
@zhyncs zhyncs deleted the qwen3_next branch September 11, 2025 11:11
@Betelgeu

Does Qwen3-Next support PD (prefill-decode) disaggregation in SGLang?

@yiakwy-xpu-ml-framework-team (Contributor) commented Sep 23, 2025

@zhyncs @yizhang2077 please review this PR, which adds MoE tuning files (we found they are needed in practice).

The throughput of python3 -m sglang.test.send_one improved from 130 tok/s to 160 tok/s.

#10794
