optimize update verify by yizhang2077 · Pull Request #10277 · sgl-project/sglang · GitHub

Conversation

@yizhang2077 (Collaborator) commented on Sep 10, 2025

Motivation

Optimize the decode and target-verify paths for Qwen3-Next.

python3 -m sglang.test.send_one

Before: speed=148.46 token/s
After:  speed=159.06 token/s

Accuracy: 0.955
Invalid: 0.000
Latency: 43.600 s
Output throughput: 740.707 token/s
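
The accuracy block above matches the output format of sglang's GSM8K benchmark; a possible invocation, assuming that script produced these numbers (the path and flags are an assumption, not stated in this PR):

    python3 benchmark/gsm8k/bench_sglang.py --num-questions 1319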

MTP 3/1/4 (multi-token prediction)

acc_length=3.32
speed=289.26 token/s
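
For reference, a hedged server-launch sketch: "3/1/4" most likely denotes the speculative-decoding configuration steps=3, topk=1, draft-tokens=4. The model path and exact flags below are assumptions for illustration, not taken from this PR:

    python3 -m sglang.launch_server \
      --model-path Qwen/Qwen3-Next-80B-A3B-Instruct \
      --speculative-algorithm NEXTN \
      --speculative-num-steps 3 \
      --speculative-eagle-topk 1 \
      --speculative-num-draft-tokens 4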


@gemini-code-assist (bot) left a comment

Summary of Changes

Hello @yizhang2077, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on enhancing the inference performance of the Qwen3 model. It achieves this by restructuring how certain attention-related tensors are processed, leading to a measurable increase in token throughput. The changes are designed to optimize the underlying computations, particularly for CUDA graph execution paths.

Highlights

  • Performance Optimization: Improved token processing speed from 148.46 token/s to 159.06 token/s by optimizing decode and target verification in the Qwen3 model.
  • Code Refactoring: Refactored the fused_qkvzba_split_reshape_cat_kernel and fused_qkvzba_split_reshape_cat functions to handle the QKVZ and BA components separately, improving modularity and leaving room for further optimization (see the sketch after this list).
  • CUDA Graph Integration: Updated the forward method to leverage the newly optimized fused operation specifically when executing within a CUDA graph, ensuring performance benefits are applied in relevant scenarios.
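
As a rough illustration of the split described in the refactoring bullet, here is a minimal PyTorch sketch. All names, head counts, and packing layouts here are hypothetical; the actual implementation is a fused Triton kernel that now takes the QKVZ and BA projections as separate inputs.

    import torch

    def split_qkvz_ba(mixed_qkvz, mixed_ba, num_heads, head_k_dim, head_v_dim):
        # Hypothetical packing: [q | k | v | z] per head in mixed_qkvz,
        # and [b | a] per head in mixed_ba.
        # mixed_qkvz: [tokens, num_heads * (2 * head_k_dim + 2 * head_v_dim)]
        # mixed_ba:   [tokens, num_heads * 2]
        qkvz = mixed_qkvz.view(-1, num_heads, 2 * head_k_dim + 2 * head_v_dim)
        q, k, v, z = torch.split(
            qkvz, [head_k_dim, head_k_dim, head_v_dim, head_v_dim], dim=-1
        )
        b, a = mixed_ba.view(-1, num_heads, 2).unbind(dim=-1)
        return q, k, v, z, b, a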

@gemini-code-assist (bot) left a comment

Code Review

This pull request optimizes the fused_qkvzba_split_reshape_cat operation by splitting the input tensor and enabling the fused Triton kernel for more cases, including TARGET_VERIFY mode. This is a good optimization that, according to the PR description, improves performance. The changes are well-contained and the logic is sound. I have one minor stylistic suggestion.
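
A hedged reconstruction of the dispatch the review describes, using an illustrative ForwardMode enum (the real sglang enum and call sites differ; only the gating idea is taken from the review text):

    from enum import Enum, auto

    class ForwardMode(Enum):
        # Illustrative subset of sglang's forward modes.
        EXTEND = auto()
        DECODE = auto()
        TARGET_VERIFY = auto()

    def use_fused_triton_path(mode: ForwardMode, in_cuda_graph: bool) -> bool:
        # Per the review: the fused split/reshape/cat kernel now also covers
        # TARGET_VERIFY when replaying inside a CUDA graph; previously only
        # the decode path qualified.
        if mode is ForwardMode.DECODE:
            return True
        return mode is ForwardMode.TARGET_VERIFY and in_cuda_graph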

@hebiao064 merged commit 3a22e31 into qwen3_next on Sep 10, 2025 (1 check failed).
@hebiao064 deleted the qwen3_next_opt branch on Sep 10, 2025 at 18:33.
yizhang2077 added a commit referencing this pull request on Sep 11, 2025:
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
