[FIX] Don't initialize parameter by default by zhuohan123 · Pull Request #1067 · vllm-project/vllm · GitHub

Conversation

@zhuohan123
Member

Without this fix, many models cannot run. For example, gpt2 will run the default initializer on its vocab parameters, and the assert that follows immediately fails.
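A minimal sketch of the idea (the class below is illustrative, not the actual vLLM diff): allocate parameters as uninitialized tensors and let checkpoint loading fill them in, instead of running a default random initializer.

```python
# Sketch only: skip default parameter initialization, since weight loading
# overwrites the values anyway.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VocabEmbedding(nn.Module):
    """Hypothetical vocab embedding; not the vLLM class."""

    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        # torch.empty allocates without initializing; running a random
        # init here would only waste time before the checkpoint is loaded.
        self.weight = nn.Parameter(torch.empty(vocab_size, hidden_size))

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return F.embedding(input_ids, self.weight)
```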

Collaborator

@WoosukKwon WoosukKwon left a comment

LGTM. Let's actually remove the parameters when refactoring.

@zhuohan123 zhuohan123 merged commit 90979c3 into main Sep 18, 2023
@zhuohan123 zhuohan123 deleted the no-params-init branch October 16, 2023 21:02
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
yiliu30 pushed a commit to yiliu30/vllm-fork that referenced this pull request Apr 15, 2025
…llm-project#1067)

Co-authored-by: Iryna Boiko <iboiko@habana.ai>
Co-authored-by: Michał Kuligowski <mkuligowski@habana.ai>
amy-why-3459 pushed a commit to amy-why-3459/vllm that referenced this pull request Sep 15, 2025
[CI] MoE alltoall communication optimization
The DeepSeek V3/R1 model has 256 routed experts. During parallel
inference, if one EP rank carries a high load, the overall communication
and computation time is slowed down, so uneven load distribution becomes
a weakness of parallel inference. The data volume in the prefill phase is
large, and per-card communication and computation time are closely tied
to that volume. A small, non-linear precision loss can therefore be
traded for a near-linear performance improvement.

During parallel inference, communication implies global synchronization:
cards with low load finish their computation first and then wait for the
most heavily loaded card, so when the load is unbalanced, the busiest
card dominates the overall time. Significant performance gains can be
achieved by discarding a small number of tokens. This is unacceptable in
some precision-sensitive scenarios, but, much like quantization, it
trades an acceptable precision loss for performance where that trade is
acceptable. The balance between performance and precision can also be
tuned by configuring the proportion of discarded tokens, as sketched
below.
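The configurable drop proportion can be realized as a per-expert capacity cap. The sketch below is illustrative only (the function name, signature, and capacity formula are assumptions, not the code in this commit): it keeps at most a fixed number of routed slots per expert and drops the overflow, which bounds the work and alltoall traffic on the most heavily loaded EP rank.

```python
import torch
import torch.nn.functional as F


def capacity_mask(topk_expert_ids: torch.Tensor,
                  num_experts: int,
                  capacity_factor: float = 1.25) -> torch.Tensor:
    """Keep-mask over routing slots; overflow beyond each expert's capacity is dropped."""
    flat = topk_expert_ids.reshape(-1)                      # (num_slots,)
    # Balanced share per expert, scaled by the configured factor.
    capacity = max(int(capacity_factor * flat.numel() / num_experts), 1)
    one_hot = F.one_hot(flat, num_experts)                  # (num_slots, num_experts)
    # Exclusive running count of earlier slots routed to the same expert.
    position_in_expert = one_hot.cumsum(dim=0) - one_hot
    rank = (position_in_expert * one_hot).sum(dim=-1)       # (num_slots,)
    return (rank < capacity).reshape(topk_expert_ids.shape)


# Example: 8 tokens, top-2 routing over 4 experts.
topk_ids = torch.randint(0, 4, (8, 2))
keep = capacity_mask(topk_ids, num_experts=4, capacity_factor=1.0)
# Slots with keep == False skip expert dispatch, so no single EP rank
# can be overloaded far beyond the balanced share.
```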

We tested on A3 with batch size 8 (B), prompt length 3.5K tokens (S),
and the following parallel configuration: AttnDP=2, AttnTP=8, MoeTP=1,
MoeEP=16. In this scenario we measured a 10%-15% performance gain.

In the next version, we will add an alltoallv MoE.

---------

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>