Automatically set `max_num_batched_tokens` #1198

WoosukKwon · 2023-09-27T16:27:28Z

This PR removes the default value (2560) of max_num_batched_tokens and sets it based on the model's maximum length.
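The rule described above can be sketched as follows. This is a minimal illustration, not the code merged in this PR. The function name `resolve_max_num_batched_tokens`, the `floor` parameter and its value of 2048, and the validation of an explicit value are all assumptions made for the example.

```python
from typing import Optional

# Assumed lower bound for illustration only; the PR description does not state one.
ASSUMED_MIN_BATCHED_TOKENS = 2048


def resolve_max_num_batched_tokens(
    max_model_len: int,
    user_value: Optional[int] = None,
    floor: int = ASSUMED_MIN_BATCHED_TOKENS,
) -> int:
    """Pick a batched-token budget from the model's maximum length.

    Hypothetical helper: an explicit user value wins; otherwise the budget
    is derived from max_model_len instead of a fixed constant such as 2560.
    """
    if max_model_len <= 0:
        raise ValueError("max_model_len must be positive")

    if user_value is not None:
        # A budget smaller than the model length could never fit a
        # full-length prompt into a single batch, so reject it.
        if user_value < max_model_len:
            raise ValueError(
                f"max_num_batched_tokens ({user_value}) is smaller than "
                f"max_model_len ({max_model_len})"
            )
        return user_value

    # No explicit value: derive the budget from the model length, keeping
    # an assumed floor so short-context models still get sizeable batches.
    return max(max_model_len, floor)
```

Under these assumptions, a model with a 4096-token context gets a budget of 4096 without any user configuration, whereas a fixed default of 2560 would be too small to hold one full-length prompt.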

WoosukKwon · 2023-09-27T16:38:28Z

@zhuohan123 Sorry for the commits after requesting your review. Now the PR is ready for review!

zhuohan123

LGTM! Thanks for the fix!

This PR chooses the HPU device according to the local rank instead of taking the first available one. Choosing the first available HPU results in:

- a random mapping between `local_rank` and `module_id`, because the process for each rank starts in random order;
- failure to select the devices specified with `HABANA_VISIBLE_MODULES`.

The random mapping may cause cross-NUMA access in inter-node pipeline parallelism and cross-group HCCL calls on the PCIe SKU.

- [x] [vllm-project#1135](HabanaAI#1135)
- [x] [vllm-project#1149](HabanaAI#1149)
- [x] [vllm-project#1198](HabanaAI#1198)
- [x] demo_proxy.py

---------

Signed-off-by: zhenwei <zhenweiliu@habana.ai>
Co-authored-by: Youlei Yang <youlei.yang@intel.com>

WoosukKwon added 2 commits September 27, 2023 16:23

Automatically set max_num_batched_tokens

26cabdc

Minor

ca9b358

WoosukKwon requested a review from zhuohan123 September 27, 2023 16:27

WoosukKwon added 2 commits September 27, 2023 16:29

Minor

c155430

Fix dtype

4f715d6

WoosukKwon linked an issue Sep 27, 2023 that may be closed by this pull request

Automatically configure max_num_batched_tokens based on model length #1189

Closed

WoosukKwon mentioned this pull request Sep 27, 2023

[v0.2.0] Release Tracker #1089

Closed

5 tasks

zhuohan123 approved these changes Sep 27, 2023

WoosukKwon merged commit a19bc5c into main Sep 27, 2023

WoosukKwon deleted the auto-max-batch branch September 27, 2023 23:34

WoosukKwon mentioned this pull request Sep 27, 2023

Automatically configure max_num_batched_tokens based on model length #1189

Closed

yunfeng-scale mentioned this pull request Oct 24, 2023

llama should have None max length scaleapi/llm-engine#348

Merged

katitizhou mentioned this pull request Nov 16, 2023

benchmark_latency.py will hang when --batchsize=1 and --n=2 #1658

Closed

hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024

Automatically configure max_num_batched_tokens (vllm-project#1198)

69decbd
