Allocate more shared memory to attention kernel #1154
Conversation
hi @Yard1, I have a question here: if I use dtype=float16 for model inference, will changing the logits buffer from float32 to float16 to support a longer context affect accuracy?

I am not sure, @WoosukKwon would know best.
@Yard1 Thanks for the quick fix! I was a bit worried about performance since we manually adjusted the shared memory size, but it seems performance is not affected by the fix. 👍
Left some questions and comments. Please take a look.
tests/kernels/test_attention.py (outdated)

    MAX_SEQ_LEN = 8192
    float_bytes = torch.finfo(torch.float).bits / 8
    # This will change depending on the compute capability.
    # -7 as it will be padded to 8 anyway, -512 as a buffer
Could you elaborate more on this?

    # -7 as it will be padded to 8 anyway
A quick question: How did you choose 512 for the buffer size?
I think I got (64 + 256) * sizeof(float32) for other __shared__ variables by reading the CUDA kernel, so I just rounded it up to 512 * sizeof(float32) to be safe. But it may be too conservative.
@WoosukKwon would appreciate it if you could provide a more accurate measurement :)
@Yard1 Do you mean

    __shared__ Q_vec q_vecs[THREAD_GROUP_SIZE][NUM_VECS_PER_THREAD];

and

    __shared__ float red_smem[2 * NUM_WARPS];

?
Yes, they are included in the shared memory usage; we should set the buffer to an upper bound of those.
If my calculation is correct, the size of q_vecs is head_size * sizeof(scalar_t) <= 256 * 4 = 1024 bytes. The size of red_smem is obviously 64 * 4 = 256 bytes. In total, that's 1280 bytes (= 320 float elements). So 512 is actually a somewhat conservative upper bound. However, I think this is acceptable.
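Putting these numbers together, here is a minimal standalone sketch (not vLLM's actual launcher; the 512-float buffer and the pad-to-a-multiple-of-8 adjustment are taken from the comments quoted above) of how the maximum float32-logits context length follows from the per-block shared memory limit:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  // Per-block shared memory after opting in; on compute capability >= 7.0
  // this is larger than the default 48 KB and is device dependent.
  int device = 0;
  int max_shared_mem = 0;
  cudaDeviceGetAttribute(&max_shared_mem,
                         cudaDevAttrMaxSharedMemoryPerBlockOptin, device);

  // Upper bound on the statically declared __shared__ arrays, as computed
  // above: q_vecs <= 256 * 4 = 1024 bytes and red_smem = 64 * 4 = 256 bytes,
  // i.e. 1280 bytes (320 floats), rounded up to a 512-float buffer.
  const int buffer_floats = 512;

  // -7 because the context length is padded up to a multiple of 8; reserving
  // 7 extra slots guarantees the padded length still fits.
  int max_context_len =
      max_shared_mem / static_cast<int>(sizeof(float)) - buffer_floats - 7;

  printf("Max context length with float32 logits: %d tokens\n",
         max_context_len);
  return 0;
}
```

With the default 48 KB limit this works out to roughly 11.7k tokens, which is roughly in line with the ~11k figure mentioned in the PR description below.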
vllm/utils.py (outdated)

    # Follows the logic in
    # attention_kernels.cu::single_query_cached_kv_attention_launcher
    max_shared_mem = get_max_shared_mem_bytes()
    float32_bytes = torch.finfo(torch.float).bits // 8
Isn't this always 4?
It technically should be, but writing it this way ensures it holds irrespective of the platform/implementation, and it is also self-documenting.
To my knowledge, the size of float is defined by the IEEE 754 standard and is independent of the underlying machine architecture (unlike integer types). That being said, I like that this is self-documenting. Let's keep it!
@Yard1 LGTM! Thanks again for the PR! I left some very minor style comments. Please fix them before merging.
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
@Yard1 Great! I tested a long prompt using this PR. It hasn't crashed so far. max_seq_len: 16384,

Hi @esmeetu, thanks for reporting the issue. I think that's related to how to set

@WoosukKwon I didn't get what you mean about how to set that parameter. Isn't it set by the scheduler config? 🤔
Makes use of additional shared memory present on compute capability >=7.0 cards to support longer context length in the attention kernel.
See https://stackoverflow.com/questions/63757245/using-maximum-shared-memory-in-cuda for details.
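For context, a minimal sketch of that opt-in follows (the kernel below is just a stub standing in for the real attention kernel, not vLLM's implementation): on compute capability >= 7.0, a launch may request more than the default 48 KB of dynamic shared memory only after cudaFuncSetAttribute raises the kernel's limit.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Stub kernel standing in for the attention kernel; it only touches the
// dynamically allocated shared memory buffer (the logits in the real kernel).
__global__ void attention_kernel_stub(float* out) {
  extern __shared__ float logits[];
  logits[threadIdx.x] = static_cast<float>(threadIdx.x);
  __syncthreads();
  out[threadIdx.x] = logits[threadIdx.x];
}

int main() {
  int device = 0;
  int max_shared_mem = 0;
  cudaDeviceGetAttribute(&max_shared_mem,
                         cudaDevAttrMaxSharedMemoryPerBlockOptin, device);

  // Without this opt-in, any launch requesting more than 48 KB of dynamic
  // shared memory fails, even on cards that physically have more.
  cudaFuncSetAttribute(attention_kernel_stub,
                       cudaFuncAttributeMaxDynamicSharedMemorySize,
                       max_shared_mem);

  float* out = nullptr;
  cudaMalloc(&out, 128 * sizeof(float));
  attention_kernel_stub<<<1, 128, max_shared_mem>>>(out);
  cudaDeviceSynchronize();

  printf("Opted in to %d bytes of shared memory per block\n", max_shared_mem);
  cudaFree(out);
  return 0;
}
```

The PR applies the same idea to the attention kernel so that the float32 logits buffer can grow beyond the default limit.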
As pointed out by @WoosukKwon offline, ideally we would also store logits inside the kernel in float16 instead of float32 as the accuracy loss should be minimal. This will enable even longer context lengths.
Note that the buffer of 512 * sizeof(float32) may be too conservative, but this will still result in more supported tokens than the previous ~11k. The attention test has been run successfully on A10 and A100.
With this PR, the supported context lengths with current kernel (float32 logits) will be:
Closes #905