Allocate more shared memory to attention kernel #1154
Conversation
hi @Yard1, I have a question here: if I use dtype=float16 for model inference, will changing the logits buffer from float32 to float16 to support a longer context affect accuracy?

I am not sure, @WoosukKwon would know best.
@Yard1 Thanks for the quick fix! I was a bit worried about performance since we manually adjusted the shared memory size, but it seems performance is not affected by the fix. 👍
Left some questions and comments. Please take a look.
tests/kernels/test_attention.py (outdated)

    MAX_SEQ_LEN = 8192
    float_bytes = torch.finfo(torch.float).bits / 8
    # This will change depending on the compute capability.
    # -7 as it will be padded to 8 anyway, -512 as a buffer
Could you elaborate more on this?

    # -7 as it will be padded to 8 anyway
A quick question: How did you choose 512 for the buffer size?
I think I got (64 + 256) * sizeof(float32) for other __shared__ variables by reading the CUDA kernel, so I just rounded it up to 512 * sizeof(float32) to be safe. But it may be too conservative.
@WoosukKwon would appreciate it if you could provide a more accurate measurement :)
@Yard1 Do you mean

    __shared__ Q_vec q_vecs[THREAD_GROUP_SIZE][NUM_VECS_PER_THREAD];

and

    __shared__ float red_smem[2 * NUM_WARPS];

?
Yes, they are included in the shared memory usage; we should set the buffer to an upper bound of those.
If my calculation is correct, the size of q_vecs is head_size * sizeof(scalar_t) <= 256 * 4 = 1024 bytes. The size of red_smem is obviously 64 * 4 = 256 bytes. In total, that's 1280 bytes (= 320 float elements). So 512 is actually a somewhat conservative upper bound. However, I think this is acceptable.
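Putting these numbers together, here is a minimal standalone sketch (not vLLM's actual launcher; the 512-float buffer and the pad-to-a-multiple-of-8 adjustment are taken from the comments quoted above) of how the maximum float32-logits context length follows from the per-block shared memory limit:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  // Per-block shared memory after opting in; on compute capability >= 7.0
  // this is larger than the default 48 KB and is device dependent.
  int device = 0;
  int max_shared_mem = 0;
  cudaDeviceGetAttribute(&max_shared_mem,
                         cudaDevAttrMaxSharedMemoryPerBlockOptin, device);

  // Upper bound on the statically declared __shared__ arrays, as computed
  // above: q_vecs <= 256 * 4 = 1024 bytes and red_smem = 64 * 4 = 256 bytes,
  // i.e. 1280 bytes (320 floats), rounded up to a 512-float buffer.
  const int buffer_floats = 512;

  // -7 because the context length is padded up to a multiple of 8; reserving
  // 7 extra slots guarantees the padded length still fits.
  int max_context_len =
      max_shared_mem / static_cast<int>(sizeof(float)) - buffer_floats - 7;

  printf("Max context length with float32 logits: %d tokens\n",
         max_context_len);
  return 0;
}
```

With the default 48 KB limit this works out to roughly 11.7k tokens, which is roughly in line with the ~11k figure mentioned in the PR description below.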
vllm/utils.py (outdated)

    # Follows the logic in
    # attention_kernels.cu::single_query_cached_kv_attention_launcher
    max_shared_mem = get_max_shared_mem_bytes()
    float32_bytes = torch.finfo(torch.float).bits // 8
Isn't this always 4?
It technically should be, but writing it this way ensures it holds irrespective of the platform/implementation, and it is also self-documenting.
To my knowledge, the size of float is defined by the IEEE 754 standard and is independent of the underlying machine architecture (unlike integer types). That being said, I like that this is self-documenting. Let's keep it!
@Yard1 LGTM! Thanks again for the PR! I left some very minor style comments. Please fix them before merging.
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
@Yard1 Great! I tested a long prompt using this PR. It hasn't crashed so far. max_seq_len: 16384,

Hi @esmeetu, thanks for reporting the issue. I think that's related to how to set

@WoosukKwon I didn't get what you mean about how to set that parameter. Isn't it set by the scheduler config? 🤔
Makes use of additional shared memory present on compute capability >=7.0 cards to support longer context length in the attention kernel.
See https://stackoverflow.com/questions/63757245/using-maximum-shared-memory-in-cuda for details.
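For context, a minimal sketch of that opt-in follows (the kernel below is just a stub standing in for the real attention kernel, not vLLM's implementation): on compute capability >= 7.0, a launch may request more than the default 48 KB of dynamic shared memory only after cudaFuncSetAttribute raises the kernel's limit.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Stub kernel standing in for the attention kernel; it only touches the
// dynamically allocated shared memory buffer (the logits in the real kernel).
__global__ void attention_kernel_stub(float* out) {
  extern __shared__ float logits[];
  logits[threadIdx.x] = static_cast<float>(threadIdx.x);
  __syncthreads();
  out[threadIdx.x] = logits[threadIdx.x];
}

int main() {
  int device = 0;
  int max_shared_mem = 0;
  cudaDeviceGetAttribute(&max_shared_mem,
                         cudaDevAttrMaxSharedMemoryPerBlockOptin, device);

  // Without this opt-in, any launch requesting more than 48 KB of dynamic
  // shared memory fails, even on cards that physically have more.
  cudaFuncSetAttribute(attention_kernel_stub,
                       cudaFuncAttributeMaxDynamicSharedMemorySize,
                       max_shared_mem);

  float* out = nullptr;
  cudaMalloc(&out, 128 * sizeof(float));
  attention_kernel_stub<<<1, 128, max_shared_mem>>>(out);
  cudaDeviceSynchronize();

  printf("Opted in to %d bytes of shared memory per block\n", max_shared_mem);
  cudaFree(out);
  return 0;
}
```

The PR applies the same idea to the attention kernel so that the float32 logits buffer can grow beyond the default limit.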
As pointed out by @WoosukKwon offline, ideally we would also store logits inside the kernel in float16 instead of float32 as the accuracy loss should be minimal. This will enable even longer context lengths.
Note that the buffer of 512 * sizeof(float32) may be too conservative, but this will still result in more supported tokens than the previous ~11k. The attention test has been run successfully on A10 and A100.
With this PR, the supported context lengths with current kernel (float32 logits) will be:
Closes #905