-
-
Notifications
You must be signed in to change notification settings - Fork 10.7k
[Sampler] Vectorized sampling (simplified) #1048
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@zhuohan123 Could you compare the performance against the original PR? Thanks. |
7B LLaMA on 80GB-A100 throughput before this PR:
After this PR:
|
For the case in #820, this PR:
Current main on my machine:
#820 (merged with main, with main's throughput benchmark script) on my machine:
#820 with benchmark script in #820
|
@WoosukKwon This PR is ready for review. |
@zhuohan123 I've run Current main:
This PR:
Could you take a look at this problem? |
@zhuohan123 Please let me know if the PR is ready for review. |
@WoosukKwon Just fixed the correctness issue. Also slightly boost the performance:
The PR is ready for review now. @WoosukKwon |
I've tried
Is this just coincidence? The two outputs were different from each other when I used current |
Thanks for catching this! Just fixed. The result should match main now. |
Further optimized performance. Now this PR is faster than #820:
|
Same benchmark on the latest commit:
|
@zhuohan123 Sent. Yes, but i current mostly use WizardCoder, and it just work without change kernel precision. Codellama indeed should apply that to improve performance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! I would also advise carrying over the vectorized apply_penalties
code from my PR which is faster than the current numpy-based solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhuohan123 Thanks for the awesome work! The code looks very clean to me. Left minor comments.
@WoosukKwon @Yard1 Thanks for your detailed review! All review comments are fixed. Please take a look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Left minor comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Hi, does this PR change the output of a model, given the same input and random seed? Comparing the output of Old
New
|
@masahi Yes, the PR has changed the outputs of random sampling, because now we use batched operations for sampling. |
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
…lm-project#1048) The make_attn_bias in hpu_model_runner doesn't cover the non-causal embedding model mask set and also vertical mask off is not set when merged prefill is enabled.
Simplified version of #820. This version does not have any complicated torch operations.
cc @Yard1
TODOs: