Use TGI-like incremental detokenization #984
Conversation
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
cc @WoosukKwon

Does this PR overlap with #589?

I didn't look too deeply, but I think they both should fix the issue!
@Yard1 Thanks for submitting the PR! Left comments on the coding style and performance impact.
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Marking as draft until benchmarking is complete, but it does look like it brings extra overhead @WoosukKwon

@WoosukKwon I have reworked the code to combine the correctness of the TGI approach with the high performance of the vLLM approach. The new iteration of this PR is not only more correct than the current vLLM implementation but also more performant. PTAL!
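For readers following along, here is a minimal sketch of the TGI-style incremental detokenization being discussed, assuming a HuggingFace tokenizer and access to the Llama-2 checkpoint referenced in this thread. The function name, offset bookkeeping, and driver loop are illustrative, not the PR's actual code.

```python
from transformers import AutoTokenizer


def detokenize_incrementally(tokenizer, all_token_ids, prefix_offset, read_offset):
    """Return (new_text, new_prefix_offset, new_read_offset).

    Decode a window that starts slightly before the already-emitted position so
    the tokenizer has enough context to place spaces correctly, then emit only
    the text that extends past what was already read.
    """
    prefix_text = tokenizer.decode(all_token_ids[prefix_offset:read_offset])
    full_text = tokenizer.decode(all_token_ids[prefix_offset:])

    if len(full_text) > len(prefix_text) and not full_text.endswith("\ufffd"):
        # The newest token(s) completed valid characters: emit the delta and
        # slide both offsets forward.
        return full_text[len(prefix_text):], read_offset, len(all_token_ids)

    # The tail is an incomplete multi-byte character ("\ufffd"): emit nothing
    # and keep the offsets unchanged so the next call retries with more tokens.
    return "", prefix_offset, read_offset


# Example driver: feed token ids one at a time, as they would be sampled.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
generated, prefix_offset, read_offset, output = [], 0, 0, ""
for token_id in tokenizer.encode("LLMs. It is designed", add_special_tokens=False):
    generated.append(token_id)
    delta, prefix_offset, read_offset = detokenize_incrementally(
        tokenizer, generated, prefix_offset, read_offset)
    output += delta
print(output)  # should match a one-shot tokenizer.decode(generated)
```

The key design point is that only a short window is re-decoded per step, so the per-token cost stays roughly constant regardless of sequence length, which is what keeps the approach fast.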
@Yard1 Could you please update the test as well? I got errors in the test:
@Yard1 Awesome, thanks a lot for the PR! This further boosts our performance.
Left several comments on the style.
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
@Yard1 This is super nice. Thanks again for your hard work!
I found that after applying this PR, some models will produce unexpected tokens when generating non-ASCII text (especially spaces and
Do you have any ideas?

Let me check. I believe the space up front is actually correct, but I'm not sure about the characters at the end.
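To illustrate the non-ASCII part of the question: with byte-fallback tokenizers such as Llama-2's, a single character can span several tokens, so decoding a prefix that cuts a character in half yields the Unicode replacement character. This is only a sketch under that assumption (the emoji and the exact token splits are illustrative, and this is not the code path in vLLM or this PR):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
ids = tok.encode("vLLM 🚀", add_special_tokens=False)

for end in range(1, len(ids) + 1):
    text = tok.decode(ids[:end])
    # While the emoji's UTF-8 bytes are only partially present, the decoded
    # text ends in "\ufffd"; an incremental detokenizer should hold that
    # output back instead of emitting the replacement character.
    print(end, repr(text), text.endswith("\ufffd"))
```

Holding back any text that ends in "\ufffd", as in the sketch further up, is what keeps those stray characters from reaching the client.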
Allows us to simplify the code and ensure that the incremental decoding is correct. Notably, when tests/models/test_models.py is run for meta-llama/Llama-2-7b-hf, this change fixes one of the differing outputs between vLLM and HF.

vLLM before:
vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.It is designed to be used in production

vLLM with this PR:
vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. It is designed to be used in production

Notice the space after the dot.
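As a hedged illustration of where that space goes (not the exact code path in either implementation): decoding tokens in isolation strips each token's SentencePiece leading-space marker, while decoding with the preceding tokens as context preserves it.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
ids = tok.encode("LLMs. It is", add_special_tokens=False)

# Decoding each token on its own drops the "▁" (leading-space) markers,
# so spaces such as the one after the period disappear.
print("".join(tok.decode([i]) for i in ids))  # e.g. 'LLMs.Itis'

# Decoding the growing sequence with context keeps them: 'LLMs. It is'.
print(tok.decode(ids))
```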