Align max_tokens behavior with openai
#852
Conversation
I tested the submitted code and it works fine.
cc @zhuohan123
Hi @HermitSun! Thanks for your contribution. Left a small question about why one specific if needs to be changed.
```python
unlimited_tokens = request.max_tokens == max_model_len
if not unlimited_tokens and token_num + request.max_tokens > max_model_len:
```
Why does this `if` need to be changed?
If `None` is passed directly as `request.max_tokens`, this `if` statement complains with something like `TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'`. So I set `request.max_tokens = max_model_len` to avoid the type mismatch.
And after `request.max_tokens = max_model_len`, `token_num + max_model_len` will always be larger than `max_model_len`, so I added the `unlimited_tokens` condition to avoid tripping the length check.
In fact, this change only covers the special case where we pass `max_tokens=None` in the request to ask for an unlimited number of tokens during generation.
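For illustration, here is a minimal, self-contained sketch of that workaround; the helper name `clamp_max_tokens` and its signature are hypothetical, and the real change lives inside vLLM's `check_length`:

```python
from typing import Optional


def clamp_max_tokens(max_tokens: Optional[int], token_num: int, max_model_len: int) -> int:
    """Hypothetical helper mirroring the workaround: map a missing max_tokens
    to the model's maximum length and skip the length check for that value."""
    if max_tokens is None:
        max_tokens = max_model_len  # sentinel meaning "unlimited"
    unlimited_tokens = max_tokens == max_model_len
    if not unlimited_tokens and token_num + max_tokens > max_model_len:
        raise ValueError(
            f"Requested {token_num + max_tokens} tokens, but this model's "
            f"maximum context length is {max_model_len} tokens."
        )
    return max_tokens
```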
And actually this code may not be able to distinguish the situation mentioned above from the case where we pass `max_tokens=max_model_len` in the request. Maybe there is a better way, like setting a large enough number to represent infinity and then resetting it to the maximum available number?
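A rough sketch of that alternative, purely as an assumption about what it could look like (the names `INF_TOKENS` and `resolve_max_tokens` are illustrative, not from the PR):

```python
import sys

INF_TOKENS = sys.maxsize  # sentinel standing in for "infinite" max_tokens


def resolve_max_tokens(requested, token_num, max_model_len):
    """Map None to the sentinel, then clamp to the tokens still available
    in the context window."""
    if requested is None:
        requested = INF_TOKENS
    return min(requested, max_model_len - token_num)
```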
Can't you just set request.max_tokens to max_model_len - token_num if request.max_tokens is None so that you only allow the model to generate, at maximum, the number of tokens remaining before filling the context window?
```python
@@ async def check_length
...
if request.max_tokens is None:
    request.max_tokens = max_model_len - token_num
elif token_num + request.max_tokens > max_model_len:
    return input_ids, create_error_response(
        HTTPStatus.BAD_REQUEST,
        f"This model's maximum context length is {max_model_len} tokens. "
        f"However, you requested {request.max_tokens + token_num} tokens "
        f"({token_num} in the messages, "
        f"{request.max_tokens} in the completion). "
        f"Please reduce the length of the messages or completion.",
    )
return input_ids, None
```
Thank you for the reminder. This solution is better. I have revised my code.
```python
if request.max_tokens is None:
    request.max_tokens = max_model_len
```
These lines wouldn't be needed in the scenario I proposed above.
Thank you for the reminder. This solution is better. I will revise my code.
vllm/entrypoints/openai/protocol.py (outdated)
```diff
 prompt: Union[List[int], List[List[int]], str, List[str]]
 suffix: Optional[str] = None
-max_tokens: Optional[int] = 16
+max_tokens: Optional[int] = None
```
One last comment: for alignment with OpenAI's Completion API, shouldn't the max number of tokens be left as it was before (16)?
Ah, I got it. The behaviors of `ChatCompletion` and `Completion` are not the same. I will fix it.
I apologize for my carelessness. Maybe I need to have a rest😭.
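For reference, a hedged sketch of how the two request models could differ on this one field (field subsets only, not the full `protocol.py` definitions):

```python
from typing import Optional

from pydantic import BaseModel


class CompletionRequest(BaseModel):
    # OpenAI's Completion API documents a default of 16, so keep it.
    max_tokens: Optional[int] = 16


class ChatCompletionRequest(BaseModel):
    # OpenAI's ChatCompletion API accepts max_tokens=None (no fixed limit).
    max_tokens: Optional[int] = None
```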
I think with the kind help of @mspronesti, this PR is ready for review again @zhuohan123.
Any news on your PR guys @mspronesti @HermitSun? It would be great to merge this.
@chin-jey I think this PR is ready for review. I have requested a review from @zhuohan123.
LGTM! Thanks for your contribution
This PR aims at showcasing how to use vLLM's OpenAI-compatible chat API. Context: LangChain already supports vLLM and its OpenAI-compatible `Completion` API. However, the `ChatCompletion` API was not aligned with OpenAI, and for this reason I've waited for this PR (vllm-project/vllm#852) to be merged before adding this notebook to LangChain.
As OpenAI's API doc says, `max_tokens` is an optional int param which defaults to inf. And since there is no official int inf in Python, we have two ways to trigger this param's default behavior:

1. Do not pass `max_tokens` at all; `pydantic` will set it to `16` for us (see `vllm/vllm/entrypoints/openai/protocol.py`, lines 55 to 61 in 75c0ca9).
2. Pass `None`. This indicates we want to use the default value, as langchain does. And the official OpenAI APIs support passing `max_tokens=None` to `openai.ChatCompletion.create`.

However, vLLM's OpenAI-compatible server will complain when passing `max_tokens=None`. This PR tries to align this behavior with OpenAI: if `max_tokens=None` is passed, we can set it to the model's max length and skip the length check.
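As a usage sketch (assuming a vLLM OpenAI-compatible server running locally on port 8000 and an illustrative model name), a client could then send `max_tokens: null` and let the server fall back to the model's limit:

```python
import requests

payload = {
    "model": "meta-llama/Llama-2-7b-chat-hf",  # assumed model name
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": None,  # serialized as null; the server picks the limit
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```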