Align `max_tokens` behavior with openai by HermitSun · Pull Request #852 · vllm-project/vllm · GitHub

Conversation

@HermitSun
Contributor

As OpenAI's API docs say, max_tokens is an optional integer parameter that defaults to infinity. Since Python has no integer infinity, there are two ways to trigger this parameter's default behavior:

  1. Simply omit this param from the request, and pydantic will set it to 16 for us:

     class ChatCompletionRequest(BaseModel):
         model: str
         messages: Union[str, List[Dict[str, str]]]
         temperature: Optional[float] = 0.7
         top_p: Optional[float] = 1.0
         n: Optional[int] = 1
         max_tokens: Optional[int] = 16
  2. Explicitly pass None. This indicates that we want to use the default value, which is what LangChain does, and the official OpenAI API supports passing max_tokens=None to openai.ChatCompletion.create.

However, vLLM's OpenAI-compatible server will complain when passing max_tokens=None.

This PR tries to align this behavior with OpenAI: if max_tokens=None is passed, we can set it to the model's max_length and skip the length check.
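
For illustration, here is a minimal sketch of the kind of request this PR is meant to support, using the legacy (pre-1.0) openai Python client against a locally running vLLM OpenAI-compatible server; the base URL, API key, and model name below are placeholder assumptions, not taken from this PR:

    import openai

    # Assumed local vLLM OpenAI-compatible server; adjust to your deployment.
    openai.api_base = "http://localhost:8000/v1"
    openai.api_key = "EMPTY"  # placeholder; adjust if your server requires a key

    # Passing max_tokens=None should mean "use the default (unlimited) behavior",
    # just like the official OpenAI API; this PR makes the vLLM server accept it.
    completion = openai.ChatCompletion.create(
        model="my-local-model",  # hypothetical model name
        messages=[{"role": "user", "content": "Hello!"}],
        max_tokens=None,
    )
    print(completion.choices[0].message.content)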

@CZT0 CZT0 left a comment

I tested the submitted code and it works fine

@HermitSun
Contributor Author

cc @zhuohan123

Member

@zhuohan123 zhuohan123 left a comment

Hi @HermitSun! Thanks for your contribution. Left a small question about why one specific if needs to be changed.

Comment on lines 132 to 133
unlimited_tokens = request.max_tokens == max_model_len
if not unlimited_tokens and token_num + request.max_tokens > max_model_len:
Member

Why does this if need to be changed?

Contributor Author

@HermitSun HermitSun Sep 9, 2023

If None is passed directly as request.max_tokens, this if statement complains with something like TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'. So I set request.max_tokens = max_model_len to avoid this type mismatch.

And after request.max_tokens = max_model_len, token_num + max_model_len will always be larger than max_model_len, so I added an unlimited_tokens condition to avoid that.

In fact, this change is only for the special case where we pass max_tokens=None in the request to get an unlimited number of tokens during generation.
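
A minimal standalone sketch of the failure and of the workaround described above (the concrete numbers are made up for illustration):

    max_model_len = 2048   # model context window (illustrative value)
    token_num = 500        # tokens already in the prompt (illustrative value)
    max_tokens = None      # what the client sent

    # The original check `token_num + max_tokens > max_model_len` raises
    # TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
    # when max_tokens is None, so substitute max_model_len first and then
    # skip the comparison for that sentinel value.
    if max_tokens is None:
        max_tokens = max_model_len
    unlimited_tokens = max_tokens == max_model_len
    if not unlimited_tokens and token_num + max_tokens > max_model_len:
        raise ValueError("requested more tokens than the model supports")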

Contributor Author

And actually this code may not be able to distinguish the situation mentioned above from the case where we pass max_tokens=max_model_len in the request. Maybe there is a better way, like setting a large enough number to represent infinity and resetting it to the maximum available number?

Contributor

@mspronesti mspronesti Sep 10, 2023

Can't you just set request.max_tokens to max_model_len - token_num if request.max_tokens is None so that you only allow the model to generate, at maximum, the number of tokens remaining before filling the context window?

@@ async def check_length
   ... 

    if request.max_tokens is None:
        request.max_tokens = max_model_len - token_num
    elif token_num + request.max_tokens > max_model_len:
        return input_ids, create_error_response(
            HTTPStatus.BAD_REQUEST,
            f"This model's maximum context length is {max_model_len} tokens. "
            f"However, you requested {request.max_tokens + token_num} tokens "
            f"({token_num} in the messages, "
            f"{request.max_tokens} in the completion). "
            f"Please reduce the length of the messages or completion.",
        )
    
    return input_ids, None
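
For instance, with a hypothetical max_model_len of 2048 and a 500-token prompt, this would cap request.max_tokens at 1548, i.e. exactly the room left in the context window.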

Contributor Author

Thank you for the reminder. This solution is better. I have revised my code.


Comment on lines 193 to 194
if request.max_tokens is None:
request.max_tokens = max_model_len
Contributor

These lines wouldn't be needed in the scenario I proposed above.

Contributor Author

Thank you for the reminder. This solution is better. I will revise my code.

  prompt: Union[List[int], List[List[int]], str, List[str]]
  suffix: Optional[str] = None
- max_tokens: Optional[int] = 16
+ max_tokens: Optional[int] = None
Contributor

@mspronesti mspronesti Sep 10, 2023

One last comment: for alignment with OpenAI's completion API, shouldn't the max number of tokens be left as it was before (16)?

Contributor Author

Ah, I got it. The behaviors of ChatCompletion and Completion are not the same. I will fix it.

I apologize for my carelessness. Maybe I need to have a rest😭.
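
To summarize the distinction being discussed here, a sketch of how the two pydantic request models could declare their defaults after this fix (field lists abridged to the ones quoted in this thread; anything beyond those is an assumption):

    from typing import Dict, List, Optional, Union

    from pydantic import BaseModel


    class ChatCompletionRequest(BaseModel):
        model: str
        messages: Union[str, List[Dict[str, str]]]
        # OpenAI's chat completions API treats max_tokens as unlimited by default,
        # so None is used here and resolved against the context window server-side.
        max_tokens: Optional[int] = None


    class CompletionRequest(BaseModel):
        model: str
        prompt: Union[List[int], List[List[int]], str, List[str]]
        suffix: Optional[str] = None
        # OpenAI's completions API defaults max_tokens to 16, so that is kept.
        max_tokens: Optional[int] = 16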

@HermitSun
Contributor Author

I think that with the kind help of @mspronesti, this PR is ready for review again, @zhuohan123.

@chin-jey

Any news on your PR, @mspronesti @HermitSun? It would be great to merge this.

@HermitSun
Contributor Author

@chin-jey I think this PR is ready for review. I have requested a review from @zhuohan123.

Member

@zhuohan123 zhuohan123 left a comment

LGTM! Thanks for your contribution

@zhuohan123 zhuohan123 merged commit bbbf865 into vllm-project:main Sep 24, 2023
baskaryan pushed a commit to langchain-ai/langchain that referenced this pull request Sep 25, 2023

This PR aims at showcasing how to use vLLM's OpenAI-compatible chat API.

### Context
LangChain already supports vLLM and its OpenAI-compatible `Completion`
API. However, the `ChatCompletion` API was not aligned with OpenAI and
for this reason I've waited for this
[PR](vllm-project/vllm#852) to be merged before
adding this notebook to langchain.
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
yma11 pushed a commit to yma11/vllm that referenced this pull request Feb 21, 2025
Add upper range for transformer version in commons for failing Test /
test_lazy_outlines and Test / test_guided_generate

---------

Co-authored-by: Michael Goin <michael@neuralmagic.com>
amy-why-3459 pushed a commit to amy-why-3459/vllm that referenced this pull request Sep 15, 2025
…ject#852)

Implement save kv cache logic for v1 disaggregated prefill in ascend
scheduler

This PR adds support for saving kv cache in the ascend scheduler, which
is part of the v1 disaggregated prefill design. The load functionality
is not yet implemented.

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>