Align `max_tokens` behavior with openai by HermitSun · Pull Request #852 · vllm-project/vllm · GitHub

Conversation

@HermitSun
Contributor

As OpenAI's API docs say, max_tokens is an optional integer parameter that defaults to infinity. Since Python has no integer infinity, there are two ways to trigger this parameter's default behavior:

  1. Simply omit this param from the request, and pydantic will set it to 16 for us:

     class ChatCompletionRequest(BaseModel):
         model: str
         messages: Union[str, List[Dict[str, str]]]
         temperature: Optional[float] = 0.7
         top_p: Optional[float] = 1.0
         n: Optional[int] = 1
         max_tokens: Optional[int] = 16
  2. Explicitly pass None. This indicates that we want to use the default value, which is what LangChain does, and the official OpenAI API supports passing max_tokens=None to openai.ChatCompletion.create.

However, vLLM's OpenAI-compatible server will complain when passing max_tokens=None.

This PR tries to align this behavior with OpenAI: if max_tokens=None is passed, we can set it to the model's max_length and skip the length check.
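
For illustration, here is a minimal sketch of the kind of request this PR is meant to support, using the legacy (pre-1.0) openai Python client against a locally running vLLM OpenAI-compatible server; the base URL, API key, and model name below are placeholder assumptions, not taken from this PR:

    import openai

    # Assumed local vLLM OpenAI-compatible server; adjust to your deployment.
    openai.api_base = "http://localhost:8000/v1"
    openai.api_key = "EMPTY"  # placeholder; adjust if your server requires a key

    # Passing max_tokens=None should mean "use the default (unlimited) behavior",
    # just like the official OpenAI API; this PR makes the vLLM server accept it.
    completion = openai.ChatCompletion.create(
        model="my-local-model",  # hypothetical model name
        messages=[{"role": "user", "content": "Hello!"}],
        max_tokens=None,
    )
    print(completion.choices[0].message.content)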

@CZT0 CZT0 left a comment

I tested the submitted code and it works fine

@HermitSun
Contributor Author

cc @zhuohan123

Member

@zhuohan123 zhuohan123 left a comment

Hi @HermitSun! Thanks for your contribution. Left a small question about why one specific if needs to be changed.

Comment on lines 132 to 133
unlimited_tokens = request.max_tokens == max_model_len
if not unlimited_tokens and token_num + request.max_tokens > max_model_len:
Member

Why does this if need to be changed?

Contributor Author

@HermitSun HermitSun Sep 9, 2023

If None is passed directly as request.max_tokens, this if statement complains with something like TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'. So I set request.max_tokens = max_model_len to avoid this type mismatch.

And after request.max_tokens = max_model_len, token_num + max_model_len will always be larger than max_model_len, so I added an unlimited_tokens condition to avoid that.

In fact, this change is only for the special case where we pass max_tokens=None in the request to get an unlimited number of tokens during generation.
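
A minimal standalone sketch of the failure and of the workaround described above (the concrete numbers are made up for illustration):

    max_model_len = 2048   # model context window (illustrative value)
    token_num = 500        # tokens already in the prompt (illustrative value)
    max_tokens = None      # what the client sent

    # The original check `token_num + max_tokens > max_model_len` raises
    # TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
    # when max_tokens is None, so substitute max_model_len first and then
    # skip the comparison for that sentinel value.
    if max_tokens is None:
        max_tokens = max_model_len
    unlimited_tokens = max_tokens == max_model_len
    if not unlimited_tokens and token_num + max_tokens > max_model_len:
        raise ValueError("requested more tokens than the model supports")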

Contributor Author

And actually this code may not be able to distinguish the situation mentioned above from the case where we pass max_tokens=max_model_len in the request. Maybe there is a better way, like setting a large enough number to represent infinity and resetting it to the maximum available number?

Contributor

@mspronesti mspronesti Sep 10, 2023

Can't you just set request.max_tokens to max_model_len - token_num if request.max_tokens is None so that you only allow the model to generate, at maximum, the number of tokens remaining before filling the context window?

@@ async def check_length
   ... 

    if request.max_tokens is None:
        request.max_tokens = max_model_len - token_num
    elif token_num + request.max_tokens > max_model_len:
        return input_ids, create_error_response(
            HTTPStatus.BAD_REQUEST,
            f"This model's maximum context length is {max_model_len} tokens. "
            f"However, you requested {request.max_tokens + token_num} tokens "
            f"({token_num} in the messages, "
            f"{request.max_tokens} in the completion). "
            f"Please reduce the length of the messages or completion.",
        )
    
    return input_ids, None
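
For instance, with a hypothetical max_model_len of 2048 and a 500-token prompt, this would cap request.max_tokens at 1548, i.e. exactly the room left in the context window.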

Contributor Author

Thank you for the reminder. This solution is better. I have revised my code.


Comment on lines 193 to 194
if request.max_tokens is None:
request.max_tokens = max_model_len
Contributor

These lines wouldn't be needed in the scenario I proposed above.

Contributor Author

Thank you for the reminder. This solution is better. I will revise my code.

  prompt: Union[List[int], List[List[int]], str, List[str]]
  suffix: Optional[str] = None
- max_tokens: Optional[int] = 16
+ max_tokens: Optional[int] = None
Contributor

@mspronesti mspronesti Sep 10, 2023

One last comment: for alignment with OpenAI's completion API, shouldn't the max number of tokens be left as it was before (16)?

Contributor Author

Ah, I got it. The behaviors of ChatCompletion and Completion are not the same. I will fix it.

I apologize for my carelessness. Maybe I need to have a rest😭.
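
To summarize the distinction being discussed here, a sketch of how the two pydantic request models could declare their defaults after this fix (field lists abridged to the ones quoted in this thread; anything beyond those is an assumption):

    from typing import Dict, List, Optional, Union

    from pydantic import BaseModel


    class ChatCompletionRequest(BaseModel):
        model: str
        messages: Union[str, List[Dict[str, str]]]
        # OpenAI's chat completions API treats max_tokens as unlimited by default,
        # so None is used here and resolved against the context window server-side.
        max_tokens: Optional[int] = None


    class CompletionRequest(BaseModel):
        model: str
        prompt: Union[List[int], List[List[int]], str, List[str]]
        suffix: Optional[str] = None
        # OpenAI's completions API defaults max_tokens to 16, so that is kept.
        max_tokens: Optional[int] = 16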

@HermitSun
Contributor Author

I think that with the kind help of @mspronesti, this PR is ready for review again, @zhuohan123.

@chin-jey

Any news on your PR, @mspronesti @HermitSun? It would be great to merge this.

@HermitSun
Contributor Author

@chin-jey I think this PR is ready for review. I have requested a review from @zhuohan123.

Member

@zhuohan123 zhuohan123 left a comment

LGTM! Thanks for your contribution

@zhuohan123 zhuohan123 merged commit bbbf865 into vllm-project:main Sep 24, 2023
baskaryan pushed a commit to langchain-ai/langchain that referenced this pull request Sep 25, 2023

This PR aims at showcasing how to use vLLM's OpenAI-compatible chat API.

### Context
LangChain already supports vLLM and its OpenAI-compatible `Completion`
API. However, the `ChatCompletion` API was not aligned with OpenAI and
for this reason I've waited for this
[PR](vllm-project/vllm#852) to be merged before
adding this notebook to langchain.
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
yma11 pushed a commit to yma11/vllm that referenced this pull request Feb 21, 2025
Add upper range for transformer version in commons for failing Test /
test_lazy_outlines and Test / test_guided_generate

---------

Co-authored-by: Michael Goin <michael@neuralmagic.com>
amy-why-3459 pushed a commit to amy-why-3459/vllm that referenced this pull request Sep 15, 2025
…ject#852)

Implement save kv cache logic for v1 disaggregated prefill in ascend
scheduler

This PR adds support for saving kv cache in the ascend scheduler, which
is part of the v1 disaggregated prefill design. The load functionality
is not yet implemented.

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>