[TRTLLM-5838][fix] fix max batch size and max tokens in kv cache estimations for Nemotron-H #5371
Conversation
…mba cache memory estimation Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…ench Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…CacheCreator Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…-bench throughput command Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…MambaHybridCacheManager) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…Manager, and explicit call to MambaCacheManager and KVCacheManager functions in MambaHybridCacheManager to reduce confusion Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…esult of is_nemotron_hybrid to increase readability Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Pull Request Overview
This PR fixes the estimation of the maximum batch size and the maximum token count for the KV cache when running hybrid models such as Nemotron-H. It ensures that only attention layers are counted in the KV cache estimations and that mamba cache memory is also taken into account. Key changes include:
- Adjusting the byte-per-token calculation to count only attention layers using the hybrid override pattern.
- Propagating the kv_cache_gpu_mem_fraction CLI argument and applying a conservative adjustment for mamba hybrid models.
- Refactoring resource manager methods to better handle mamba cache blocks and releasing cache memory upon shutdown.
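To make the first bullet concrete, here is a hypothetical sketch (not the actual TensorRT-LLM code) of a bytes-per-token estimate that counts only attention layers. The pattern convention ('*' = attention, 'M' = mamba, '-' = MLP) follows Nemotron-H-style `hybrid_override_pattern` strings; the function name and all numbers are illustrative.

```python
from typing import Optional


def kv_cache_bytes_per_token(hybrid_override_pattern: Optional[str],
                             num_layers: int,
                             num_kv_heads: int,
                             head_dim: int,
                             bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache needed per token, attention layers only."""
    if hybrid_override_pattern:
        # Hybrid model: only attention layers ('*') hold KV cache.
        num_attention_layers = hybrid_override_pattern.count("*")
    else:
        # Pure-attention model: every layer holds KV cache.
        num_attention_layers = num_layers
    # Factor of 2 covers both the K and the V tensors.
    return 2 * num_attention_layers * num_kv_heads * head_dim * bytes_per_elem


# 8-layer toy hybrid with 3 attention layers vs. counting all 8 layers.
print(kv_cache_bytes_per_token("M*M*M-M*", 8, 8, 128))  # 12288
print(kv_cache_bytes_per_token(None, 8, 8, 128))        # 32768
```

The gap between the two results shows why the pre-fix estimate (all layers) badly overstates the per-token cost for hybrid models, making the token budget far too small.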
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tensorrt_llm/bench/build/tuning.py | Adjustments to KV cache estimations and logging for hybrid models. |
| tensorrt_llm/bench/build/dataclasses.py | Adding hybrid_override_pattern and mamba_config fields to model configurations. |
| tensorrt_llm/bench/build/build.py | Propagation of kv_cache_gpu_mem_fraction into benchmark engine settings. |
| tensorrt_llm/bench/benchmark/utils/general.py | Passing the new CLI argument for kv_cache memory fraction. |
| tensorrt_llm/_torch/pyexecutor/resource_manager.py | Renaming and refactoring resource methods and adding a shutdown method for mamba cache release. |
| tensorrt_llm/_torch/pyexecutor/config_utils.py | Updating hybrid check logic using getattr. |
| tensorrt_llm/_torch/pyexecutor/_util.py | Adjusting the cache size calculation using attention layers in hybrid models. |
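The config_utils.py row mentions a getattr-based hybrid check. A hedged sketch of that idiom follows; the config class here is a stand-in, and only the `hybrid_override_pattern` attribute name is taken from this PR.

```python
class PretrainedConfig:
    """Stand-in for the real model config object."""

    def __init__(self, hybrid_override_pattern=None):
        # Older, attention-only configs simply lack the attribute.
        if hybrid_override_pattern is not None:
            self.hybrid_override_pattern = hybrid_override_pattern


def is_nemotron_hybrid(config) -> bool:
    # getattr with a default avoids AttributeError on configs that
    # predate the hybrid_override_pattern field.
    return getattr(config, "hybrid_override_pattern", None) is not None


print(is_nemotron_hybrid(PretrainedConfig("M*M-")))  # True
print(is_nemotron_hybrid(PretrainedConfig()))        # False
```

Using `getattr` with a default keeps the check safe across config classes that never declare the field, which is why it is preferable to direct attribute access here.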
Comments suppressed due to low confidence (1)
tensorrt_llm/bench/build/tuning.py:95
- Consider adding an inline comment to explain the rationale behind squaring kv_cache_gpu_mem_fraction for mamba hybrid models, as it improves clarity on why a more conservative memory fraction is applied.
kv_cache_gpu_mem_fraction *= kv_cache_gpu_mem_fraction
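An illustration of the adjustment flagged above: for a fraction in (0, 1), squaring always yields a smaller value, so squaring `kv_cache_gpu_mem_fraction` leaves extra headroom for the mamba cache of hybrid models, whose size is not reflected in the per-token KV cost. The 0.9 below is illustrative, not a value from the PR.

```python
kv_cache_gpu_mem_fraction = 0.9
# Squaring a fraction < 1 shrinks it, giving a more conservative budget.
kv_cache_gpu_mem_fraction *= kv_cache_gpu_mem_fraction
print(kv_cache_gpu_mem_fraction)  # approximately 0.81
```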
/bot run

PR_Github #9520 [ run ] triggered by Bot
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
/bot run

PR_Github #9532 [ run ] triggered by Bot
PR_Github #9520 [ run ] completed with state
PR_Github #9532 [ run ] completed with state
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
/bot run

PR_Github #9698 [ run ] triggered by Bot
PR_Github #10055 [ run ] completed with state
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…ba specific) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…ng -> disable_optimistic_tuning Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…delConfig and use it when estimating cache size per token Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
/bot run

1 similar comment

/bot run

PR_Github #11049 [ run ] triggered by Bot
PR_Github #11049 [ run ] completed with state
Approving, since these changes are on the trtllm-bench side.
LGTM.
Great stuff! 💪
…mations for Nemotron-H (NVIDIA#5371) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com> Signed-off-by: Yuxin <yuxinz@nvidia.com>
In several places in the code, the number of available KV cache tokens and/or the maximum batch size are estimated from the available free GPU memory and the memory size of a single KV cache entry. These estimations did not account for hybrid models like Nemotron-H, in which not all layers are attention layers that require a KV cache.
Changes in this PR:
- Count only attention layers when estimating the KV cache size per token in the trtllm-bench throughput command
- Take mamba cache memory into account in trtllm-bench throughput and in KvCacheCreator
- Propagate kv_cache_gpu_mem_fraction from the trtllm-bench throughput CLI arg to the function that estimates the maximum batch size
- Release mamba cache memory when MambaHybridCacheManager is shut down (+ a small refactor to increase readability of MambaHybridCacheManager).
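Putting the pieces together, the fixed estimation flow described above can be sketched end to end: scale free GPU memory by kv_cache_gpu_mem_fraction, subtract the mamba cache (which is per-slot, not per-token), and divide by the per-token KV cost of the attention layers only. This is a hypothetical illustration; the function name and all numbers are made up for the example.

```python
def estimate_max_kv_tokens(free_gpu_mem_bytes: int,
                           kv_cache_gpu_mem_fraction: float,
                           mamba_cache_bytes: int,
                           kv_bytes_per_token: int) -> int:
    """Token budget for the KV cache after reserving the mamba cache."""
    budget = int(free_gpu_mem_bytes * kv_cache_gpu_mem_fraction)
    # The mamba cache is a fixed-size allocation per batch slot, so it is
    # subtracted from the budget rather than folded into the per-token cost.
    budget -= mamba_cache_bytes
    return max(budget, 0) // kv_bytes_per_token


# 80 GiB free, 90% fraction, 4 GiB mamba cache, 16 KiB of KV per token.
tokens = estimate_max_kv_tokens(80 << 30, 0.9, 4 << 30, 16 << 10)
print(tokens)
```

Before this PR, an estimator of this shape would neither subtract the mamba cache nor shrink the per-token cost to attention layers only, so the two errors pushed the budget in opposite directions and could still overflow GPU memory.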