[TRTLLM-5838][fix] fix max batch size and max tokens in kv cache estimations for Nemotron-H by tomeras91 · Pull Request #5371 · NVIDIA/TensorRT-LLM · GitHub

Conversation

@tomeras91
Collaborator

Throughout the code there are a few places where the number of available kv cache tokens and/or the maximum batch size is estimated. These estimates are based on the free GPU memory and on the memory size of a single kv cache entry. They did not account for hybrid models like Nemotron-H, in which only some layers are attention layers that require a kv cache.

Changes in this PR:

  • consider mamba cache for max batch size estimation in trtllm-bench throughput command
  • take only attention layers into account when estimating the maximum number of tokens in the kv cache, both in trtllm-bench throughput and in KvCacheCreator (a rough sketch of this estimation follows this list)
  • propagate kv_cache_gpu_mem_fraction from trtllm-bench throughput CLI arg to the function that estimates the maximum batch size
  • release mamba cache memory when MambaHybridCacheManager is shut down (+ a small refactor to increase readability of MambaHybridCacheManager).
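
A minimal sketch of the kind of estimation described above (illustrative only, not the actual TensorRT-LLM code; the names, signatures, and dtype-size default are my own assumptions):

    def estimate_max_kv_cache_tokens(free_gpu_mem_bytes: int,
                                     kv_cache_gpu_mem_fraction: float,
                                     num_attention_layers: int,
                                     num_kv_heads: int,
                                     head_dim: int,
                                     kv_dtype_bytes: int = 2) -> int:
        """Rough upper bound on how many tokens fit in the kv cache.

        Only attention layers hold kv cache entries, so for hybrid models
        like Nemotron-H this must count attention layers, not all layers.
        """
        # Factor of 2 accounts for both keys and values per token.
        bytes_per_token = 2 * num_attention_layers * num_kv_heads * head_dim * kv_dtype_bytes
        kv_cache_budget = int(free_gpu_mem_bytes * kv_cache_gpu_mem_fraction)
        return kv_cache_budget // bytes_per_token

    def estimate_max_batch_size(free_gpu_mem_bytes: int,
                                kv_cache_bytes_per_request: int,
                                mamba_cache_bytes_per_request: int) -> int:
        """Rough max batch size once each request also reserves a mamba cache slot."""
        bytes_per_request = kv_cache_bytes_per_request + mamba_cache_bytes_per_request
        return max(1, free_gpu_mem_bytes // bytes_per_request)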

…mba cache memory estimation

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…ench

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…CacheCreator

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…-bench throughput command

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…MambaHybridCacheManager)

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…Manager, and explicit call to MambaCacheManager and KVCacheManager functions in MambaHybridCacheManager to reduce confusion

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
@tomeras91 tomeras91 requested a review from a team as a code owner June 19, 2025 12:03

…esult of is_nemotron_hybrid to increase readability

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
@tomeras91 tomeras91 requested a review from Copilot June 19, 2025 12:41

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
@tomeras91 tomeras91 requested a review from Copilot June 19, 2025 12:45
Contributor

Copilot AI left a comment


Pull Request Overview

This PR fixes the estimation of maximum batch size and maximum token count for the KV cache when using hybrid models like Nemotron-H, ensuring that only attention layers are considered in the KV cache estimations and that mamba cache memory is also taken into account. Key changes include:

  • Adjusting the byte-per-token calculation to count only attention layers using the hybrid override pattern (see the sketch after this list).
  • Propagating the kv_cache_gpu_mem_fraction CLI argument and applying a conservative adjustment for mamba hybrid models.
  • Refactoring resource manager methods to better handle mamba cache blocks and releasing cache memory upon shutdown.
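
A hedged sketch of counting attention layers from a hybrid override pattern (the "'*' marks an attention layer" convention and the helper name are assumptions for illustration, not necessarily the PR's exact code):

    from typing import Optional

    def count_attention_layers(hybrid_override_pattern: Optional[str],
                               num_hidden_layers: int) -> int:
        """Number of layers that actually need kv cache entries.

        Assumes a per-layer pattern string in which '*' marks an attention
        layer and other characters mark mamba/MLP layers; falls back to
        treating every layer as attention when no pattern is provided.
        """
        if not hybrid_override_pattern:
            return num_hidden_layers
        return hybrid_override_pattern.count("*")

Such a count would replace the total layer count in the bytes-per-token estimate sketched earlier in this thread.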

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Summary per file:

  • tensorrt_llm/bench/build/tuning.py: adjustments to KV cache estimations and logging for hybrid models.
  • tensorrt_llm/bench/build/dataclasses.py: adds hybrid_override_pattern and mamba_config fields to model configurations.
  • tensorrt_llm/bench/build/build.py: propagates kv_cache_gpu_mem_fraction into benchmark engine settings.
  • tensorrt_llm/bench/benchmark/utils/general.py: passes the new CLI argument for the kv cache memory fraction.
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py: renames and refactors resource methods and adds a shutdown method for mamba cache release.
  • tensorrt_llm/_torch/pyexecutor/config_utils.py: updates the hybrid check logic using getattr.
  • tensorrt_llm/_torch/pyexecutor/_util.py: adjusts the cache size calculation using attention layers in hybrid models.
Comments suppressed due to low confidence (1)

tensorrt_llm/bench/build/tuning.py:95

  • Consider adding an inline comment to explain the rationale behind squaring kv_cache_gpu_mem_fraction for mamba hybrid models, as it improves clarity on why a more conservative memory fraction is applied.
        kv_cache_gpu_mem_fraction *= kv_cache_gpu_mem_fraction
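
For intuition (an illustrative note, not part of the PR): squaring a fraction in (0, 1) strictly shrinks it, so the kv cache claims a smaller share of free memory and the remainder has headroom for the mamba cache:

    kv_cache_gpu_mem_fraction = 0.9
    # Squaring a value between 0 and 1 makes it smaller, i.e. a more
    # conservative kv cache budget for mamba hybrid models.
    kv_cache_gpu_mem_fraction *= kv_cache_gpu_mem_fraction  # 0.9 -> 0.81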

@tomeras91
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #9520 [ run ] triggered by Bot

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
@tomeras91
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #9532 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #9520 [ run ] completed with state ABORTED

@tensorrt-cicd
Collaborator

PR_Github #9532 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #6994 completed with status: 'FAILURE'

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
@tomeras91
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #9698 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #10055 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #7422 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

tomeras91 added 5 commits July 2, 2025 16:39
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…ba specific)

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
tomeras91 added 5 commits July 2, 2025 19:01
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…ng -> disable_optimistic_tuning

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…delConfig and use it when estimating cache size per token

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
@tomeras91 tomeras91 requested a review from a team as a code owner July 3, 2025 09:42
@tomeras91 tomeras91 requested review from HuiGao-NV and yilin-void July 3, 2025 09:42
@tomeras91
Collaborator Author

/bot run

@tomeras91
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #11049 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #11049 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8169 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

Collaborator

@FrankD412 FrankD412 left a comment


Approving since these changes are from the trtllm-bench side.

Collaborator

@Naveassaf Naveassaf left a comment


LGTM.

Great stuff! 💪

@Naveassaf Naveassaf merged commit 5aa958a into NVIDIA:main Jul 9, 2025
3 checks passed
@tomeras91 tomeras91 deleted the fix-trtllm-bench-for-nemotron-h branch July 9, 2025 09:21
@tomeras91 tomeras91 restored the fix-trtllm-bench-for-nemotron-h branch July 9, 2025 09:22
zhou-yuxin pushed a commit to zhou-yuxin/TensorRT-LLM that referenced this pull request Jul 15, 2025
…mations for Nemotron-H (NVIDIA#5371)

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Yuxin <yuxinz@nvidia.com>