[TRTLLM-5838][fix] fix max batch size and max tokens in kv cache estimations for Nemotron-H #5371
Conversation
…mba cache memory estimation Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…ench Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…CacheCreator Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…-bench throughput command Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…MambaHybridCacheManager) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…Manager, and explicit call to MambaCacheManager and KVCacheManager functions in MambaHybridCacheManager to reduce confusion Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…esult of is_nemotron_hybrid to increase readability Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Pull Request Overview
This PR fixes the estimation of the maximum batch size and the maximum token count for the KV cache when running hybrid models such as Nemotron-H. It ensures that only attention layers are counted in the KV cache estimations and that mamba cache memory is also taken into account. Key changes include:
- Adjusting the byte-per-token calculation to count only attention layers using the hybrid override pattern.
- Propagating the kv_cache_gpu_mem_fraction CLI argument and applying a conservative adjustment for mamba hybrid models.
- Refactoring resource manager methods to better handle mamba cache blocks and releasing cache memory upon shutdown.
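To make the first bullet concrete, here is a hypothetical sketch (not the actual TensorRT-LLM code) of a bytes-per-token estimate that counts only attention layers. The pattern convention ('*' = attention, 'M' = mamba, '-' = MLP) follows Nemotron-H-style `hybrid_override_pattern` strings; the function name and all numbers are illustrative.

```python
from typing import Optional


def kv_cache_bytes_per_token(hybrid_override_pattern: Optional[str],
                             num_layers: int,
                             num_kv_heads: int,
                             head_dim: int,
                             bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache needed per token, attention layers only."""
    if hybrid_override_pattern:
        # Hybrid model: only attention layers ('*') hold KV cache.
        num_attention_layers = hybrid_override_pattern.count("*")
    else:
        # Pure-attention model: every layer holds KV cache.
        num_attention_layers = num_layers
    # Factor of 2 covers both the K and the V tensors.
    return 2 * num_attention_layers * num_kv_heads * head_dim * bytes_per_elem


# 8-layer toy hybrid with 3 attention layers vs. counting all 8 layers.
print(kv_cache_bytes_per_token("M*M*M-M*", 8, 8, 128))  # 12288
print(kv_cache_bytes_per_token(None, 8, 8, 128))        # 32768
```

The gap between the two results shows why the pre-fix estimate (all layers) badly overstates the per-token cost for hybrid models, making the token budget far too small.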
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tensorrt_llm/bench/build/tuning.py | Adjustments to KV cache estimations and logging for hybrid models. |
| tensorrt_llm/bench/build/dataclasses.py | Adding hybrid_override_pattern and mamba_config fields to model configurations. |
| tensorrt_llm/bench/build/build.py | Propagation of kv_cache_gpu_mem_fraction into benchmark engine settings. |
| tensorrt_llm/bench/benchmark/utils/general.py | Passing the new CLI argument for kv_cache memory fraction. |
| tensorrt_llm/_torch/pyexecutor/resource_manager.py | Renaming and refactoring resource methods and adding a shutdown method for mamba cache release. |
| tensorrt_llm/_torch/pyexecutor/config_utils.py | Updating hybrid check logic using getattr. |
| tensorrt_llm/_torch/pyexecutor/_util.py | Adjusting the cache size calculation using attention layers in hybrid models. |
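The config_utils.py row mentions a getattr-based hybrid check. A hedged sketch of that idiom follows; the config class here is a stand-in, and only the `hybrid_override_pattern` attribute name is taken from this PR.

```python
class PretrainedConfig:
    """Stand-in for the real model config object."""

    def __init__(self, hybrid_override_pattern=None):
        # Older, attention-only configs simply lack the attribute.
        if hybrid_override_pattern is not None:
            self.hybrid_override_pattern = hybrid_override_pattern


def is_nemotron_hybrid(config) -> bool:
    # getattr with a default avoids AttributeError on configs that
    # predate the hybrid_override_pattern field.
    return getattr(config, "hybrid_override_pattern", None) is not None


print(is_nemotron_hybrid(PretrainedConfig("M*M-")))  # True
print(is_nemotron_hybrid(PretrainedConfig()))        # False
```

Using `getattr` with a default keeps the check safe across config classes that never declare the field, which is why it is preferable to direct attribute access here.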
Comments suppressed due to low confidence (1)
tensorrt_llm/bench/build/tuning.py:95
- Consider adding an inline comment to explain the rationale behind squaring kv_cache_gpu_mem_fraction for mamba hybrid models, as it improves clarity on why a more conservative memory fraction is applied.
kv_cache_gpu_mem_fraction *= kv_cache_gpu_mem_fraction
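An illustration of the adjustment flagged above: for a fraction in (0, 1), squaring always yields a smaller value, so squaring `kv_cache_gpu_mem_fraction` leaves extra headroom for the mamba cache of hybrid models, whose size is not reflected in the per-token KV cost. The 0.9 below is illustrative, not a value from the PR.

```python
kv_cache_gpu_mem_fraction = 0.9
# Squaring a fraction < 1 shrinks it, giving a more conservative budget.
kv_cache_gpu_mem_fraction *= kv_cache_gpu_mem_fraction
print(kv_cache_gpu_mem_fraction)  # approximately 0.81
```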
/bot run

PR_Github #9520 [ run ] triggered by Bot
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
/bot run

PR_Github #9532 [ run ] triggered by Bot
PR_Github #9520 [ run ] completed with state
PR_Github #9532 [ run ] completed with state
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
/bot run

PR_Github #9698 [ run ] triggered by Bot
PR_Github #10055 [ run ] completed with state
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…ba specific) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…ng -> disable_optimistic_tuning Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…delConfig and use it when estimating cache size per token Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
/bot run

1 similar comment

/bot run

PR_Github #11049 [ run ] triggered by Bot
PR_Github #11049 [ run ] completed with state
Approving, since these changes are on the trtllm-bench side.
LGTM.
Great stuff! 💪
…mations for Nemotron-H (NVIDIA#5371) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com> Signed-off-by: Yuxin <yuxinz@nvidia.com>
In several places in the code, the number of available KV cache tokens and/or the maximum batch size are estimated from the available free GPU memory and the memory size of a single KV cache entry. These estimations did not account for hybrid models like Nemotron-H, in which not all layers are attention layers that require a KV cache.
Changes in this PR:
- Count only attention layers when estimating the KV cache size per token in the trtllm-bench throughput command
- Take mamba cache memory into account in trtllm-bench throughput and in KvCacheCreator
- Propagate kv_cache_gpu_mem_fraction from the trtllm-bench throughput CLI arg to the function that estimates the maximum batch size
- Release mamba cache memory when MambaHybridCacheManager is shut down (+ a small refactor to increase readability of MambaHybridCacheManager).
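Putting the pieces together, the fixed estimation flow described above can be sketched end to end: scale free GPU memory by kv_cache_gpu_mem_fraction, subtract the mamba cache (which is per-slot, not per-token), and divide by the per-token KV cost of the attention layers only. This is a hypothetical illustration; the function name and all numbers are made up for the example.

```python
def estimate_max_kv_tokens(free_gpu_mem_bytes: int,
                           kv_cache_gpu_mem_fraction: float,
                           mamba_cache_bytes: int,
                           kv_bytes_per_token: int) -> int:
    """Token budget for the KV cache after reserving the mamba cache."""
    budget = int(free_gpu_mem_bytes * kv_cache_gpu_mem_fraction)
    # The mamba cache is a fixed-size allocation per batch slot, so it is
    # subtracted from the budget rather than folded into the per-token cost.
    budget -= mamba_cache_bytes
    return max(budget, 0) // kv_bytes_per_token


# 80 GiB free, 90% fraction, 4 GiB mamba cache, 16 KiB of KV per token.
tokens = estimate_max_kv_tokens(80 << 30, 0.9, 4 << 30, 16 << 10)
print(tokens)
```

Before this PR, an estimator of this shape would neither subtract the mamba cache nor shrink the per-token cost to attention layers only, so the two errors pushed the budget in opposite directions and could still overflow GPU memory.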