[None][doc] Update kvcache part #7549
Conversation
📝 Walkthrough

Documentation updates clarifying KV cache configuration and retention semantics, adding KvCacheRetentionConfig API examples, reformatting code blocks, and removing/relaxing speculative-decoding constraints (the overlap scheduler is auto-disabled for two-model setups). Minor wording and capitalization edits across related docs.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant U as User
    participant L as LLM API
    participant E as Engine
    participant K as KV Cache Manager
    U->>L: llm.generate(prompts, kv_cache_config, kv_cache_retention_config)
    L->>E: initialize/dispatch with configs
    E->>K: allocate/tag KV blocks (respect kv_cache_config)
    rect rgb(220,235,255)
        note over K: Apply retention policy<br/>token-range priorities & default priority
        E->>K: tag blocks with priorities/durations
    end
    alt Reuse eligible
        E->>K: reuse high-priority blocks
    else Eviction needed
        K-->K: evict lower-priority blocks first
    end
    E-->>L: token stream
    L-->>U: generated output
```
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Force-pushed from 94a6315 to 9b44c22
Actionable comments posted: 2
🧹 Nitpick comments (8)
docs/source/features/speculative-decoding.md (3)
188-189: Grammar: add article and tighten phrasing.
“Two-model based speculation implementations do not support overlap scheduler. It will be disabled automatically.” → improve clarity.

```diff
-Two-model based speculation implementations do not support overlap scheduler. It will be disabled automatically.
+Two-model-based speculation implementations do not support the overlap scheduler; it is disabled automatically.
```
43-44: Make overlap-scheduler guidance consistent with the “auto-disabled” statement.
Examples still pass `disable_overlap_scheduler=True` unconditionally. Either remove it for two-model setups or gate it on `eagle3_one_model`.

```diff
-llm = LLM("/path/to/target_model", speculative_config=speculative_config, disable_overlap_scheduler=True)
+llm = LLM("/path/to/target_model", speculative_config=speculative_config)
```

```diff
-# Only need to disable overlap scheduler if eagle3_one_model is False.
-llm = LLM("/path/to/target_model", speculative_config=speculative_config, disable_overlap_scheduler=True)
+# Disable overlap scheduler only for the two-model path.
+llm = LLM("/path/to/target_model",
+          speculative_config=speculative_config,
+          disable_overlap_scheduler=not eagle3_one_model)
```

```diff
-llm = LLM("/path/to/target_model", speculative_config=speculative_config, disable_overlap_scheduler=True)
+llm = LLM("/path/to/target_model", speculative_config=speculative_config)
```

Also consider updating the YAML example (Lines 134–140) to reflect this nuance.
Also applies to: 64-66, 86-87
228-233: Typos and grammar in acceptance description.
Fix missing word and “drat” typo.

```diff
-Currently, only greedy sampling is supported for speculative decoding. A draft token is accepted if
-matches the previously decoded token exactly.
+Currently, only greedy sampling is supported for speculative decoding. A draft token is accepted if
+it matches the previously decoded token exactly.
@@
-`[t, d1, d2, d3]`, where `d1`, `d2`, and `d3` are drat tokens.
+`[t, d1, d2, d3]`, where `d1`, `d2`, and `d3` are draft tokens.
```

docs/source/examples/kvcacheconfig.md (2)
3-3: Inline code formatting: don’t include “argument” inside backticks.

````diff
-Set KV cache behavior by providing the optional ```kv_cache_config argument``` when you create the LLM engine.
+Set KV cache behavior by providing the optional `kv_cache_config` argument when you create the LLM engine.
````
3-3: Consistency: “quickstart” vs. “quick start”.
Use one form throughout (“quick start” is used elsewhere).

````diff
-Consider the quickstart example found in ```examples/pytorch/quickstart.py```:
+Consider the quick start example found in `examples/pytorch/quickstart.py`:
````

```diff
-This code disables block reuse for the quickstart example.
+This code disables block reuse for the quick start example.
```

(Apply the same spelling uniformly across the docs.)
Also applies to: 47-47
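For context, the pattern these kvcacheconfig.md comments refer to can be sketched as follows. This is a minimal illustration assuming the `tensorrt_llm` Python LLM API (`LLM`, `SamplingParams`, and `KvCacheConfig` from `tensorrt_llm.llmapi`); the model name is illustrative rather than the one used in the quick start:

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

# Pass the optional kv_cache_config argument when creating the LLM engine;
# here block reuse is disabled, mirroring what the doc describes.
kv_cache_config = KvCacheConfig(enable_block_reuse=False)
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # illustrative model
          kv_cache_config=kv_cache_config)

prompts = ["Hello, my name is"]
outputs = llm.generate(prompts, SamplingParams(max_tokens=32))
for output in outputs:
    print(output.outputs[0].text)
```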
docs/source/examples/kvcacheretentionconfig.md (2)
31-31: Clarify token-range bounds (inclusive/exclusive).
You say “first four tokens” and pass (0, 4). If end is exclusive, that’s correct; please state it explicitly to avoid ambiguity.

```diff
-Assume you know that the first four tokens of each prompt represent a system prompt ...
+Assume you know that the first four tokens of each prompt (token indices 0–3; end index is exclusive) represent a system prompt ...
```
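If the range is indeed half-open, a one-line sketch makes the bound explicit. This assumes the positional order `TokenRangeRetentionConfig(start, end, priority, duration_ms)` implied by the `(0, 4, 100, None)` call quoted in the doc:

```python
from tensorrt_llm.llmapi import KvCacheRetentionConfig

# With an exclusive end, (0, 4) covers token indices 0, 1, 2, 3 -- the "first
# four tokens" of the prompt; 100 is the documented maximum priority.
system_prompt_range = KvCacheRetentionConfig.TokenRangeRetentionConfig(0, 4, 100, None)
```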
49-56: Variable naming style in example.
Prefer snake_case for Python variables.

```diff
-    tokenRangeRetentionConfig = KvCacheRetentionConfig.TokenRangeRetentionConfig(0, 4, 100, None)
-    kv_cache_retention_config = KvCacheRetentionConfig(
-        token_range_retention_configs=[tokenRangeRetentionConfig],
+    token_range_retention_config = KvCacheRetentionConfig.TokenRangeRetentionConfig(0, 4, 100, None)
+    kv_cache_retention_config = KvCacheRetentionConfig(
+        token_range_retention_configs=[token_range_retention_config],
```

docs/source/features/kvcache.md (1)
45-46: Class name casing.
Use `KvCacheConfig` (matching the API and link target), not “KVCacheConfig”.

```diff
-Many of the features in the KV cache system are optional or have user defined properties that alter how they work. Users can control KV cache features through class [KVCacheConfig](https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.KvCacheConfig).
+Many of the features in the KV cache system are optional or have user-defined properties that alter how they work. Users can control KV cache features through class [KvCacheConfig](https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.KvCacheConfig).
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (4)
- docs/source/examples/kvcacheconfig.md (2 hunks)
- docs/source/examples/kvcacheretentionconfig.md (4 hunks)
- docs/source/features/kvcache.md (2 hunks)
- docs/source/features/speculative-decoding.md (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
docs/source/examples/kvcacheretentionconfig.md
docs/source/features/kvcache.md
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Applied to files:
docs/source/features/kvcache.md
🪛 LanguageTool
docs/source/examples/kvcacheretentionconfig.md
[grammar] ~31-~31: There might be a mistake here.
Context: ... 35 on a scale from 1 to 100, where 100 is highest priority and 1 is lowest priori...
(QB_NEW_EN)
[grammar] ~31-~31: There might be a mistake here.
Context: ...00, where 100 is highest priority and 1 is lowest priority. Assume you know that t...
(QB_NEW_EN)
docs/source/features/speculative-decoding.md
[grammar] ~188-~188: There might be a mistake here.
Context: ...ased speculation implementations do not support overlap scheduler. It will be disabled ...
(QB_NEW_EN)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
docs/source/features/kvcache.md (1)
33-34: Cross‑doc consistency: reuse + speculative decoding.
This now claims “reuse across requests is supported by all speculative decoding models.” Please verify against the current backend behavior, and align examples in speculative‑decoding.md (which still set disable_overlap_scheduler=True in two‑model examples).
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Force-pushed from 9b44c22 to 7a7f006
Actionable comments posted: 0
🧹 Nitpick comments (11)
docs/source/features/speculative-decoding.md (5)
188-188: Polish grammar: add article and hyphenation.

Suggested: “Two-model-based speculation implementations do not support the overlap scheduler; it will be disabled automatically.”
```diff
-Two-model based speculation implementations do not support overlap scheduler. It will be disabled automatically.
+Two-model-based speculation implementations do not support the overlap scheduler; it will be disabled automatically.
```
43-44: Align examples with auto-disable behavior: remove redundant flag.

Since two-model setups auto-disable overlap scheduling, drop the explicit `disable_overlap_scheduler=True` to avoid confusion.

```diff
-llm = LLM("/path/to/target_model", speculative_config=speculative_config, disable_overlap_scheduler=True)
+llm = LLM("/path/to/target_model", speculative_config=speculative_config)
```
65-66: Make EAGLE 3 snippet reflect conditional need.

Either omit the flag entirely (recommended) or show it conditionally only when using two-model. Example below removes it for clarity.
```diff
-# Only need to disable overlap scheduler if eagle3_one_model is False.
-llm = LLM("/path/to/target_model", speculative_config=speculative_config, disable_overlap_scheduler=True)
+# Two-model setups auto-disable overlap scheduling.
+llm = LLM("/path/to/target_model", speculative_config=speculative_config)
```
86-87: NGram example: drop `disable_overlap_scheduler=True`.

Auto-disable applies to two-model algorithms; keep the example minimal.
```diff
-llm = LLM("/path/to/target_model", speculative_config=speculative_config, disable_overlap_scheduler=True)
+llm = LLM("/path/to/target_model", speculative_config=speculative_config)
```
134-140: YAML sample: remove `disable_overlap_scheduler` or note it’s auto/ignored.

To reduce user confusion, either delete the key or add a brief comment that it’s auto-disabled for two-model setups.
```diff
-disable_overlap_scheduler: true
 speculative_config:
   decoding_type: Eagle
   max_draft_len: 4
   speculative_model: /path/to/draft/model
```

docs/source/examples/kvcacheconfig.md (3)
3-3: Inline code formatting: remove triple backticks and “argument” from code span.

Use single backticks and keep prose outside the code span.
````diff
-Set KV cache behavior by providing the optional ```kv_cache_config argument``` when you create the LLM engine. Consider the quickstart example found in ```examples/pytorch/quickstart.py```:
+Set KV cache behavior by providing the optional `kv_cache_config` argument when you create the LLM engine. Consider the quick start example found in `examples/pytorch/quickstart.py`:
````
31-31: Consistent terminology: “quick start” (two words).

Matches usage elsewhere in the docs.
```diff
-You can reduce this value to 0.7 by adding the following lines to the quickstart example:
+You can reduce this value to 0.7 by adding the following lines to the quick start example:
```
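The lines the doc refers to are not quoted in this review; a plausible sketch follows, assuming the value being reduced is `KvCacheConfig.free_gpu_memory_fraction`:

```python
from tensorrt_llm.llmapi import KvCacheConfig

# Assumption: "this value" is the fraction of free GPU memory the KV cache may
# claim; lowering it to 0.7 leaves more headroom for other allocations.
kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.7)
```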
39-39: Inline code formatting for class name.

Use single backticks, not triple.
````diff
-You can also set properties after you create ```KvCacheConfig```. For example:
+You can also set properties after you create `KvCacheConfig`. For example:
````

docs/source/examples/kvcacheretentionconfig.md (3)
3-3: Inline code formatting: use single backticks.

Applies to both occurrences on this line.
````diff
-You can change block priority by providing the optional ```kv_cache_retention_config``` argument when you submit a request to the LLM engine. Consider the quick start example found in ```examples/pytorch/quickstart.py```:
+You can change block priority by providing the optional `kv_cache_retention_config` argument when you submit a request to the LLM engine. Consider the quick start example found in `examples/pytorch/quickstart.py`:
````
49-56: PEP 8 naming + clarify token range bounds.
- Use snake_case for variables in Python examples.
- Please clarify whether the `end` index is inclusive or exclusive to prevent off-by-one errors.

```diff
-    tokenRangeRetentionConfig = KvCacheRetentionConfig.TokenRangeRetentionConfig(0, 4, 100, None)
-    kv_cache_retention_config = KvCacheRetentionConfig(
-        token_range_retention_configs=[tokenRangeRetentionConfig],
+    token_range_retention_config = KvCacheRetentionConfig.TokenRangeRetentionConfig(0, 4, 100, None)
+    kv_cache_retention_config = KvCacheRetentionConfig(
+        token_range_retention_configs=[token_range_retention_config],
         decode_retention_priority=35,  # Set generated tokens to default priority
         decode_duration_ms=None)
```

Follow-up: If the constructor expects a half-open range [start, end), consider adding a note like “Indices are [start, end) (end exclusive).”
68-68: Inline code formatting: single backticks.

````diff
-This example uses a single ```kv_cache_retention_config``` object for all the prompts. You can also provide a list that must have the same length as the list of prompts.
+This example uses a single `kv_cache_retention_config` object for all the prompts. You can also provide a list that must have the same length as the list of prompts.
````
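To make the single-object-versus-list point concrete, here is a sketch of the per-request usage the doc describes. It assumes `LLM.generate` accepts a `kv_cache_retention_config` keyword argument (one object applied to all prompts, or a list with the same length as `prompts`), and uses an illustrative model name:

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheRetentionConfig

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # illustrative model

# Keep the first four prompt tokens at maximum priority; generated tokens stay
# at the default priority (35), matching the values quoted in the review above.
retention = KvCacheRetentionConfig(
    token_range_retention_configs=[
        KvCacheRetentionConfig.TokenRangeRetentionConfig(0, 4, 100, None)],
    decode_retention_priority=35,
    decode_duration_ms=None)

prompts = ["System: be terse. User: hello", "System: be terse. User: goodbye"]
# A single object applies to every prompt; a list must match len(prompts).
outputs = llm.generate(prompts, SamplingParams(max_tokens=16),
                       kv_cache_retention_config=retention)
```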
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (4)
- docs/source/examples/kvcacheconfig.md (2 hunks)
- docs/source/examples/kvcacheretentionconfig.md (4 hunks)
- docs/source/features/kvcache.md (2 hunks)
- docs/source/features/speculative-decoding.md (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- docs/source/features/kvcache.md
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
docs/source/examples/kvcacheretentionconfig.md
🪛 LanguageTool
docs/source/examples/kvcacheretentionconfig.md
[grammar] ~31-~31: There might be a mistake here.
Context: ... 35 on a scale from 1 to 100, where 100 is highest priority and 1 is lowest priori...
(QB_NEW_EN)
[grammar] ~31-~31: There might be a mistake here.
Context: ...00, where 100 is highest priority and 1 is lowest priority. Assume you know that t...
(QB_NEW_EN)
docs/source/features/speculative-decoding.md
[grammar] ~188-~188: There might be a mistake here.
Context: ...ased speculation implementations do not support overlap scheduler. It will be disabled ...
(QB_NEW_EN)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
/bot run

PR_Github #17758 [ run ] triggered by Bot

PR_Github #17758 [ run ] completed with state

/bot skip --comment "docs only change"

PR_Github #17759 [ skip ] triggered by Bot

PR_Github #17759 [ skip ] completed with state
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
Cherry-pick #7382 into 1.0 branch
Summary by CodeRabbit
New Features
Documentation