-
Notifications
You must be signed in to change notification settings - Fork 1.8k
[TRTLLM-6341][chore] Preliminary refactors on the kv cache manager before supporting swa kv cache reuse #6767
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🔭 Outside diff range comments (4)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (3)
1-3
: Update copyright year to include 2025Per coding guidelines, OSS code should include the current year. Header shows 2022-2024 while the .cpp is 2025.
-/* - * Copyright (c) 2022-2024, NVIDIA CORPORATION. All rights reserved. +/* + * Copyright (c) 2022-2025, NVIDIA CORPORATION. All rights reserved.
17-17
: Use header include guards instead of#pragma once
Guideline requires preprocessor guard in headers. Keep
#pragma once
if you like, but add the guard.#pragma once +#ifndef TRTLLM_KVCACHEMANAGER_H +#define TRTLLM_KVCACHEMANAGER_H ... -} // namespace tensorrt_llm::batch_manager::kv_cache_manager +} // namespace tensorrt_llm::batch_manager::kv_cache_manager + +#endif // TRTLLM_KVCACHEMANAGER_H
1383-1396
: Fix parameter name typo:copyOnpartialReuse
→copyOnPartialReuse
Guideline: parameter names should be consistent between declaration and definition. This is also a typo.
- bool enablePartialReuse = true, bool copyOnpartialReuse = true); + bool enablePartialReuse = true, bool copyOnPartialReuse = true); ... - bool enablePartialReuse = true, bool copyOnpartialReuse = true); + bool enablePartialReuse = true, bool copyOnPartialReuse = true); ... - bool enablePartialReuse = true, bool copyOnpartialReuse = true); + bool enablePartialReuse = true, bool copyOnPartialReuse = true); ... - bool enablePartialReuse = true, bool copyOnpartialReuse = true); + bool enablePartialReuse = true, bool copyOnPartialReuse = true);Also applies to: 1403-1414
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)
131-136
: Bug: wrong multimodal start offset within block
mmStartInBlock
is reversed. For content starting inside the block, the offset should be (startPos - startTokenIdx); otherwise 0.- SizeType32 mmStartInBlock = (startPos >= startTokenIdx) ? 0 : startTokenIdx - startPos; + SizeType32 mmStartInBlock = (startPos >= startTokenIdx) ? (startPos - startTokenIdx) : 0;
🧹 Nitpick comments (5)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2)
549-550
: Document the new parameterisShareLastContextBlock
Add a short note explaining when the last context block can be shared across beams.
-//! \brief Assign blocks for new sequence. Does not try to reuse blocks. +//! \brief Assign blocks for new sequence. Does not try to reuse blocks. +//! \param isShareLastContextBlock If true, the last context block is shared across beams (e.g., when fully filled).
879-880
: Doc parity for BlockManager overloadMirror the parameter documentation here to keep interfaces consistent.
-void addSequence( - GenerationRequest& sequence, SizeType32 numContextBlocks, SizeType32 windowSize, bool isShareLastContextBlock); +//! \param isShareLastContextBlock If true, share the last context block across beams for this window size. +void addSequence(GenerationRequest& sequence, SizeType32 numContextBlocks, SizeType32 windowSize, + bool isShareLastContextBlock);cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (3)
1194-1204
: Comment improvementThese comments are helpful. Consider also noting the conditions that select each overload (reuse enabled vs disabled) to avoid drift.
1864-1881
: cacheNewBlockOffset: LGTM with a tiny guard suggestionOptional: early-continue if a beam has zero blocks to avoid
.back()
on empty vectors.- auto const blockId = beamCacheBlock.back(); + if (beamCacheBlock.empty()) continue; + auto const blockId = beamCacheBlock.back();
1949-1951
: Comment/code mismatch: temporaryAttentionWindow not usedThe comment says we consider
temporaryAttentionWindow
, buteffectiveInputLength
ignores it.
- Either remove the comment or incorporate it into the calculation (and document the rule).
- Example if intended:
effectiveInputLength = std::min(inputLength + temporaryAttentionWindow, maxTokenNum);
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
(3 hunks)cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
(13 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,h,hpp,cc,cxx}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,h,hpp,cc,cxx}
: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...
Files:
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.
Files:
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
🧠 Learnings (3)
📚 Learning: 2025-08-06T08:18:28.669Z
Learnt from: zhengd-nv
PR: NVIDIA/TensorRT-LLM#6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.
Applied to files:
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-07-22T09:22:14.726Z
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Applied to files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-09T20:57:04.067Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.067Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
🔇 Additional comments (7)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
1128-1133
: New caching APIs look goodCentralizing offset caching in BlockManager simplifies KVCacheManager. Interface and comments are clear.
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (6)
1844-1862
: cacheSequenceBlockOffsets: LGTMOffset materialization is correct and respects per-pool layout.
1918-1920
: Assertion matches PR scopeGood to hard-stop unsupported streamLLM.
1994-1995
: Offset caching on addSequenceCaching offsets at the end of sequence allocation is correct and keeps copyBlockOffsets O(1) per batch slot.
1807-1819
: Assertion aligns with streamLLM pruning; context blocks calculationUsing floor division for
numContextBlocks
is fine with streamLLM disabled. Make sure this aligns with howaddSequence
computes context blocks (ceilDiv) to avoid off-by-one in heterogeneous paths.Would you like me to generate a small unit test sketch to compare
getRemainingBlocksToCompletion
vscalculateMaxBlockRequirements
for representative inputs?
1882-1906
: Fix duplicate declaration and enforce beamWidth assertion
- Remove the duplicate
auto& sequence = getSequence(requestId);
(only one declaration needed after the sink‐check).- Replace the warning on
beamWidth > 1
with a hard assertion:TLLM_CHECK_WITH_INFO( sequence.getBeamWidth() == 1, "[KV cache manager] addToken does not support beamWidth > 1" );
- Simplify the window boundary check to:
(theif (tokensInWindow % getTokensPerBlock() == 0)tokensInWindow <= maxTokensInWindow
check is always true by construction).Proposed diff:
void KVCacheManager::addToken(RequestIdType requestId) { - auto& sequence = getSequence(requestId); TLLM_CHECK_WITH_INFO( mSinkBlockTokenLength == 0 && mSinkBubbleLength == 0, "streamLLM is not supported at the moment" ); - auto& sequence = getSequence(requestId); - if (sequence.getBeamWidth() > 1) - { - TLLM_LOG_WARNING("[KV cache manager] addToken does not support beamWidth > 1"); - } + auto& sequence = getSequence(requestId); + TLLM_CHECK_WITH_INFO( + sequence.getBeamWidth() == 1, + "[KV cache manager] addToken does not support beamWidth > 1" + ); sequence.addNewTokens(1); for (auto const [windowSize, metadata] : mBlockManager.getWindowSizesMetadata()) { SizeType32 const maxTokensInWindow = metadata.maxTokenNum; SizeType32 const tokensInWindow = sequence.getNumTokens() % maxTokensInWindow; - // NOTE: If beam search is involved, added blocks should be shared. - // Please check out TODO(file MR number here) for utility function pruned. - if (tokensInWindow % getTokensPerBlock() == 0 && tokensInWindow <= maxTokensInWindow) + // Allocate a new block on each block boundary within the window. + if (tokensInWindow % getTokensPerBlock() == 0) { mBlockManager.allocateBlock(sequence, windowSize); mBlockManager.cacheNewBlockOffset(sequence, windowSize); } } }Please confirm whether the “one extra block” behavior on exact window rollover (when
tokensInWindow == 0
) is intended. If not, change the condition to:if (tokensInWindow > 0 && tokensInWindow % getTokensPerBlock() == 0)
2319-2339
: Compile error and missing assertion in removeTokenSame issues as
addToken
: duplicatesequence
declaration and lack of assertion for beamWidth > 1.void KVCacheManager::removeToken(RequestIdType requestId) { - auto& sequence = getSequence(requestId); TLLM_CHECK_WITH_INFO( mSinkBlockTokenLength == 0 && mSinkBubbleLength == 0, "streamLLM is not supported at the moment"); - auto& sequence = getSequence(requestId); - if (sequence.getBeamWidth() > 1) - { - TLLM_LOG_WARNING("[KV cache manager] removeToken does not support beamWidth > 1"); - } + auto& sequence = getSequence(requestId); + TLLM_CHECK_WITH_INFO(sequence.getBeamWidth() == 1, "[KV cache manager] removeToken does not support beamWidth > 1"); if (sequence.getNumTokens() == 0) { return; } sequence.removeTokens(1); for (auto const [windowSize, metadata] : mBlockManager.getWindowSizesMetadata()) { SizeType32 const maxTokensInWindow = metadata.maxTokenNum; SizeType32 const tokensInWindow = sequence.getNumTokens() % maxTokensInWindow; if (tokensInWindow % getTokensPerBlock() == 0 && tokensInWindow <= maxTokensInWindow) { mBlockManager.releaseLastBlock(sequence, windowSize); } } }⛔ Skipped due to learnings
Learnt from: zhengd-nv PR: NVIDIA/TensorRT-LLM#6633 File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155 Timestamp: 2025-08-06T08:18:28.669Z Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.
0202754
to
2e74a61
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (2)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (2)
1207-1218
: Guard against underflow whennumContextBlocks == 0
If
SizeType32
is unsigned,numContextBlocks - 1
underflows. Safer to compute the shared count explicitly.- for (SizeType32 bi = 0; bi < numContextBlocks - 1; ++bi) + SizeType32 sharedCount = (numContextBlocks > 0) ? (numContextBlocks - 1) : 0; + for (SizeType32 bi = 0; bi < sharedCount; ++bi) { allocateBlock(sequence, /*shareAmongBeams=*/true); } - allocateBlock(sequence, /*shareAmongBeams=*/isShareLastContextBlock); + if (numContextBlocks > 0) { + allocateBlock(sequence, /*shareAmongBeams=*/isShareLastContextBlock); + }
1881-1884
: Use assertion instead of warning for unsupported beam-searchPR intent is to prune unsupported paths with assertions. Replace warning with TLLM_CHECK to avoid undefined behavior later in allocation paths.
- if (sequence.getBeamWidth() > 1) - { - TLLM_LOG_WARNING("[KV cache manager] Beam search is not supported at the moment"); - } + TLLM_CHECK_WITH_INFO(sequence.getBeamWidth() == 1, "[KV cache manager] Beam search is not supported at the moment");
🧹 Nitpick comments (1)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)
1829-1853
: Update TODO comment with actual MR referenceLine 1845 contains a placeholder TODO comment that should be updated with the actual merge request number before merging.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
(3 hunks)cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
(12 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,h,hpp,cc,cxx}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,h,hpp,cc,cxx}
: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...
Files:
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.
Files:
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
🧬 Code Graph Analysis (2)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (14)
addSequence
(1151-1155)addSequence
(1151-1152)addSequence
(1159-1194)addSequence
(1159-1160)addSequence
(1198-1202)addSequence
(1198-1199)addSequence
(1206-1218)addSequence
(1206-1207)addSequence
(1862-1929)addSequence
(1862-1863)cacheSequenceBlockOffsets
(1791-1809)cacheSequenceBlockOffsets
(1791-1791)cacheNewBlockOffset
(1811-1827)cacheNewBlockOffset
(1811-1811)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (3)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (7)
llmRequest
(421-450)llmRequest
(421-421)llmRequest
(872-903)llmRequest
(872-872)llmRequest
(905-912)requestId
(200-208)requestId
(200-201)cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (14)
requestId
(93-96)requestId
(93-93)requestId
(98-103)requestId
(98-99)requestId
(105-109)requestId
(105-106)requestId
(111-114)requestId
(111-111)requestId
(116-119)requestId
(116-116)requestId
(152-155)requestId
(152-152)requestId
(174-179)requestId
(174-175)cpp/tests/executor/encDecTest.cpp (1)
req
(224-225)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (7)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (3)
548-548
: Good API simplificationThe new signature with
isShareLastContextBlock
boolean parameter is clearer and more intuitive than the previousunsharedBlockIdx
approach.
865-865
: Consistent API updateThe signature change maintains consistency with WindowBlockManager's addSequence method.
1099-1103
: Good architectural refactoringMoving the offset caching methods from KVCacheManager to BlockManager improves encapsulation by placing the functionality closer to where the block data is managed.
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (4)
1149-1155
: Clear separation of block reuse pathsGood practice to have separate overloads with clear comments distinguishing the block reuse enabled vs disabled paths.
1752-1789
: Good simplification of block calculationThe removal of temporaryAttentionWindow and sink bubble length from the calculation makes the logic clearer and aligns with the PR's goal of removing streaming LLM support.
1791-1827
: Correct implementation of offset caching methodsThe new BlockManager methods properly delegate to the appropriate WindowBlockManager, maintaining good separation of concerns.
2239-2263
: Consistent implementation with addTokenThe removeToken implementation properly mirrors the addToken logic, maintaining consistency in handling block boundaries and unsupported features.
7ed955e
to
2a54634
Compare
2a54634
to
d04b8ee
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good, although I do have a few comments/questions:
-
You say StreamingLLM (i.e. sink tokens) and beam search are not supported with SWA..
1.1. What broke this support?
1.2. You changed the unittests to reflect the fact they aren't supported. I assume this means these unittests passed before your changes (i.e. with sink tokens and beam search). So how did these unittests pass if it's not supported? -
I asked in the PR comments as well.. is beam search really not supported at all? That's what it seems from the assertions you added, but you wrote in the description it's not supported only for SWA...
BTW - Great description! It was very helpful for reviewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🔭 Outside diff range comments (3)
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (3)
289-315
: Unreachable code after premature exit in TestBufferIndexAssignment1Child branch calls exit at Line 289, making the subsequent recv-buffer test (Lines 291–314) unreachable. Remove the early exit and keep the final exit at Line 315.
Apply this minimal diff:
- exit(testing::Test::HasFailure() ? 1 : 0); + // Continue with recv buffer index assignment test; final exit occurs below.
381-383
: Fix format specifier for size_t in TLLM_LOG_INFOdefaultTransSize is a size-like value; using "%d" is incorrect on LP64 and can cause UB. Use "%zu".
- TLLM_LOG_INFO("defaultTransSize: %d", defaultTransSize); + TLLM_LOG_INFO("defaultTransSize: %zu", static_cast<size_t>(defaultTransSize));
18-26
: Include required POSIX headers for fork and waitpidThis file uses fork() and waitpid() but does not include their headers explicitly. Add these to ensure correct declarations across platforms/compilers.
#include "tensorrt_llm/runtime/iTensor.h" #include <gtest/gtest.h> #include <memory> +// For fork(), waitpid() +#include <unistd.h> +#include <sys/types.h> +#include <sys/wait.h>
🧹 Nitpick comments (13)
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (1)
38-41
: Beam width forced to 1: consistent with current constraintsSetting maxBeamWidth to 1 with a clear TODO is aligned with the PR’s temporary limitation. Please add a tracking issue reference in the TODO for discoverability.
cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (2)
87-89
: Beam width TODO acknowledgedThe TODO to add coverage for beamWidth > 1 is appropriate given current limitations. Consider adding a tracking ticket reference for traceability.
1-11
: Update copyright headerPer repo guidelines, OSS source should include the current year. This file was modified; update “2023-2024” to include 2025.
cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp (4)
407-411
: Avoid double-disabling a testUsing both DISABLED_ prefix and GTEST_SKIP() is redundant. Keep a single mechanism (prefer DISABLED_ for long-term disable, or GTEST_SKIP with TODO if temporary).
-TEST_F(LlmRequestTest, DISABLED_testLastTokensSetIndependence) +TEST_F(LlmRequestTest, DISABLED_testLastTokensSetIndependence) { - // TODO: Support and add coverage for beamWidth > 1 - GTEST_SKIP(); + // TODO: Support and add coverage for beamWidth > 1 + // Disabled via DISABLED_ prefix above.
515-515
: Typo in local variable name: beamWdith → beamWidthMinor naming fix for consistency/readability.
- auto const beamWdith = std::get<3>(info.param); + auto const beamWidth = std::get<3>(info.param); ... - name += "Bw" + std::to_string(beamWdith); + name += "Bw" + std::to_string(beamWidth);
753-761
: Streaming parameter reduced to false: OK with TODOConstraining streaming values to {false} is consistent with current gating. Consider adding a reference to a tracking issue in the TODO comment.
1-11
: Update copyright headerPlease include the current year in the header (add 2025).
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (2)
456-458
: Sink-token coverage gated to 0: OK; add tracking reference and keep code ready to re-enableSwitching sinkTokenLen variants from {0, 4} to {0} with TODO is consistent with current constraints. Two suggestions:
- Add a tracking issue ID to the TODO comment so the missing 4-case coverage isn’t lost.
- Keep the plumbing (loops/param values) but guard with a single flag or a constexpr to simplify re-enabling later.
Also applies to: 511-512, 556-557, 601-602, 711-713, 837-844, 1040-1041, 1093-1095, 1187-1188, 1520-1522, 1572-1574, 1853-1855
1-11
: Update copyright headerPlease update to include 2025.
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (4)
120-123
: Beam width > 1 temporarily gated: good; add tracking referencesMultiple tests now fix beamWidth/maxBeamWidth at 1 with TODOs. This is consistent with the current limitation. Please add links to a tracking issue next to the TODOs for future re-enablement.
Also applies to: 172-177, 2372-2374, 2436-2437, 3665-3666, 3908-3911, 4151-4153, 4257-4259
3502-3507
: Avoid double-disabling and fix typo
- The test is both DISABLED_ and calls GTEST_SKIP(); remove one (prefer DISABLED_ here).
- Typo in TODO: “attetnion” → “attention”.
-TEST_P(KVCacheManagerTest, DISABLED_KVCacheManagerSinkTokenLengthTest) +TEST_P(KVCacheManagerTest, DISABLED_KVCacheManagerSinkTokenLengthTest) { - // TODO: Support sink attetnion and add coverage - // TODO: Support and add coverage for beamWidth > 1 - GTEST_SKIP(); + // TODO: Support sink attention and add coverage + // TODO: Support and add coverage for beamWidth > 1 + // Disabled via DISABLED_ prefix.
3796-3799
: Prefer test-level gating over runtime checksThis helper asserts beamWidth == 1 via TLLM_CHECK. Since callers already gate beam tests, consider replacing this with a gtest ASSERT/ASSUME inside the test or relying on the existing gating to avoid hard aborts.
- TLLM_CHECK(beamWidth == 1); + ASSERT_EQ(beamWidth, 1);
1-11
: Update copyright headerPlease update header to include 2025 as per guidelines.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
(3 hunks)cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
(10 hunks)cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
(1 hunks)cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
(13 hunks)cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
(19 hunks)cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
(1 hunks)cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp
(2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,h,hpp,cc,cxx}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,h,hpp,cc,cxx}
: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...
Files:
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.
Files:
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧬 Code Graph Analysis (2)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (14)
addSequence
(1151-1155)addSequence
(1151-1152)addSequence
(1159-1194)addSequence
(1159-1160)addSequence
(1198-1202)addSequence
(1198-1199)addSequence
(1206-1222)addSequence
(1206-1207)addSequence
(1867-1926)addSequence
(1867-1868)cacheSequenceBlockOffsets
(1799-1817)cacheSequenceBlockOffsets
(1799-1799)cacheNewBlockOffset
(1819-1835)cacheNewBlockOffset
(1819-1819)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (3)
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (2)
llmRequest
(70-70)llmRequest
(70-70)cpp/tests/batch_manager/cacheTransceiverTest.cpp (5)
llmRequest
(421-450)llmRequest
(421-421)llmRequest
(872-903)requestId
(200-208)requestId
(200-201)cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (14)
requestId
(93-96)requestId
(93-93)requestId
(98-103)requestId
(98-99)requestId
(105-109)requestId
(105-106)requestId
(111-114)requestId
(111-111)requestId
(116-119)requestId
(116-116)requestId
(152-155)requestId
(152-152)requestId
(174-179)requestId
(174-175)
🔇 Additional comments (11)
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (2)
140-141
: New BlockManager::addSequence signature usage looks correctCalls updated to pass maxAttentionWindow and the isShareLastContextBlock flag. Assumptions and expectations (occupiedBlocks and sharing behavior) are consistent with beamWidth = 1. LGTM.
Also applies to: 155-165, 168-178, 181-187
2771-2774
: Sliding-window expectations updatedThe free block counts after addToken steps align with cyclic/window semantics. Looks correct for the specified maxAttentionWindow scenario.
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (3)
548-548
: Good API simplificationThe change from
unsharedBlockIdx
toisShareLastContextBlock
makes the API clearer and more intuitive.
865-865
: Consistent API designThe BlockManager signature change mirrors the WindowBlockManager change, maintaining consistency across the API.
1099-1103
: Good architectural refactoringMoving offset caching methods to BlockManager improves separation of concerns by centralizing block-related operations in the appropriate class.
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (6)
1149-1151
: Good documentation for overloaded methodsClear comments distinguishing between block reuse enabled/disabled versions improve code maintainability.
Also applies to: 1157-1158
1799-1835
: Clean per-window routing implementationThe offset caching methods properly delegate to window-specific managers, maintaining good separation of concerns.
1837-1858
: Proper guards for unsupported featuresGood use of assertions to guard against unsupported streamLLM and beam search features, with helpful comments explaining the current limitations.
1867-1926
: Well-structured sequence addition with proper window handlingThe implementation correctly calculates effective input length per window and determines block sharing appropriately. Good assertion placement for unsupported features.
2253-2261
: Correct per-window block release logicThe implementation properly handles block release when crossing block boundaries, respecting per-window constraints.
1196-1197
: Clear method documentationComments clearly distinguish between the block reuse enabled and disabled versions of the methods.
Also applies to: 1204-1205
@tomeras91 Thank you for the swift reviewal.
The recent non-cyclic kernel changes don't support them. #6379
This has been a problem because the coverage is not full and even though the specific functionality tested in the unit tests do work as expected, overall the feature is not ready (or shall I say back to un-supported). So right now I prefer to first prune them out to not further complicate computation on SWA reuse. As long as we keep clear context and fully guard it uses, we can revisit this in the future. Also, you can find that many tests under the UT file only covers
Good question. What it seems to me now is that the KV cache manager manages all attentions with a "window size" specified. Since we are planning to support SWA KV cache reuse to grow linearly just like a full attention, the breakage for "beam search with SWA" also implies the breakage for "beam search with an infinite window, that is, full attention". Beam search was added before SWA concept was introduced into KV cache manager, and I think except some UT-s we see here, we don't have other e2e coverage of it. So as it seems not ideal now, it is borderline-acceptable for me to prune out the feature and revisit it in the near future. |
d04b8ee
to
60b1e17
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (13)
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (1)
38-40
: Constrain beamWidth to 1 for now — aligned with guarded supportLGTM. Matches PR scope. Consider linking a tracking ticket in the TODO for re-enabling beamWidth > 1 coverage to avoid lingering tech debt.
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (4)
456-458
: Sink token coverage reduced to {0} with TODO — ok; suggest centralizingReducing sinkTokenLen to {0} is consistent with temporarily guarding unsupported paths. To reduce duplication and ease re-enablement later, consider a single helper (e.g., a function or a test fixture field) that returns the currently-supported set and documents the TODO once.
Would you like a script to list remaining tests still attempting sinkTokenLen > 0 so we can ensure consistency across the suite?
Also applies to: 510-512, 555-557, 598-602, 711-713, 1039-1041, 1093-1095, 1186-1188, 1372-1374, 1520-1522, 1572-1574, 1853-1855
837-844
: Configurations pared down to supported caseTrimming to MAX_UTILIZATION with sinkTokenLen=0 and explicit expected iterations keeps this test stable during the guard period. If you intend to restore others later, consider adding a reference to a tracking issue (besides the TODO above) to avoid forgetting the removed cases.
3502-3507
: Redundant DISABLED_ and GTEST_SKIP()This test is already disabled via the DISABLED_ prefix; GTEST_SKIP() won’t execute. Drop the skip call to avoid unreachable code.
-TEST_P(KVCacheManagerTest, DISABLED_KVCacheManagerSinkTokenLengthTest) -{ - // TODO: Support sink attetnion and add coverage - // TODO: Support and add coverage for beamWidth > 1 - GTEST_SKIP(); +TEST_P(KVCacheManagerTest, DISABLED_KVCacheManagerSinkTokenLengthTest) +{ + // TODO: Support sink attention and add coverage + // TODO: Support and add coverage for beamWidth > 1
4151-4152
: Global TODOs for beamWidth>1 and sink attentionConsider linking a tracking issue ID next to these TODOs to make the intent actionable and visible in dashboards.
Also applies to: 4257-4258
cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp (2)
407-411
: Remove redundant GTEST_SKIP() from a DISABLED_ testThe DISABLED_ prefix already prevents execution. Drop GTEST_SKIP() to keep the body clean.
-TEST_F(LlmRequestTest, DISABLED_testLastTokensSetIndependence) -{ - // TODO: Support and add coverage for beamWidth > 1 - GTEST_SKIP(); +TEST_F(LlmRequestTest, DISABLED_testLastTokensSetIndependence) +{ + // TODO: Support and add coverage for beamWidth > 1
751-766
: Param space narrowed (streaming=false, beamWidth=1)LGTM. Matches the temporary guard. Consider adding the tracking issue ID to the TODO comments so this can be expanded back later with traceability.
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (6)
168-178
: Expectation updates for addSequence capacity and duplicate request IDsLGTM. The EXPECT_NO_THROW/EXPECT_THROW combinations are coherent. The inline TODO on beam width clarifies the temporary limitation.
3797-3799
: Explicit guard: TLLM_CHECK(beamWidth == 1)Reasonable defensive check for the temporary state. When re-enabling multi-beam, remember to remove this or convert to a parameterized constraint.
2362-2374
: Misc: tests narrowed to maxBeamWidth=1Multiple tests enforce maxBeamWidth=1 with TODOs. To minimize repetition, you can centralize this in the test fixture or a shared header constant so you only have to change it once when re-enabling multi-beam.
Also applies to: 2436-2437, 3665-3666
3502-3507
: Duplicate skip in DISABLED_ test (sink token length)As noted in capacitySchedulerTest, DISABLED_ already prevents execution. Consider removing GTEST_SKIP() here as well (if introduced in future diffs).
4151-4152
: Global TODOs for beamWidth>1 and sink attentionSuggest linking a tracking issue to keep this visible and actionable to the team.
Also applies to: 4257-4258
2679-2682
: Comment clarityMinor nit: “Enable cyclic kv cache for all new generated tokens” comment could be rephrased for accuracy if the window semantics are not fully cyclic for the entire sequence. Not a blocker.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
(3 hunks)cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
(10 hunks)cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
(1 hunks)cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
(13 hunks)cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
(19 hunks)cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
(1 hunks)cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,h,hpp,cc,cxx}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,h,hpp,cc,cxx}
: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...
Files:
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.
Files:
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧬 Code Graph Analysis (3)
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (1)
cpp/tests/utils/common.h (1)
beamWidth
(214-214)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (14)
addSequence
(1151-1155)addSequence
(1151-1152)addSequence
(1159-1194)addSequence
(1159-1160)addSequence
(1198-1202)addSequence
(1198-1199)addSequence
(1206-1222)addSequence
(1206-1207)addSequence
(1867-1927)addSequence
(1867-1868)cacheSequenceBlockOffsets
(1799-1817)cacheSequenceBlockOffsets
(1799-1799)cacheNewBlockOffset
(1819-1835)cacheNewBlockOffset
(1819-1819)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (3)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (5)
llmRequest
(421-450)llmRequest
(421-421)llmRequest
(872-903)requestId
(200-208)requestId
(200-201)cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (14)
requestId
(93-96)requestId
(93-93)requestId
(98-103)requestId
(98-99)requestId
(105-109)requestId
(105-106)requestId
(111-114)requestId
(111-111)requestId
(116-119)requestId
(116-116)requestId
(152-155)requestId
(152-152)requestId
(174-179)requestId
(174-175)cpp/tests/utils/common.h (1)
beamWidth
(214-214)
🔇 Additional comments (22)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (3)
548-548
: LGTM: Clear API documentation for addSequence variantsThe updated comment clearly distinguishes between the two addSequence overloads, improving API clarity.
864-865
: Verify consistent parameter naming across overloadsThe new
isShareLastContextBlock
parameter is well-named and makes the sharing behavior explicit. However, ensure this parameter is consistently used across all related methods.
1099-1104
: New offset caching methods properly centralize functionalityThe addition of
cacheSequenceBlockOffsets
andcacheNewBlockOffset
methods to BlockManager follows good separation of concerns by moving implementation-specific caching from KVCacheManager to the appropriate manager class.cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (9)
1149-1151
: Clear documentation improves code maintainabilityThe comments clearly distinguish between the two addSequence overloads and their use cases (block reuse enabled vs disabled), improving code readability.
1196-1202
: Proper delegation to per-window managersThe implementation correctly delegates to the appropriate WindowBlockManager based on windowSize, maintaining the per-window routing architecture.
1206-1222
: Block allocation logic correctly implements sharing behaviorThe implementation properly allocates blocks with the correct sharing semantics:
- Lines 1214-1217: Allocate shared blocks for all but the last context block
- Lines 1218-1221: Allocate the last block with the specified sharing behavior
1799-1817
: Offset caching properly centralized in BlockManagerThe
cacheSequenceBlockOffsets
method correctly:
- Routes to the appropriate WindowBlockManager
- Handles multiple beams and blocks
- Maintains the existing caching interface
1819-1835
: Single block offset caching implementation is correctThe
cacheNewBlockOffset
method properly handles caching for just the newest block, complementing the full sequence caching method.
1837-1858
: Proper safeguarding against unsupported featuresThe assertions correctly guard against:
- StreamLLM functionality (sink tokens)
- Beam search with width > 1
This aligns with the PR's goal of pruning unsupported paths.
1870-1885
: Consistent guard implementation in addSequenceThe same safeguarding pattern is properly applied in addSequence, ensuring consistent behavior across the API.
1911-1915
: Correct sharing logic and offset cachingThe logic properly:
- Determines sharing behavior based on cross-KV, cyclic state, and beam width
- Routes through the new per-window addSequence API
- Caches offsets using the new centralized method
2242-2259
: Complete implementation of removeToken with proper guardsThe removeToken method correctly:
- Guards against unsupported features consistently
- Implements per-window token removal logic
- Maintains the block management invariants
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (3)
2436-2437
: Force maxBeamWidth=1 in DISABLED_KVCacheManagerAllocationTestLGTM. Matches the temporary guard. No action needed beyond tracking re-enable work.
3958-3960
: Beam width explicitly set to 1 in reference/capacity testsLGTM. This keeps the math and expectations coherent while multi-beam is guarded.
Also applies to: 3969-3971
2679-2682
: Adjusted free block expectations around addTokenThe updated expectations reflect the first addToken not allocating a new block but the second one does. This aligns with the cyclic window semantics exercised by the test. LGTM.
If you want an extra sanity check, I can generate a one-off assertion helper to compute expected free blocks from test parameters.
Also applies to: 2771-2774
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (7)
120-123
: Force beamWidth=1 in BlockManagerTestLGTM. Keeps the test aligned with temporarily unsupported multi-beam behavior.
140-141
: Adopt new BlockManager::addSequence signatureCorrect use of the new 4th parameter (isShareLastContextBlock) in both false and true variants. Good coverage of both paths.
Also applies to: 155-156
2679-2682
: MaxAttentionWindow test: updated expectations after addTokenThe first addToken not changing free blocks and the second reducing by maxBeamWidth matches the intended semantics here. LGTM.
Also applies to: 2771-2774
2777-2787
: Batch test: free block accounting expectations updatedAfter addSequence + addToken, expectations reflect only shared+one leaf allocation per request. This mirrors the updated semantics. LGTM.
3956-3966
: Reference helpers (calculateMax): beamWidth set to 1 in tests*Temporarily constraining beamWidth to 1 is fine. The comments correctly mark follow-ups. No further issues.
Also applies to: 3967-3981, 3983-4007
4257-4467
: FillKvCacheAndCompleteRequests param set: constrained but representativeParameter sets cover diverse prompt/output sizes and windows with reuse on/off, while keeping beamWidth = 1. Good coverage given current constraints.
Also applies to: 4469-4548
3797-3854
: neededBlocksOneStep tests: guard + expectationsUsing TLLM_CHECK(beamWidth == 1) and validating used blocks against getNeededBlocksOneStep is coherent and keeps the test meaningful even with the constraint. LGTM.
Also applies to: 3861-3891
While it is true that we can think of full attention is SWA with infinite window, there is still an important difference. If the window size is equal to the maximum sequence length supported by the model, we're not actually in SWA. This is a fundamentally different scenario than actual SWA where the attention window is smaller than the maximum sequence length. Beam search is a pretty widely used feature and I would be very surprised if it is currently broken on
|
PR_Github #14823 [ run ] completed with state |
/bot run --disable-fail-fast |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🔭 Outside diff range comments (1)
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1)
295-297
: Bug: wrong container used in EXPECT_NE for disagg-gen-init scheduled setThe check for scheduledDisaggGenInitRequests erroneously searches in scheduledRequests, making the assertion ineffective.
Apply this diff to fix the container lookup:
- EXPECT_NE( - scheduledRequests.find(scheduledReqState.mRequestId), scheduledDisaggGenInitRequests.end()) + EXPECT_NE( + scheduledDisaggGenInitRequests.find(scheduledReqState.mRequestId), + scheduledDisaggGenInitRequests.end())
🧹 Nitpick comments (11)
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (11)
452-453
: Track the narrowed sink-token coverage with a concrete referenceAcknowledged narrowing to sinkTokenLen = 0. Consider linking the TODO to the root-cause PR/issue (e.g., PR #6379 context) so it’s actionable when re-enabling.
504-508
: Cross-blocks test: same sink-token gating commentSame note: adding an issue/PR reference to the TODO will make the intent traceable.
551-552
: Lora duplicate task: add reference to TODOLink the TODO to the tracked breakage for easier re-enablement later.
596-598
: Lora doesn’t fit: add reference to TODOSame as above; a ticket/PR reference helps future follow-up.
707-709
: Max utilization: add reference to TODOPlease link to the tracked item for sinkTokenLen > 0 so this doesn’t get lost.
1090-1091
: Cross-blocks negative-fit: add reference to TODOSame suggestion: link the TODO to the underlying issue/PR.
1182-1184
: Adding new requests (max util.): add reference to TODOConsider attaching an issue/PR ID to the TODO for sink tokens.
1368-1370
: Adding new requests with priorities: add reference to TODOSame note about linking TODOs for clarity.
1516-1518
: Guaranteed completion (adding new requests): add reference to TODOPlease link TODO to a tracking item.
1568-1570
: Guaranteed completion in chunk: add reference to TODOSame recommendation to link TODO to an issue/PR.
1849-1851
: Static batch: add reference to TODOSame note here; a concrete reference will help re-enable coverage later.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
(13 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.780Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
📚 Learning: 2025-08-15T06:46:54.853Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Applied to files:
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1)
833-841
: Thanks for documenting removed configurationsThe added TODO and explicit “Removed configuration” notes make the coverage contraction clear. This addresses prior tracking concerns.
PR_Github #15413 [ run ] triggered by Bot |
PR_Github #15412 [ run ] completed with state |
PR_Github #15413 [ run ] completed with state |
5b847f4
to
2add619
Compare
/bot run --disable-fail-fast |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
872-879
: Nice docstring addition for the new overloadThe parameter docs clarify intent and match implementation in the .cpp. This addresses the prior request to document the new overload.
🧹 Nitpick comments (7)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (7)
1157-1194
: Reuse path looks correct; minor guard optionalThe reuse variant ignores the last token for hashing and pads an empty block when needed, which aligns with prefill semantics. If you want an early fail, consider asserting inputLength > 0 at function entry, though callers likely already ensure it.
Would you like me to add a TLLM_CHECK guard at entry?
1247-1251
: Nit: clarify the boolean literal with a named-argument commentFor consistency with nearby calls that use /shareAmongBeams=/ comments, consider annotating this call as well.
Apply this minimal change for readability:
- mWindowBlockManagers.at(windowSize).allocateBlock(sequence, false); + mWindowBlockManagers.at(windowSize).allocateBlock(sequence, /*shareAmongBeams=*/false);
1847-1866
: Use BlockManager::setOffsets wrapper for consistencyThese new helpers are great. You already provide BlockManager::setOffsets(..., windowSize); prefer using it instead of reaching into mWindowBlockManagers directly. Keeps the abstraction clean and consistent across the file.
Apply:
- mWindowBlockManagers.at(windowSize).setOffsets(offsetsPtr, offsetsShape, beamIdx, blockIdx, blockId); + setOffsets(offsetsPtr, offsetsShape, beamIdx, blockIdx, blockId, windowSize);
1867-1884
: Same as above: route via BlockManager::setOffsetsMirror the change here to use the existing wrapper.
- mWindowBlockManagers.at(windowSize).setOffsets(offsetsPtr, offsetsShape, beamIdx, blockIdx, blockId); + setOffsets(offsetsPtr, offsetsShape, beamIdx, blockIdx, blockId, windowSize);
1885-1901
: Same as above: route via BlockManager::setOffsetsMirror the change here to use the existing wrapper.
- mWindowBlockManagers.at(windowSize).setOffsets(offsetsPtr, offsetsShape, beamIdx, blockIdx, blockId); + setOffsets(offsetsPtr, offsetsShape, beamIdx, blockIdx, blockId, windowSize);
1902-1927
: Comment vs behavior clarity for beam search and SWAThe comment says “shared block is replaced with unshared ones,” but in this PR the addToken path allocates unshared blocks by design. replaceSharedBlock becomes a no-op unless a shared block exists from prefill/full-attention flows. Consider clarifying the comment to avoid confusion.
- // For beam search, shared block is replaced with unshared ones + // For beam search: if a previously shared context block exists (e.g. from prefill/full-attention), + // replace it with unshared ones when sliding past the window. No-op if blocks are already unshared.
2311-2330
: Simplify condition for releasing last block on token removaltokensInWindow is modulo maxTokensInWindow, so “<= maxTokensInWindow” is always true. You can simplify the condition.
- if (tokensInWindow % getTokensPerBlock() == 0 && tokensInWindow <= maxTokensInWindow) + if (tokensInWindow % getTokensPerBlock() == 0) { mBlockManager.releaseLastBlock(sequence, windowSize); }
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
(3 hunks)cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
(10 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
🧠 Learnings (6)
📓 Common learnings
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.780Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
📚 Learning: 2025-08-15T06:46:54.853Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Applied to files:
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-14T21:04:50.208Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-15T06:46:53.780Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.780Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Applied to files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-13T16:20:37.938Z
Learnt from: dcampora
PR: NVIDIA/TensorRT-LLM#6867
File: tensorrt_llm/_torch/pyexecutor/sampler.py:67-72
Timestamp: 2025-08-13T16:20:37.938Z
Learning: In TensorRT-LLM sampler code, performance is prioritized over additional validation checks. The beam_width helper method intentionally returns the first request's beam_width without validating consistency across all requests to avoid performance overhead from iterating through the entire batch.
Applied to files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (7)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (5)
1149-1156
: Good separation: per-window reuse routing moved into BlockManager/WindowBlockManagerThe split clearly delineates the reuse path and delegates appropriately to the per-window manager. No issues spotted.
1196-1203
: API simplification: non-reuse addSequence wrapper is cleanThe new signature with isShareLastContextBlock is straightforward and matches the updated call sites.
1204-1219
: Edge cases handled: single-block and unsigned underflow are guarded
- numContextBlocks > 0 asserted
- Loop allocates numContextBlocks - 1 shared blocks, then allocates the last with the caller’s sharing choice
This addresses the earlier underflow/single-block concerns.
1959-1987
: Non-reuse path logic is sound; sharing rule reads clearlyThe isShareLastContextBlock decision captures cross-KV, cyclic single-beam, and “full last block” cases. The subsequent per-window offset update centralizes the canonical offsets state. Looks good.
1651-1653
: Intentional guard: streamLLM (sink attention) disabled via constructor assertionThis matches PR objectives to prune unsupported paths now. Good fail-fast location.
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2)
554-555
: Header signature reflects the non-reuse path accuratelyGood to expose the simplified non-reuse addSequence with explicit isShareLastContextBlock.
1114-1122
: Public, window-aware offset update APIs are clear and cohesiveThe trio of updateSequenceCacheBlockOffsets/updateLastCacheBlockOffsets/updateCacheBlockOffsetsAtIdx improves readability of callers and encapsulates per-window routing.
PR_Github #15575 [ run ] triggered by Bot |
PR_Github #15575 [ run ] completed with state |
…nager::updateToken` into `addToken` and `removeToken` Given that streamLLM is broken now, removed related code logic and added assertion to guard for sinkAttention settings. This allows us to revisit the computation in the future. Main reasons of this change: Token addition and removal logic should be decoupled. Flattens callstack and nested-if. std::swap use is bizarre. Broken features should be guarded safely. Rename cache offset bookkeeping utility for clearance. These utilities are pushed downwards from `KVCacheManager` to `BlockManager`. - cacheBlockOffsets` --> `updateSequenceCacheBlockOffsets` - cacheNewBlockOffsets` --> `updateLastCacheBlockOffsets` - updateNewBlockPointer` --> `updateCacheBlockOffsetsAtIdx` Comments have been added to test cases for reminder of future test coverage on feature support. Signed-off-by: eopXD <yuehtingc@nvidia.com>
…e context to distinguish `BlockManager::addSequence` and `WindowBlockManager::addSequence` Signed-off-by: eopXD <yuehtingc@nvidia.com>
…hared last context block logic under `KVCacheManager::addSequence` The last context block is shared when: - Operating on a cyclic kv cache and with no beam search - Operating on cross kv cache - Last context block is full Prune incorrect comment left under `KVCacheManager::getNeededBlocksOneStep`. No need to complicate things. Signed-off-by: eopXD <yuehtingc@nvidia.com>
…ache block allocation on token addition for cross kv manager For an encode-decoder model, generation tokens are added only to the decoder and not the encoder. Signed-off-by: eopXD <yuehtingc@nvidia.com>
2add619
to
7e56c2c
Compare
/bot run --disable-fail-fast |
PR_Github #15697 [ run ] triggered by Bot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (2)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
872-879
: Doc added for BlockManager::addSequence is on point; thanks.This addresses the earlier request to document the new overload and its parameters.
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (1)
2664-2664
: BeamWidth>1 and sink-attention TODOs: consider explicit test gating.As previously suggested, prefer GTEST_SKIP() guards (with a short rationale) over TODO comments so CI visibly tracks feature gaps without marking tests “DISABLED” or leaving TODOs. That said, keeping current coverage intact is fine for now.
Also applies to: 4136-4138, 4242-4244
🧹 Nitpick comments (5)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
553-555
: Clarify parameter semantics in WindowBlockManager::addSequence (no-reuse path).The brief says “Does not try to reuse blocks” but the signature adds isShareLastContextBlock. To avoid confusion, clarify that “sharing” here is intra-request (across beams), not cross-request reuse.
Apply this diff to extend the comment:
- //! \brief Assign blocks for new sequence. Does not try to reuse blocks. + //! \brief Assign blocks for a new sequence without cross-request reuse. + //! \param sequence The GenerationRequest to process. + //! \param numContextBlocks Number of context blocks to allocate per beam. + //! \param isShareLastContextBlock If true, share the last context block across beams (not across requests). void addSequence(GenerationRequest& sequence, SizeType32 numContextBlocks, bool isShareLastContextBlock);cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (1)
3497-3501
: Disabled sink-token-length test is an acceptable interim gate.Given sink-attention is temporarily unsupported, disabling the suite is reasonable. When re-enabling, prefer parameterized GTEST_SKIP() over DISABLED_ to keep compilation paths active and intent visible.
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (3)
1157-1194
: Guard against inputLength == 0 to avoid unsigned underflow
inputLength - 1
underflows ifinputLength == 0
(SizeType32 is unsigned). Add a precondition or compute safely.Apply this diff:
void WindowBlockManager::addSequence( GenerationRequest& sequence, SizeType32 inputLength, SizeType32 numContextBlocks, LlmRequest& llmRequest) { + TLLM_CHECK_WITH_INFO(inputLength > 0, "inputLength must be greater than 0"); auto const requestId = sequence.getRequestId(); auto const [seqIt, emplaceDone] = mAllocatedBlocksPerSeq.emplace(requestId, std::vector<BlockPtr>{}); TLLM_CHECK(emplaceDone); @@ - // Ignore last token because it can't be recovered - auto blockedUniqueTokens = chopVectorIntoBlocks<UniqueToken>(uniqueTokens, inputLength - 1, mTokensPerBlock, true); + // Ignore last token because it can't be recovered + auto blockedUniqueTokens = chopVectorIntoBlocks<UniqueToken>(uniqueTokens, inputLength - 1, mTokensPerBlock, true);
1651-1652
: Constructor guard for streamLLM is good; consider richer diagnosticAsserting once in the ctor is the right place. Consider including the resolved sink lengths in the message for faster diagnostics.
Apply this diff:
- TLLM_CHECK_WITH_INFO(mSinkBlockTokenLength == 0 && mSinkBubbleLength == 0, - "[kv cache manager] streamLLM is not supported at the moment"); + TLLM_CHECK_WITH_INFO(mSinkBlockTokenLength == 0 && mSinkBubbleLength == 0, + "[kv cache manager] streamLLM is not supported at the moment (sinkBlockTokenLength=%d, sinkBubbleLength=%d)", + static_cast<int>(mSinkBlockTokenLength), static_cast<int>(mSinkBubbleLength));
2313-2330
: removeToken: redundant condition and minor simplification
tokensInWindow <= maxTokensInWindow
is always true by construction. Can be dropped for clarity.Apply this diff:
- SizeType32 const maxTokensInWindow = metadata.maxTokenNum; - SizeType32 const tokensInWindow = sequence.getNumTokens() % maxTokensInWindow; - if (tokensInWindow % getTokensPerBlock() == 0 && tokensInWindow <= maxTokensInWindow) + SizeType32 const maxTokensInWindow = metadata.maxTokenNum; + SizeType32 const tokensInWindow = sequence.getNumTokens() % maxTokensInWindow; + if (tokensInWindow % getTokensPerBlock() == 0) { mBlockManager.releaseLastBlock(sequence, windowSize); }
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
(3 hunks)cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
(10 hunks)cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
(13 hunks)cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
(11 hunks)cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
(1 hunks)cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp
(1 hunks)
✅ Files skipped from review due to trivial changes (1)
- cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
🚧 Files skipped from review as they are similar to previous changes (2)
- cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
- cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧠 Learnings (6)
📓 Common learnings
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.780Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
📚 Learning: 2025-08-15T06:46:54.853Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Applied to files:
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-14T21:04:50.208Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-15T06:46:53.780Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.780Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Applied to files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-13T16:20:37.938Z
Learnt from: dcampora
PR: NVIDIA/TensorRT-LLM#6867
File: tensorrt_llm/_torch/pyexecutor/sampler.py:67-72
Timestamp: 2025-08-13T16:20:37.938Z
Learning: In TensorRT-LLM sampler code, performance is prioritized over additional validation checks. The beam_width helper method intentionally returns the first request's beam_width without validating consistency across all requests to avoid performance overhead from iterating through the entire batch.
Applied to files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (11)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
1114-1122
: BlockManager offset APIs implemented and legacy helpers removed
- Confirmed implementations in
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
:
•BlockManager::updateSequenceCacheBlockOffsets
at line 1847
•BlockManager::updateLastCacheBlockOffsets
at line 1867
•BlockManager::updateCacheBlockOffsetsAtIdx
at line 1885- Verified usage in
KVCacheManager
at lines 1916 and 1986- Deprecated helpers (
cacheBlockOffsets
,cacheNewBlockOffsets
,updateNewBlockPointer
,updateToken
) are no longer referencedcpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (3)
139-139
: Signature migration to BlockManager::addSequence w/ isShareLastContextBlock looks correct.
- The false/true toggles for sharing last context block align with the updated API and the occupancy assertions.
- Duplicate/oom-path EXPECT_THROW checks still capture the intended failure cases.
Also applies to: 154-154, 168-173, 176-182
2415-2415
: Comment-only changes (Full attention).No action needed.
Also applies to: 3645-3645
2674-2674
: Updated free-block accounting after addToken matches the new sharing semantics.The expectations reflect the decoupled “shared context + per-beam decode” model. Looks good.
Also applies to: 2767-2769
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (7)
1149-1155
: Per-window delegation for reuse path looks goodClear separation of the reuse-enabled overload with explicit window routing. No issues.
1196-1203
: Non-reuse addSequence (BlockManager) API simplification is clearThe explicit
isShareLastContextBlock
flag removes hidden sharing semantics. Delegation per window is correct.
1204-1219
: Single-block case and unsigned safety handled
- Validates
numContextBlocks > 0
before looping withnumContextBlocks - 1
.- Always allocates the last block honoring
isShareLastContextBlock
.Looks correct after the earlier discussion.
1847-1865
: Per-window offsets update: good layeringRouting via
BlockManager::updateSequenceCacheBlockOffsets
to per-windowsetOffsets
improves modularity and keeps window concerns localized. No issues.
1867-1883
: Targeted last-block offsets update is correctOnly computes offsets for the last block across beams, avoiding redundant work. Delegation to window manager is consistent.
1885-1900
: Index-specific offsets update is correctThe index-based update delegates per-window and iterates beams. API shape and usage look consistent.
1939-1987
: Non-reuse addSequence: sharing rule simplified and per-window offsets updated
- The
isShareLastContextBlock
condition is explicit and easier to reason about.- Offsets are updated per window after allocation.
This aligns with the PR goal of simplifying non-reuse flows.
PR_Github #15697 [ run ] completed with state |
Just here to say I'm looking forward to this landing! |
Description
This MR is a preliminary MR for implementing the SWA reuse mechanism for the kv cache manager. Please be aware that no functional change is intended in this merge request. The purpose of the clean-up is to decouple and remove existing functions for the up-coming SWA KV cache reuse change to be more natural and easier to review.
Right now, (1) streamLLM, and (2) beam search with SWA, are broken. We do not want to complicate the code base by stacking more features upon something that does not work. This MR prunes out the logic and add assertions so we can come back and re-support the broken feature and remove the assertion.
Since streamLLM (sink attention) is broken now, assertion is added under
KVCacheManager
ctor to guard for the value ofmSinkBlockTokenLength
andmSinkBubbleLength
. Compute logics relate to it are pruned.The beam search with SWA will still be broke when introducing the SWA KV cache reuse. We will revisit this problem in the future.
On top of this, we should make an effort to update the supporting matrix of the kv cache manager after merging the support of SWA KV cache reuse.
Changes are listed as following:
KVCacheManager::updateToken
intoKVCacheManager::addToken
andKVCacheManager::removeToken
. The functionality should be decoupled.cacheSequenceBlockOffsets
andcacheNewBlockOffset
fromKVCacheManager
down toWindowBlockManager
.KVCacheManager
-exposed functions should be real utilities that users of the structure can leverage. Implementation-detailed function calls should not exist at this level.KVCacheManager::addSequence
.Test Coverage
Since no functional change is intended in this merge request, no test case is added. Several comments are added for future test coverage reminder.
For
LlmRequestTest.ParamTest
,streaming=True
is commented out because we guard sink attention with assertion now.In
capacitySchedulerTest
,addToken
action to crossKVCacheManager is removed because in encoder-decoder model, generation tokens are added only to the decoder and not to the encoder.Summary by CodeRabbit
New Features
Refactor
Behavior Changes
Tests