[TRTLLM-7096][infra] Testing cache transmission functionality in Python #7025
📝 Walkthrough

Adds a new pytest test module that verifies kv-cache transceiver behavior in a single-process scenario across three dtype pairs (FP8, FP16, BF16) and two attention modes; it includes helpers to create and seed KVCacheManager instances, a dtype fixture, and a parameterized test that runs a context-only request followed by a generation-only request and compares the final buffers.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant PT as Pytest
    participant Ctx as KVCacheManager (Context)
    participant Gen as KVCacheManager (Generation)
    Note over PT: Parameterize over dtype pairs (FP8, FP16, BF16) and attention types (DEFAULT, MLA)
    PT->>Ctx: create_kv_cache_manager(mapping, dtype_ctx)
    PT->>Gen: create_kv_cache_manager(mapping, dtype_gen)
    PT->>Ctx: fill_kv_cache_buffer(Ctx)
    rect rgb(235,245,255)
        note right of Ctx: Context-only phase
        PT->>Ctx: configure & run context-only request
        Ctx-->>PT: updated kv-cache buffers
    end
    rect rgb(235,255,235)
        note right of Gen: Generation-only phase
        PT->>Gen: configure & run generation-only request
        Gen-->>PT: updated kv-cache buffers
    end
    PT->>PT: compare buffers
    Note over PT: Assert Ctx buffers == Gen buffers
```
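For orientation, here is a condensed sketch of the flow the diagram describes, assembled from the helper and API names quoted in the review comments below (`create_kv_cache_manager`, `fill_kv_cache_buffer`, `create_kv_cache_transceiver`, the context-only and generation-only `LlmRequest` types). Treat it as an illustration of the test's shape rather than the exact file contents; import paths and the `LlmRequestType`/`AttentionTypeCpp` locations are assumptions based on the review diffs.

```python
import torch
import tensorrt_llm
from tensorrt_llm._torch.pyexecutor.kv_cache_transceiver import \
    create_kv_cache_transceiver
from tensorrt_llm._torch.pyexecutor.llm_request import LlmRequest
from tensorrt_llm.mapping import Mapping
from tensorrt_llm.sampling_params import SamplingParams

# LlmRequestType is assumed to be importable from the bindings; see the review
# diffs for the exact symbols used in the real test.


def run_single_process_transfer(kv_cache_dtype, attention_type,
                                cache_transceiver_config):
    """Send a context-only request, receive it as generation-only, compare buffers."""
    mapping = Mapping(world_size=1, rank=0)
    # create_kv_cache_manager / fill_kv_cache_buffer are helpers defined in the
    # test module itself (quoted in the review below); not reproduced here.
    kv_cache_manager_ctx = create_kv_cache_manager(mapping, kv_cache_dtype)
    kv_cache_manager_gen = create_kv_cache_manager(mapping, kv_cache_dtype)
    fill_kv_cache_buffer(kv_cache_manager_ctx)  # seed context cache with known data

    transceiver_ctx = create_kv_cache_transceiver(
        mapping, kv_cache_manager_ctx, attention_type, cache_transceiver_config)
    transceiver_gen = create_kv_cache_transceiver(
        mapping, kv_cache_manager_gen, attention_type, cache_transceiver_config)

    sampling_config = tensorrt_llm.bindings.SamplingConfig(
        SamplingParams()._get_sampling_config())

    # Context-only phase: register the sequence and send its kv-cache.
    ctx_request = LlmRequest(
        request_id=0, max_new_tokens=1, input_tokens=list(range(256)),
        sampling_config=sampling_config, is_streaming=False,
        llm_request_type=LlmRequestType.LLMREQUEST_TYPE_CONTEXT_ONLY)
    kv_cache_manager_ctx.impl.add_sequence(ctx_request.py_request_id,
                                           ctx_request.prompt_len, 1, ctx_request)
    transceiver_ctx.respond_and_send_async(ctx_request)

    # Generation-only phase: request the kv-cache using the context-phase params.
    gen_request = LlmRequest(
        request_id=1, max_new_tokens=1, input_tokens=list(range(256)),
        sampling_config=sampling_config, is_streaming=False,
        llm_request_type=LlmRequestType.LLMREQUEST_TYPE_GENERATION_ONLY,
        context_phase_params=ctx_request.context_phase_params)
    kv_cache_manager_gen.impl.add_sequence(gen_request.py_request_id,
                                           gen_request.prompt_len, 1, gen_request)
    transceiver_gen.request_and_receive_async(gen_request)

    # Wait for both sides to report transfer completion, then compare buffers.
    transceiver_ctx.check_context_transfer_status(1)
    transceiver_gen.check_gen_transfer_status(1)
    assert torch.equal(kv_cache_manager_gen.get_buffers(0),
                       kv_cache_manager_ctx.get_buffers(0))
```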
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 1
🧹 Nitpick comments (8)
tests/unittest/_torch/test_cache_transceiver.py (8)
6-7: Prefer module-namespace import for consistency.

Follow the guideline to keep module namespaces in imports.

```diff
-from tensorrt_llm._torch.pyexecutor.kv_cache_transceiver import \
-    create_kv_cache_transceiver
+from tensorrt_llm._torch.pyexecutor import kv_cache_transceiver as kv_tx
```

Outside this hunk, update call sites to `kv_tx.create_kv_cache_transceiver(...)` (see a separate suggestion below).
61-66: Update call sites after switching to module-namespace import.

Use the aliased module to invoke the factory function.

```diff
-        self.kv_cache_transceiver_0 = create_kv_cache_transceiver(
+        self.kv_cache_transceiver_0 = kv_tx.create_kv_cache_transceiver(
             self.mapping, self.kv_cache_manager_0, AttentionTypeCpp.DEFAULT,
             cache_transceiver_config)
-        self.kv_cache_transceiver_1 = create_kv_cache_transceiver(
+        self.kv_cache_transceiver_1 = kv_tx.create_kv_cache_transceiver(
             self.mapping, self.kv_cache_manager_1, AttentionTypeCpp.DEFAULT,
             cache_transceiver_config)
```
20-36: Minor: use the DataType alias consistently.

You already assigned `DataType = tensorrt_llm.bindings.DataType`; use it here for readability.

```diff
-            dtype=tensorrt_llm.bindings.DataType.FLOAT,
+            dtype=DataType.FLOAT,
```
37-46: Remove unused helper.

`_create_requests` is not referenced and can be removed to reduce noise.

```diff
-    def _create_requests(self, request_ids):
-        requests = []
-        for request_id in request_ids:
-            requests.append(
-                LlmRequest(
-                    request_id=request_id,
-                    max_new_tokens=1,
-                ))
-        return requests
```
47-51: Delete commented-out stubs.

These TODO stubs add clutter without value.

```diff
-    #def _get_context_phase_params(self):
-    #return None
-
-    #def _prepare_
```
79-83: Async send: ensure readiness before consuming context-phase data.

`respond_and_send_async` suggests asynchronous behavior. If `request.context_phase_params` is populated only after the send progresses, consuming it immediately may be racy.

Consider gating on a status/ready check (or waiting API) before constructing the generation-only request, e.g., poll a readiness predicate with a reasonable timeout. If the API guarantees immediate population, please ignore.
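One possible shape for such a guard, sketched with a hypothetical `wait_until` helper (not an existing API in the codebase); the only assumption is that `context_phase_params` becomes non-`None` once the send has progressed far enough:

```python
import time


def wait_until(predicate, timeout_s=10.0, interval_s=0.05):
    """Poll `predicate` until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval_s)
    return False


# Hypothetical usage after respond_and_send_async(request): gate on the
# context-phase params being populated before building the generation-only request.
assert wait_until(lambda: request.context_phase_params is not None), \
    "context_phase_params was not populated within the timeout"
```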
98-101: Rename variables and replace prints with assertions.

Use meaningful names and assertions in tests. Avoid printing in unit tests.

```diff
-        wtf0 = self.kv_cache_manager_0.get_cache_indices(request)
-        wtf1 = self.kv_cache_manager_1.get_cache_indices(request_1)
-        print(wtf0)
-        print(wtf1)
+        cache_indices_0 = self.kv_cache_manager_0.get_cache_indices(request)
+        cache_indices_1 = self.kv_cache_manager_1.get_cache_indices(request_1)
+        self.assertIsNotNone(cache_indices_0)
+        self.assertIsNotNone(cache_indices_1)
```
68-78: Expose a public SamplingParams → SamplingConfig conversion.

There isn't a non-underscored method on `SamplingParams` for producing a `tllme.SamplingConfig`, so tests today must call the private `_get_sampling_config()`. To avoid future breakage if internals change, please:

- In `tensorrt_llm/sampling_params.py`, add a public alias, e.g.:

  ```python
  def get_sampling_config(self) -> tllme.SamplingConfig:
      return self._get_sampling_config()
  ```

- Update tests (e.g. `tests/unittest/_torch/test_cache_transceiver.py`) to use `get_sampling_config()` instead of `_get_sampling_config()`.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
`tests/unittest/_torch/test_cache_transceiver.py` (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tests/unittest/_torch/test_cache_transceiver.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tests/unittest/_torch/test_cache_transceiver.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tests/unittest/_torch/test_cache_transceiver.py (2)
53-67: Verify Mapping usage for multi-rank cache transceiver.

The existing test instantiates both transceivers on rank 0 (world_size=2, tp_size=2). In practice, to exercise inter-rank communication you would use rank 0 and rank 1. However, our multi-GPU CLI test (tests/unittest/llmapi/test_llm_multi_gpu.py) already verifies `Mapping(world_size=tp_size, tp_size, rank=…)` across ranks 0 and 1. If the intent of `test_cache_transceiver.py` is to validate loopback on a single rank, keeping both at rank 0 is acceptable. Otherwise, change to use `Mapping(..., rank=0)` and `Mapping(..., rank=1)`.

Action items:

- If loopback behavior is intended (no inter-rank path), add a comment in `test_cache_transceiver.py` clarifying why both mappings use rank 0.
- If inter-rank paths should be tested here, update:
  - Line 53: `self.mapping_0 = Mapping(world_size=2, tp_size=2, rank=0)`
  - Line 54: `self.mapping_1 = Mapping(world_size=2, tp_size=2, rank=1)`
  - Pass `mapping_0`/`mapping_1` into their respective transceiver calls (see the sketch below).
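If the inter-rank path is the goal, the setup could look like the following sketch (same `create_kv_cache_transceiver` signature as the existing test; only the mappings change):

```python
# Sketch: distinct ranks so the transfer crosses a rank boundary instead of
# looping back on rank 0. Everything else stays as in the existing test.
self.mapping_0 = Mapping(world_size=2, tp_size=2, rank=0)
self.mapping_1 = Mapping(world_size=2, tp_size=2, rank=1)

self.kv_cache_transceiver_0 = create_kv_cache_transceiver(
    self.mapping_0, self.kv_cache_manager_0, AttentionTypeCpp.DEFAULT,
    cache_transceiver_config)
self.kv_cache_transceiver_1 = create_kv_cache_transceiver(
    self.mapping_1, self.kv_cache_manager_1, AttentionTypeCpp.DEFAULT,
    cache_transceiver_config)
```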
1-1: Add the NVIDIA copyright header.

Per repo guidelines, prepend the current-year NVIDIA copyright header to all source files.

```diff
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
 import unittest
```

⛔ Skipped due to learnings

Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
d85c858 to 5f9f34a
5f9f34a to df03fc7
@Shunkangz please help review, thanks
Actionable comments posted: 1
🧹 Nitpick comments (12)
tests/unittest/_torch/test_kv_cache_transceiver.py (12)
7-12: Prefer module-qualified imports to comply with the "maintain module namespace" guideline.

Avoid importing classes/functions directly; import the module and qualify at use sites.

Apply this diff:

```diff
-from tensorrt_llm._torch.pyexecutor.kv_cache_transceiver import \
-    create_kv_cache_transceiver
-from tensorrt_llm._torch.pyexecutor.llm_request import LlmRequest
-from tensorrt_llm._torch.pyexecutor.resource_manager import KVCacheManager
-from tensorrt_llm.mapping import Mapping
-from tensorrt_llm.sampling_params import SamplingParams
+import tensorrt_llm._torch.pyexecutor.kv_cache_transceiver as kv_cache_transceiver
+import tensorrt_llm._torch.pyexecutor.llm_request as llm_request
+import tensorrt_llm._torch.pyexecutor.resource_manager as resource_manager
+import tensorrt_llm.mapping as mapping_mod
+import tensorrt_llm.sampling_params as sampling_params_mod
```

Follow-up use-site changes are suggested below.
19-33: Qualify KVCacheManager with its module after the import refactor.

Apply this diff:

```diff
 def create_kv_cache_manager(mapping, dtype):
-    return KVCacheManager(
+    return resource_manager.KVCacheManager(
         trtllm.KvCacheConfig(
             max_tokens=256,
             enable_block_reuse=False,
         ),
         tensorrt_llm.bindings.internal.batch_manager.CacheType.SELF,
         num_layers=1,
         num_kv_heads=1,
         head_dim=1,
         tokens_per_block=8,
         max_seq_len=256,
         max_batch_size=1,
         mapping=mapping,
         dtype=dtype)
```
45-52: Parameterization looks good; consider guarding FP8 if unsupported on the runner.

FP8 may not be available on all CI GPU SKUs. Consider a conditional skip for FP8.

Example (if you want to keep it self-contained):

```diff
 @pytest.mark.timeout(120)
 @pytest.mark.parametrize("kv_cache_dtype",
                          [DataType.FP8, DataType.HALF, DataType.BF16],
                          ids=["fp8", "fp16", "bf16"])
 @pytest.mark.parametrize("attention_type",
                          [AttentionTypeCpp.DEFAULT, AttentionTypeCpp.MLA],
                          ids=["default_attention", "mla_attention"])
 def test_kv_cache_transceiver_single_process(kv_cache_dtype, attention_type):
+    if not torch.cuda.is_available():
+        pytest.skip("CUDA is required for KV cache transceiver test")
+    if kv_cache_dtype == DataType.FP8:
+        # TODO: Replace with a real runtime capability check if available.
+        # FP8 typically requires Hopper or later (e.g., H100).
+        major, minor = torch.cuda.get_device_capability()
+        if major < 9:
+            pytest.skip("Skipping FP8 on non-Hopper GPU")
```

If the project exposes a proper "is_dtype_supported" check, prefer that over capability heuristics.
53-57: Add a CUDA availability guard at the start of the test.

This avoids failures on CPU-only runners.

Apply this diff:

```diff
 def test_kv_cache_transceiver_single_process(kv_cache_dtype, attention_type):
     # Init kv_cache manager and cache transceiver
+    if not torch.cuda.is_available():
+        pytest.skip("CUDA is required for KV cache transceiver test")
```
54-56: Use module-qualified Mapping to match the import refactor.

Apply this diff:

```diff
-    mapping = Mapping(world_size=1, rank=0)
+    mapping = mapping_mod.Mapping(world_size=1, rank=0)
```
58-61: Minor: align buffer size with the cache config to avoid confusion.

`max_tokens_in_buffer=512` while KvCacheConfig sets `max_tokens=256`. It may be intentional, but consider aligning the two or documenting the discrepancy to avoid future confusion.
62-67: Qualify the function name after the import refactor; also prefer parentheses for line continuation.

Avoid backslashes for line breaks. Use parentheses.

Apply this diff:

```diff
-    kv_cache_transceiver_0 = create_kv_cache_transceiver(
+    kv_cache_transceiver_0 = kv_cache_transceiver.create_kv_cache_transceiver(
         mapping, kv_cache_manager_0, attention_type, cache_transceiver_config)
-    kv_cache_transceiver_1 = create_kv_cache_transceiver(
+    kv_cache_transceiver_1 = kv_cache_transceiver.create_kv_cache_transceiver(
         mapping, kv_cache_manager_1, attention_type, cache_transceiver_config)
```
71-80: Use module-qualified SamplingParams and LlmRequest after the import refactor.

Apply this diff:

```diff
-    sampling_params = SamplingParams()
-    ctx_request = LlmRequest(
+    sampling_params = sampling_params_mod.SamplingParams()
+    ctx_request = llm_request.LlmRequest(
         request_id=0,
         max_new_tokens=1,
         input_tokens=list(range(256)),
         sampling_config=tensorrt_llm.bindings.SamplingConfig(
             sampling_params._get_sampling_config()),
         is_streaming=False,
         llm_request_type=LlmRequestType.LLMREQUEST_TYPE_CONTEXT_ONLY)
```
87-96: Use module-qualified LlmRequest and a unique request_id for the generation request.

Reusing `request_id=0` for both context and generation may collide depending on internal tracking. It is safer to use a distinct id.

Apply this diff:

```diff
-    gen_request = LlmRequest(
-        request_id=0,
+    gen_request = llm_request.LlmRequest(
+        request_id=1,
         max_new_tokens=1,
         input_tokens=list(range(256)),
         sampling_config=tensorrt_llm.bindings.SamplingConfig(
             sampling_params._get_sampling_config()),
         is_streaming=False,
         llm_request_type=LlmRequestType.LLMREQUEST_TYPE_GENERATION_ONLY,
         context_phase_params=ctx_request.context_phase_params)
```
97-101: Avoid the internal `.impl` if possible; otherwise OK. Also consider synchronizing before status checks on CUDA.

Apply this diff to add a CUDA sync (helps prevent a race with async ops on some backends):

```diff
     kv_cache_manager_1.impl.add_sequence(gen_request.py_request_id,
                                          gen_request.prompt_len, 1, gen_request)
     # send gen request
     kv_cache_transceiver_1.request_and_receive_async(gen_request)
+    # Ensure any device-side work is completed before polling status.
+    if torch.cuda.is_available():
+        torch.cuda.synchronize()
```

If a public add_sequence exists, prefer it over `.impl.add_sequence(...)`.
102-106: Prefer explicit assertions for transfer status (and verify the return semantics).

The current code raises on a truthy return; assert the expected value instead. Also confirm the meaning of the return value for `check_*_transfer_status`.

Apply this diff:

```diff
-    if kv_cache_transceiver_0.check_context_transfer_status(1):
-        raise Exception("Context transfer status is not 0")
-    if kv_cache_transceiver_1.check_gen_transfer_status(1):
-        raise Exception("Gen transfer status is not 0")
+    assert kv_cache_transceiver_0.check_context_transfer_status(1) == 0, \
+        "Context transfer status is not 0"
+    assert kv_cache_transceiver_1.check_gen_transfer_status(1) == 0, \
+        "Gen transfer status is not 0"
```
107-110: Use a tolerance-based comparison for numeric buffers.

FP8/FP16/BF16 may incur rounding differences; `torch.equal` requires an exact bitwise match. Use `allclose` on a common dtype for robustness.

Apply this diff:

```diff
-    assert torch.equal(
-        kv_cache_manager_1.get_buffers(0),
-        kv_cache_manager_0.get_buffers(0)), "different kv-cache values"
+    lhs = kv_cache_manager_1.get_buffers(0).to(torch.float32)
+    rhs = kv_cache_manager_0.get_buffers(0).to(torch.float32)
+    assert torch.allclose(lhs, rhs, rtol=1e-3, atol=1e-3), "different kv-cache values"
```
📒 Files selected for processing (1)
`tests/unittest/_torch/test_kv_cache_transceiver.py` (1 hunks)
🔇 Additional comments (1)
tests/unittest/_torch/test_kv_cache_transceiver.py (1)
81-85: Avoid relying on the internal `.impl` attribute unless there is no supported public API.

Direct access to `kv_cache_manager.impl.add_sequence(...)` couples the test to internal implementation details and may be brittle.

If a public method exists (e.g., `kv_cache_manager.add_sequence(...)`), prefer that. If not, consider adding a thin public wrapper to exercise the intended API surface.
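A thin wrapper along the following lines would keep the test on a public surface. This is only a sketch; it assumes the `impl.add_sequence(request_id, prompt_len, beam_width, llm_request)` signature that the test already relies on.

```python
class KVCacheManager:
    # ... existing implementation ...

    def add_sequence(self, request_id: int, prompt_len: int, beam_width: int,
                     llm_request) -> None:
        """Public pass-through so callers need not reach into `self.impl`."""
        self.impl.add_sequence(request_id, prompt_len, beam_width, llm_request)
```

The test side then becomes `kv_cache_manager_1.add_sequence(gen_request.py_request_id, gen_request.prompt_len, 1, gen_request)` with no reference to `.impl`.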
df03fc7 to 4d44bab
/bot run

PR_Github #15990 [ run ] triggered by Bot
Actionable comments posted: 4
🧹 Nitpick comments (6)
tests/unittest/others/test_kv_cache_transceiver.py (6)
3-4: Gracefully skip when TensorRT-LLM is unavailable.

Use `pytest.importorskip` before importing tensorrt_llm to avoid collection failures in environments without TRT-LLM.

```diff
 import pytest
 import torch

+pytest.importorskip("tensorrt_llm")
 import tensorrt_llm
```
7-13: Prefer module-namespace imports for internal packages.

Guidelines recommend keeping the module namespace (use module.symbol), which also avoids polluting the test's global namespace. Optional, but improves clarity when skimming tests.

Example (not exhaustive):

```diff
-from tensorrt_llm._torch.pyexecutor.kv_cache_transceiver import \
-    create_kv_cache_transceiver
-from tensorrt_llm._torch.pyexecutor.llm_request import LlmRequest
-from tensorrt_llm._torch.pyexecutor.resource_manager import KVCacheManager
+from tensorrt_llm._torch.pyexecutor import kv_cache_transceiver as kv_tx
+from tensorrt_llm._torch.pyexecutor import llm_request as llmr
+from tensorrt_llm._torch.pyexecutor import resource_manager as rm
```

And then use:

- `kv_tx.create_kv_cache_transceiver(...)`
- `llmr.LlmRequest(...)`
- `rm.KVCacheManager(...)`
19-33: Add type hints and a brief docstring to the helper.

Minor polish: add typing to make the test intent clearer and IDE-friendly.

```diff
-def create_kv_cache_manager(mapping, dtype):
+def create_kv_cache_manager(mapping: Mapping, dtype: DataType) -> KVCacheManager:
+    """Create a minimal single-layer KVCacheManager for testing."""
     return KVCacheManager(
         trtllm.KvCacheConfig(
             max_tokens=256,
             enable_block_reuse=False,
         ),
         tensorrt_llm.bindings.internal.batch_manager.CacheType.SELF,
         num_layers=1,
         num_kv_heads=1,
         head_dim=1,
         tokens_per_block=8,
         max_seq_len=256,
         max_batch_size=1,
         mapping=mapping,
         dtype=dtype)
```
45-55: Fixture indirection is unnecessary; parametrize directly over dtype pairs.

The current indirection via strings makes the matrix harder to extend. Consider parametrizing with tuples of DataType directly and dropping the fixture.

Example:

```python
@pytest.mark.parametrize(
    "ctx_kv_cache_dtype,gen_kv_cache_dtype",
    [
        (DataType.FP8, DataType.FP8),
        (DataType.HALF, DataType.HALF),
        (DataType.BF16, DataType.BF16),
    ],
    ids=["ctx_fp8_gen_fp8", "ctx_fp16_gen_fp16", "ctx_bf16_gen_bf16"],
)
def test_kv_cache_transceiver_single_process(ctx_kv_cache_dtype,
                                             gen_kv_cache_dtype,
                                             attention_type):
    ...
```
84-85: Initialize the cache after sequence allocation, before sending.

Filling the buffer prior to add_sequence risks writing into memory not yet associated with the sequence (implementation-dependent). It is safer to fill after the sequence is added and before the send.

```diff
-    fill_kv_cache_buffer(kv_cache_manager_ctx)
+    # Fill after sequence allocation to ensure buffers belong to the sequence
@@
     kv_cache_manager_ctx.impl.add_sequence(ctx_request.py_request_id,
                                            ctx_request.prompt_len, 1,
                                            ctx_request)
+    # Fill KV cache with known data for transfer verification
+    fill_kv_cache_buffer(kv_cache_manager_ctx)

     # send ctx request
     kv_cache_transceiver_ctx.respond_and_send_async(ctx_request)
```

Also applies to: 97-101
88-116: Minor readability: avoid magic numbers and add an inline comment for the "1".

The third argument "1" in add_sequence is unclear. Consider naming a local variable (e.g., num_sequences=1 or beam_width=1) or adding a brief comment.

```diff
-    kv_cache_manager_ctx.impl.add_sequence(ctx_request.py_request_id,
-                                           ctx_request.prompt_len, 1,
-                                           ctx_request)
+    # The third arg is <beam_width>=1
+    kv_cache_manager_ctx.impl.add_sequence(ctx_request.py_request_id,
+                                           ctx_request.prompt_len, 1,
+                                           ctx_request)
```

Repeat for the gen side.
📒 Files selected for processing (1)
`tests/unittest/others/test_kv_cache_transceiver.py` (1 hunks)
🔇 Additional comments (2)
tests/unittest/others/test_kv_cache_transceiver.py (2)
57-66: Good parametrization coverage.

Covering MHA and MLA plus multiple dtypes is valuable. The 120s timeout is reasonable for CI.
97-99: The script above will dump:

- The first 150 lines of the test (`test_kv_cache_transceiver.py`) so we can see how `kv_cache_manager_ctx` and its `.impl.add_sequence` calls are structured.
- The public API on `KVCacheManager` (runtime/kv_cache_manager.py) around its `add_sequence` method.
- The implementation in `pools_kv_cache_manager.py` to compare signatures.

Once we confirm that `KVCacheManager` exposes a public `add_sequence(...)` with a matching signature, we can update the tests to call `kv_cache_manager_ctx.add_sequence(...)` (and similarly for `kv_cache_manager_gen`) instead of reaching into `.impl`.

I'll review the outputs to verify the correct signature and then draft the revised test snippets.
PR_Github #15990 [ run ] completed with state
4d44bab to 6fb4f74
/bot run --disable-fail-fast

PR_Github #16047 [ run ] triggered by Bot

PR_Github #16047 [ run ] completed with state

/bot run

PR_Github #16125 [ run ] triggered by Bot
Signed-off-by: Bo Deng <deemod@nvidia.com>
fix
Signed-off-by: Bo Deng <deemod@nvidia.com>
Signed-off-by: Bo Deng <deemod@nvidia.com>
6fb4f74 to fda1432
/bot run

PR_Github #16141 [ run ] triggered by Bot

PR_Github #16125 [ run ] completed with state

/bot run

PR_Github #16168 [ run ] triggered by Bot

PR_Github #16141 [ run ] completed with state

PR_Github #16168 [ run ] completed with state
```python
@pytest.mark.parametrize(
    "ctx_gen_kv_cache_dtype",
    ["ctx_fp8_gen_fp8", "ctx_fp16_gen_fp16", "ctx_bf16_gen_bf16"],
    ids=["ctx_fp8_gen_fp8", "ctx_fp16_gen_fp16", "ctx_bf16_gen_bf16"],
    indirect=True)
```
It's fine in this version. I can help to refactor this by parametrizing context and generation cache type with two separate inputs in the future PR.
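A sketch of that future refactor could look like the following, reusing the `DataType` values already exercised by the test (illustrative only, not the committed code):

```python
@pytest.mark.parametrize("ctx_kv_cache_dtype",
                         [DataType.FP8, DataType.HALF, DataType.BF16],
                         ids=["ctx_fp8", "ctx_fp16", "ctx_bf16"])
@pytest.mark.parametrize("gen_kv_cache_dtype",
                         [DataType.FP8, DataType.HALF, DataType.BF16],
                         ids=["gen_fp8", "gen_fp16", "gen_bf16"])
def test_kv_cache_transceiver_single_process(ctx_kv_cache_dtype,
                                             gen_kv_cache_dtype,
                                             attention_type):
    # Build the context-side manager with ctx_kv_cache_dtype and the
    # generation-side manager with gen_kv_cache_dtype, then run the same flow.
    ...
```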
Summary by CodeRabbit
Description
Test Coverage
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message.

See details below for each supported subcommand.
run

`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.
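For example, a typical invocation combining a couple of the optional flags above might look like this (the stage name is illustrative):

```
/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"
```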
kill

`kill` : Kill all running builds associated with pull request.
skip

`skip --comment COMMENT` : Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline` : Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.