[TRTLLM-7096][infra] Testing cache transmission functionality in Python #7025
📝 Walkthrough

Adds a new pytest test module that verifies kv-cache transceiver behavior in a single-process scenario across three dtype pairs (FP8, FP16, BF16) and two attention modes; it includes helpers to create and seed KVCacheManager instances, a dtype fixture, and a parameterized test that runs a context-only request followed by a generation-only request and compares the final buffers.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant PT as Pytest
    participant Ctx as KVCacheManager (Context)
    participant Gen as KVCacheManager (Generation)
    Note over PT: Parameterize over dtype pairs (FP8, FP16, BF16) and attention types (DEFAULT, MLA)
    PT->>Ctx: create_kv_cache_manager(mapping, dtype_ctx)
    PT->>Gen: create_kv_cache_manager(mapping, dtype_gen)
    PT->>Ctx: fill_kv_cache_buffer(Ctx)
    rect rgb(235,245,255)
        note right of Ctx: Context-only phase
        PT->>Ctx: configure & run context-only request
        Ctx-->>PT: updated kv-cache buffers
    end
    rect rgb(235,255,235)
        note right of Gen: Generation-only phase
        PT->>Gen: configure & run generation-only request
        Gen-->>PT: updated kv-cache buffers
    end
    PT->>PT: compare buffers
    Note over PT: Assert Ctx buffers == Gen buffers
```
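For orientation, here is a condensed sketch of the flow the diagram describes, assembled from the helper and API names quoted in the review comments below (`create_kv_cache_manager`, `fill_kv_cache_buffer`, `create_kv_cache_transceiver`, the context-only and generation-only `LlmRequest` types). Treat it as an illustration of the test's shape rather than the exact file contents; import paths and the `LlmRequestType`/`AttentionTypeCpp` locations are assumptions based on the review diffs.

```python
import torch
import tensorrt_llm
from tensorrt_llm._torch.pyexecutor.kv_cache_transceiver import \
    create_kv_cache_transceiver
from tensorrt_llm._torch.pyexecutor.llm_request import LlmRequest
from tensorrt_llm.mapping import Mapping
from tensorrt_llm.sampling_params import SamplingParams

# LlmRequestType is assumed to be importable from the bindings; see the review
# diffs for the exact symbols used in the real test.


def run_single_process_transfer(kv_cache_dtype, attention_type,
                                cache_transceiver_config):
    """Send a context-only request, receive it as generation-only, compare buffers."""
    mapping = Mapping(world_size=1, rank=0)
    # create_kv_cache_manager / fill_kv_cache_buffer are helpers defined in the
    # test module itself (quoted in the review below); not reproduced here.
    kv_cache_manager_ctx = create_kv_cache_manager(mapping, kv_cache_dtype)
    kv_cache_manager_gen = create_kv_cache_manager(mapping, kv_cache_dtype)
    fill_kv_cache_buffer(kv_cache_manager_ctx)  # seed context cache with known data

    transceiver_ctx = create_kv_cache_transceiver(
        mapping, kv_cache_manager_ctx, attention_type, cache_transceiver_config)
    transceiver_gen = create_kv_cache_transceiver(
        mapping, kv_cache_manager_gen, attention_type, cache_transceiver_config)

    sampling_config = tensorrt_llm.bindings.SamplingConfig(
        SamplingParams()._get_sampling_config())

    # Context-only phase: register the sequence and send its kv-cache.
    ctx_request = LlmRequest(
        request_id=0, max_new_tokens=1, input_tokens=list(range(256)),
        sampling_config=sampling_config, is_streaming=False,
        llm_request_type=LlmRequestType.LLMREQUEST_TYPE_CONTEXT_ONLY)
    kv_cache_manager_ctx.impl.add_sequence(ctx_request.py_request_id,
                                           ctx_request.prompt_len, 1, ctx_request)
    transceiver_ctx.respond_and_send_async(ctx_request)

    # Generation-only phase: request the kv-cache using the context-phase params.
    gen_request = LlmRequest(
        request_id=1, max_new_tokens=1, input_tokens=list(range(256)),
        sampling_config=sampling_config, is_streaming=False,
        llm_request_type=LlmRequestType.LLMREQUEST_TYPE_GENERATION_ONLY,
        context_phase_params=ctx_request.context_phase_params)
    kv_cache_manager_gen.impl.add_sequence(gen_request.py_request_id,
                                           gen_request.prompt_len, 1, gen_request)
    transceiver_gen.request_and_receive_async(gen_request)

    # Wait for both sides to report transfer completion, then compare buffers.
    transceiver_ctx.check_context_transfer_status(1)
    transceiver_gen.check_gen_transfer_status(1)
    assert torch.equal(kv_cache_manager_gen.get_buffers(0),
                       kv_cache_manager_ctx.get_buffers(0))
```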
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 1
🧹 Nitpick comments (8)
tests/unittest/_torch/test_cache_transceiver.py (8)
6-7: Prefer module-namespace import for consistency.

Follow the guideline to keep module namespaces in imports.

```diff
-from tensorrt_llm._torch.pyexecutor.kv_cache_transceiver import \
-    create_kv_cache_transceiver
+from tensorrt_llm._torch.pyexecutor import kv_cache_transceiver as kv_tx
```

Outside this hunk, update call sites to `kv_tx.create_kv_cache_transceiver(...)` (see a separate suggestion below).
61-66: Update call sites after switching to module-namespace import.

Use the aliased module to invoke the factory function.

```diff
-        self.kv_cache_transceiver_0 = create_kv_cache_transceiver(
+        self.kv_cache_transceiver_0 = kv_tx.create_kv_cache_transceiver(
             self.mapping, self.kv_cache_manager_0, AttentionTypeCpp.DEFAULT,
             cache_transceiver_config)
-        self.kv_cache_transceiver_1 = create_kv_cache_transceiver(
+        self.kv_cache_transceiver_1 = kv_tx.create_kv_cache_transceiver(
             self.mapping, self.kv_cache_manager_1, AttentionTypeCpp.DEFAULT,
             cache_transceiver_config)
```
20-36: Minor: use the DataType alias consistently.

You already assigned `DataType = tensorrt_llm.bindings.DataType`; use it here for readability.

```diff
-            dtype=tensorrt_llm.bindings.DataType.FLOAT,
+            dtype=DataType.FLOAT,
```
37-46: Remove unused helper.

`_create_requests` is not referenced and can be removed to reduce noise.

```diff
-    def _create_requests(self, request_ids):
-        requests = []
-        for request_id in request_ids:
-            requests.append(
-                LlmRequest(
-                    request_id=request_id,
-                    max_new_tokens=1,
-                ))
-        return requests
```
47-51: Delete commented-out stubs.

These TODO stubs add clutter without value.

```diff
-    #def _get_context_phase_params(self):
-    #return None
-
-    #def _prepare_
```
79-83: Async send: ensure readiness before consuming context-phase data.

`respond_and_send_async` suggests asynchronous behavior. If `request.context_phase_params` is populated only after the send progresses, consuming it immediately may be racy.

Consider gating on a status/ready check (or waiting API) before constructing the generation-only request, e.g., poll a readiness predicate with a reasonable timeout. If the API guarantees immediate population, please ignore.
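One possible shape for such a guard, sketched with a hypothetical `wait_until` helper (not an existing API in the codebase); the only assumption is that `context_phase_params` becomes non-`None` once the send has progressed far enough:

```python
import time


def wait_until(predicate, timeout_s=10.0, interval_s=0.05):
    """Poll `predicate` until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval_s)
    return False


# Hypothetical usage after respond_and_send_async(request): gate on the
# context-phase params being populated before building the generation-only request.
assert wait_until(lambda: request.context_phase_params is not None), \
    "context_phase_params was not populated within the timeout"
```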
98-101: Rename variables and replace prints with assertions.

Use meaningful names and assertions in tests. Avoid printing in unit tests.

```diff
-        wtf0 = self.kv_cache_manager_0.get_cache_indices(request)
-        wtf1 = self.kv_cache_manager_1.get_cache_indices(request_1)
-        print(wtf0)
-        print(wtf1)
+        cache_indices_0 = self.kv_cache_manager_0.get_cache_indices(request)
+        cache_indices_1 = self.kv_cache_manager_1.get_cache_indices(request_1)
+        self.assertIsNotNone(cache_indices_0)
+        self.assertIsNotNone(cache_indices_1)
```
68-78: Expose a public SamplingParams → SamplingConfig conversion.

There isn't a non-underscored method on `SamplingParams` for producing a `tllme.SamplingConfig`, so tests today must call the private `_get_sampling_config()`. To avoid future breakage if internals change, please:

- In `tensorrt_llm/sampling_params.py`, add a public alias, e.g.:

  ```python
  def get_sampling_config(self) -> tllme.SamplingConfig:
      return self._get_sampling_config()
  ```

- Update tests (e.g. `tests/unittest/_torch/test_cache_transceiver.py`) to use `get_sampling_config()` instead of `_get_sampling_config()`.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
`tests/unittest/_torch/test_cache_transceiver.py` (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tests/unittest/_torch/test_cache_transceiver.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tests/unittest/_torch/test_cache_transceiver.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tests/unittest/_torch/test_cache_transceiver.py (2)
53-67: Verify Mapping usage for multi-rank cache transceiver.

The existing test instantiates both transceivers on rank 0 (world_size=2, tp_size=2). In practice, to exercise inter-rank communication you would use rank 0 and rank 1. However, our multi-GPU CLI test (tests/unittest/llmapi/test_llm_multi_gpu.py) already verifies `Mapping(world_size=tp_size, tp_size, rank=…)` across ranks 0 and 1. If the intent of `test_cache_transceiver.py` is to validate loopback on a single rank, keeping both at rank 0 is acceptable. Otherwise, change to use `Mapping(..., rank=0)` and `Mapping(..., rank=1)`.

Action items:

- If loopback behavior is intended (no inter-rank path), add a comment in `test_cache_transceiver.py` clarifying why both mappings use rank 0.
- If inter-rank paths should be tested here, update:
  - Line 53: `self.mapping_0 = Mapping(world_size=2, tp_size=2, rank=0)`
  - Line 54: `self.mapping_1 = Mapping(world_size=2, tp_size=2, rank=1)`
  - Pass `mapping_0`/`mapping_1` into their respective transceiver calls (see the sketch below).
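If the inter-rank path is the goal, the setup could look like the following sketch (same `create_kv_cache_transceiver` signature as the existing test; only the mappings change):

```python
# Sketch: distinct ranks so the transfer crosses a rank boundary instead of
# looping back on rank 0. Everything else stays as in the existing test.
self.mapping_0 = Mapping(world_size=2, tp_size=2, rank=0)
self.mapping_1 = Mapping(world_size=2, tp_size=2, rank=1)

self.kv_cache_transceiver_0 = create_kv_cache_transceiver(
    self.mapping_0, self.kv_cache_manager_0, AttentionTypeCpp.DEFAULT,
    cache_transceiver_config)
self.kv_cache_transceiver_1 = create_kv_cache_transceiver(
    self.mapping_1, self.kv_cache_manager_1, AttentionTypeCpp.DEFAULT,
    cache_transceiver_config)
```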
1-1: Add the NVIDIA copyright header.

Per repo guidelines, prepend the current-year NVIDIA copyright header to all source files.

```diff
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
 import unittest
```

⛔ Skipped due to learnings

Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
d85c858 to 5f9f34a
5f9f34a to df03fc7
@Shunkangz please help review, thanks
Actionable comments posted: 1
🧹 Nitpick comments (12)
tests/unittest/_torch/test_kv_cache_transceiver.py (12)
7-12: Prefer module-qualified imports to comply with the "maintain module namespace" guideline.

Avoid importing classes/functions directly; import the module and qualify at use sites.

Apply this diff:

```diff
-from tensorrt_llm._torch.pyexecutor.kv_cache_transceiver import \
-    create_kv_cache_transceiver
-from tensorrt_llm._torch.pyexecutor.llm_request import LlmRequest
-from tensorrt_llm._torch.pyexecutor.resource_manager import KVCacheManager
-from tensorrt_llm.mapping import Mapping
-from tensorrt_llm.sampling_params import SamplingParams
+import tensorrt_llm._torch.pyexecutor.kv_cache_transceiver as kv_cache_transceiver
+import tensorrt_llm._torch.pyexecutor.llm_request as llm_request
+import tensorrt_llm._torch.pyexecutor.resource_manager as resource_manager
+import tensorrt_llm.mapping as mapping_mod
+import tensorrt_llm.sampling_params as sampling_params_mod
```

Follow-up use-site changes are suggested below.
19-33: Qualify KVCacheManager with its module after the import refactor.

Apply this diff:

```diff
 def create_kv_cache_manager(mapping, dtype):
-    return KVCacheManager(
+    return resource_manager.KVCacheManager(
         trtllm.KvCacheConfig(
             max_tokens=256,
             enable_block_reuse=False,
         ),
         tensorrt_llm.bindings.internal.batch_manager.CacheType.SELF,
         num_layers=1,
         num_kv_heads=1,
         head_dim=1,
         tokens_per_block=8,
         max_seq_len=256,
         max_batch_size=1,
         mapping=mapping,
         dtype=dtype)
```
45-52: Parameterization looks good; consider guarding FP8 if unsupported on the runner.

FP8 may not be available on all CI GPU SKUs. Consider a conditional skip for FP8.

Example (if you want to keep it self-contained):

```diff
 @pytest.mark.timeout(120)
 @pytest.mark.parametrize("kv_cache_dtype",
                          [DataType.FP8, DataType.HALF, DataType.BF16],
                          ids=["fp8", "fp16", "bf16"])
 @pytest.mark.parametrize("attention_type",
                          [AttentionTypeCpp.DEFAULT, AttentionTypeCpp.MLA],
                          ids=["default_attention", "mla_attention"])
 def test_kv_cache_transceiver_single_process(kv_cache_dtype, attention_type):
+    if not torch.cuda.is_available():
+        pytest.skip("CUDA is required for KV cache transceiver test")
+    if kv_cache_dtype == DataType.FP8:
+        # TODO: Replace with a real runtime capability check if available.
+        # FP8 typically requires Hopper or later (e.g., H100).
+        major, minor = torch.cuda.get_device_capability()
+        if major < 9:
+            pytest.skip("Skipping FP8 on non-Hopper GPU")
```

If the project exposes a proper "is_dtype_supported" check, prefer that over capability heuristics.
53-57: Add a CUDA availability guard at the start of the test.

This avoids failures on CPU-only runners.

Apply this diff:

```diff
 def test_kv_cache_transceiver_single_process(kv_cache_dtype, attention_type):
     # Init kv_cache manager and cache transceiver
+    if not torch.cuda.is_available():
+        pytest.skip("CUDA is required for KV cache transceiver test")
```
54-56: Use module-qualified Mapping to match the import refactor.

Apply this diff:

```diff
-    mapping = Mapping(world_size=1, rank=0)
+    mapping = mapping_mod.Mapping(world_size=1, rank=0)
```
58-61: Minor: align buffer size with the cache config to avoid confusion.

`max_tokens_in_buffer=512` while KvCacheConfig sets `max_tokens=256`. It may be intentional, but consider aligning the two or documenting the discrepancy to avoid future confusion.
62-67: Qualify the function name after the import refactor; also prefer parentheses for line continuation.

Avoid backslashes for line breaks. Use parentheses.

Apply this diff:

```diff
-    kv_cache_transceiver_0 = create_kv_cache_transceiver(
+    kv_cache_transceiver_0 = kv_cache_transceiver.create_kv_cache_transceiver(
         mapping, kv_cache_manager_0, attention_type, cache_transceiver_config)
-    kv_cache_transceiver_1 = create_kv_cache_transceiver(
+    kv_cache_transceiver_1 = kv_cache_transceiver.create_kv_cache_transceiver(
         mapping, kv_cache_manager_1, attention_type, cache_transceiver_config)
```
71-80: Use module-qualified SamplingParams and LlmRequest after the import refactor.

Apply this diff:

```diff
-    sampling_params = SamplingParams()
-    ctx_request = LlmRequest(
+    sampling_params = sampling_params_mod.SamplingParams()
+    ctx_request = llm_request.LlmRequest(
         request_id=0,
         max_new_tokens=1,
         input_tokens=list(range(256)),
         sampling_config=tensorrt_llm.bindings.SamplingConfig(
             sampling_params._get_sampling_config()),
         is_streaming=False,
         llm_request_type=LlmRequestType.LLMREQUEST_TYPE_CONTEXT_ONLY)
```
87-96: Use module-qualified LlmRequest and a unique request_id for the generation request.

Reusing `request_id=0` for both context and generation may collide depending on internal tracking. It is safer to use a distinct id.

Apply this diff:

```diff
-    gen_request = LlmRequest(
-        request_id=0,
+    gen_request = llm_request.LlmRequest(
+        request_id=1,
         max_new_tokens=1,
         input_tokens=list(range(256)),
         sampling_config=tensorrt_llm.bindings.SamplingConfig(
             sampling_params._get_sampling_config()),
         is_streaming=False,
         llm_request_type=LlmRequestType.LLMREQUEST_TYPE_GENERATION_ONLY,
         context_phase_params=ctx_request.context_phase_params)
```
97-101: Avoid the internal `.impl` if possible; otherwise OK. Also consider synchronizing before status checks on CUDA.

Apply this diff to add a CUDA sync (helps prevent a race with async ops on some backends):

```diff
     kv_cache_manager_1.impl.add_sequence(gen_request.py_request_id,
                                          gen_request.prompt_len, 1, gen_request)
     # send gen request
     kv_cache_transceiver_1.request_and_receive_async(gen_request)
+    # Ensure any device-side work is completed before polling status.
+    if torch.cuda.is_available():
+        torch.cuda.synchronize()
```

If a public add_sequence exists, prefer it over `.impl.add_sequence(...)`.
102-106: Prefer explicit assertions for transfer status (and verify the return semantics).

The current code raises on a truthy return; assert the expected value instead. Also confirm the meaning of the return value for `check_*_transfer_status`.

Apply this diff:

```diff
-    if kv_cache_transceiver_0.check_context_transfer_status(1):
-        raise Exception("Context transfer status is not 0")
-    if kv_cache_transceiver_1.check_gen_transfer_status(1):
-        raise Exception("Gen transfer status is not 0")
+    assert kv_cache_transceiver_0.check_context_transfer_status(1) == 0, \
+        "Context transfer status is not 0"
+    assert kv_cache_transceiver_1.check_gen_transfer_status(1) == 0, \
+        "Gen transfer status is not 0"
```
107-110: Use a tolerance-based comparison for numeric buffers.

FP8/FP16/BF16 may incur rounding differences; `torch.equal` requires an exact bitwise match. Use `allclose` on a common dtype for robustness.

Apply this diff:

```diff
-    assert torch.equal(
-        kv_cache_manager_1.get_buffers(0),
-        kv_cache_manager_0.get_buffers(0)), "different kv-cache values"
+    lhs = kv_cache_manager_1.get_buffers(0).to(torch.float32)
+    rhs = kv_cache_manager_0.get_buffers(0).to(torch.float32)
+    assert torch.allclose(lhs, rhs, rtol=1e-3, atol=1e-3), "different kv-cache values"
```
📒 Files selected for processing (1)
`tests/unittest/_torch/test_kv_cache_transceiver.py` (1 hunks)
🔇 Additional comments (1)
tests/unittest/_torch/test_kv_cache_transceiver.py (1)
81-85: Avoid relying on the internal `.impl` attribute unless there is no supported public API.

Direct access to `kv_cache_manager.impl.add_sequence(...)` couples the test to internal implementation details and may be brittle.

If a public method exists (e.g., `kv_cache_manager.add_sequence(...)`), prefer that. If not, consider adding a thin public wrapper to exercise the intended API surface.
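A thin wrapper along the following lines would keep the test on a public surface. This is only a sketch; it assumes the `impl.add_sequence(request_id, prompt_len, beam_width, llm_request)` signature that the test already relies on.

```python
class KVCacheManager:
    # ... existing implementation ...

    def add_sequence(self, request_id: int, prompt_len: int, beam_width: int,
                     llm_request) -> None:
        """Public pass-through so callers need not reach into `self.impl`."""
        self.impl.add_sequence(request_id, prompt_len, beam_width, llm_request)
```

The test side then becomes `kv_cache_manager_1.add_sequence(gen_request.py_request_id, gen_request.prompt_len, 1, gen_request)` with no reference to `.impl`.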
df03fc7 to 4d44bab
/bot run

PR_Github #15990 [ run ] triggered by Bot
Actionable comments posted: 4
🧹 Nitpick comments (6)
tests/unittest/others/test_kv_cache_transceiver.py (6)
3-4: Gracefully skip when TensorRT-LLM is unavailable.

Use `pytest.importorskip` before importing tensorrt_llm to avoid collection failures in environments without TRT-LLM.

```diff
 import pytest
 import torch

+pytest.importorskip("tensorrt_llm")
 import tensorrt_llm
```
7-13: Prefer module-namespace imports for internal packages.

Guidelines recommend keeping the module namespace (use module.symbol), which also avoids polluting the test's global namespace. Optional, but improves clarity when skimming tests.

Example (not exhaustive):

```diff
-from tensorrt_llm._torch.pyexecutor.kv_cache_transceiver import \
-    create_kv_cache_transceiver
-from tensorrt_llm._torch.pyexecutor.llm_request import LlmRequest
-from tensorrt_llm._torch.pyexecutor.resource_manager import KVCacheManager
+from tensorrt_llm._torch.pyexecutor import kv_cache_transceiver as kv_tx
+from tensorrt_llm._torch.pyexecutor import llm_request as llmr
+from tensorrt_llm._torch.pyexecutor import resource_manager as rm
```

And then use:

- `kv_tx.create_kv_cache_transceiver(...)`
- `llmr.LlmRequest(...)`
- `rm.KVCacheManager(...)`
19-33: Add type hints and a brief docstring to the helper.

Minor polish: add typing to make the test intent clearer and IDE-friendly.

```diff
-def create_kv_cache_manager(mapping, dtype):
+def create_kv_cache_manager(mapping: Mapping, dtype: DataType) -> KVCacheManager:
+    """Create a minimal single-layer KVCacheManager for testing."""
     return KVCacheManager(
         trtllm.KvCacheConfig(
             max_tokens=256,
             enable_block_reuse=False,
         ),
         tensorrt_llm.bindings.internal.batch_manager.CacheType.SELF,
         num_layers=1,
         num_kv_heads=1,
         head_dim=1,
         tokens_per_block=8,
         max_seq_len=256,
         max_batch_size=1,
         mapping=mapping,
         dtype=dtype)
```
45-55: Fixture indirection is unnecessary; parametrize directly over dtype pairs.

The current indirection via strings makes the matrix harder to extend. Consider parametrizing with tuples of DataType directly and dropping the fixture.

Example:

```python
@pytest.mark.parametrize(
    "ctx_kv_cache_dtype,gen_kv_cache_dtype",
    [
        (DataType.FP8, DataType.FP8),
        (DataType.HALF, DataType.HALF),
        (DataType.BF16, DataType.BF16),
    ],
    ids=["ctx_fp8_gen_fp8", "ctx_fp16_gen_fp16", "ctx_bf16_gen_bf16"],
)
def test_kv_cache_transceiver_single_process(ctx_kv_cache_dtype,
                                             gen_kv_cache_dtype,
                                             attention_type):
    ...
```
84-85: Initialize the cache after sequence allocation, before sending.

Filling the buffer prior to add_sequence risks writing into memory not yet associated with the sequence (implementation-dependent). It is safer to fill after the sequence is added and before the send.

```diff
-    fill_kv_cache_buffer(kv_cache_manager_ctx)
+    # Fill after sequence allocation to ensure buffers belong to the sequence
@@
     kv_cache_manager_ctx.impl.add_sequence(ctx_request.py_request_id,
                                            ctx_request.prompt_len, 1,
                                            ctx_request)
+    # Fill KV cache with known data for transfer verification
+    fill_kv_cache_buffer(kv_cache_manager_ctx)

     # send ctx request
     kv_cache_transceiver_ctx.respond_and_send_async(ctx_request)
```

Also applies to: 97-101
88-116: Minor readability: avoid magic numbers and add an inline comment for the "1".

The third argument "1" in add_sequence is unclear. Consider naming a local variable (e.g., num_sequences=1 or beam_width=1) or adding a brief comment.

```diff
-    kv_cache_manager_ctx.impl.add_sequence(ctx_request.py_request_id,
-                                           ctx_request.prompt_len, 1,
-                                           ctx_request)
+    # The third arg is <beam_width>=1
+    kv_cache_manager_ctx.impl.add_sequence(ctx_request.py_request_id,
+                                           ctx_request.prompt_len, 1,
+                                           ctx_request)
```

Repeat for the gen side.
📒 Files selected for processing (1)
`tests/unittest/others/test_kv_cache_transceiver.py` (1 hunks)
🔇 Additional comments (2)
tests/unittest/others/test_kv_cache_transceiver.py (2)
57-66: Good parametrization coverage.

Covering MHA and MLA plus multiple dtypes is valuable. The 120s timeout is reasonable for CI.
97-99: The script above will dump:

- The first 150 lines of the test (`test_kv_cache_transceiver.py`) so we can see how `kv_cache_manager_ctx` and its `.impl.add_sequence` calls are structured.
- The public API on `KVCacheManager` (runtime/kv_cache_manager.py) around its `add_sequence` method.
- The implementation in `pools_kv_cache_manager.py` to compare signatures.

Once we confirm that `KVCacheManager` exposes a public `add_sequence(...)` with a matching signature, we can update the tests to call `kv_cache_manager_ctx.add_sequence(...)` (and similarly for `kv_cache_manager_gen`) instead of reaching into `.impl`.

I'll review the outputs to verify the correct signature and then draft the revised test snippets.
PR_Github #15990 [ run ] completed with state
4d44bab to 6fb4f74
/bot run --disable-fail-fast

PR_Github #16047 [ run ] triggered by Bot

PR_Github #16047 [ run ] completed with state

/bot run

PR_Github #16125 [ run ] triggered by Bot
Signed-off-by: Bo Deng <deemod@nvidia.com>
fix
Signed-off-by: Bo Deng <deemod@nvidia.com>
Signed-off-by: Bo Deng <deemod@nvidia.com>
6fb4f74 to fda1432
/bot run

PR_Github #16141 [ run ] triggered by Bot

PR_Github #16125 [ run ] completed with state

/bot run

PR_Github #16168 [ run ] triggered by Bot

PR_Github #16141 [ run ] completed with state

PR_Github #16168 [ run ] completed with state
```python
@pytest.mark.parametrize(
    "ctx_gen_kv_cache_dtype",
    ["ctx_fp8_gen_fp8", "ctx_fp16_gen_fp16", "ctx_bf16_gen_bf16"],
    ids=["ctx_fp8_gen_fp8", "ctx_fp16_gen_fp16", "ctx_bf16_gen_bf16"],
    indirect=True)
```
It's fine in this version. I can help to refactor this by parametrizing context and generation cache type with two separate inputs in the future PR.
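A sketch of that future refactor could look like the following, reusing the `DataType` values already exercised by the test (illustrative only, not the committed code):

```python
@pytest.mark.parametrize("ctx_kv_cache_dtype",
                         [DataType.FP8, DataType.HALF, DataType.BF16],
                         ids=["ctx_fp8", "ctx_fp16", "ctx_bf16"])
@pytest.mark.parametrize("gen_kv_cache_dtype",
                         [DataType.FP8, DataType.HALF, DataType.BF16],
                         ids=["gen_fp8", "gen_fp16", "gen_bf16"])
def test_kv_cache_transceiver_single_process(ctx_kv_cache_dtype,
                                             gen_kv_cache_dtype,
                                             attention_type):
    # Build the context-side manager with ctx_kv_cache_dtype and the
    # generation-side manager with gen_kv_cache_dtype, then run the same flow.
    ...
```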
Summary by CodeRabbit
Description
Test Coverage
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message.

See details below for each supported subcommand.
run

`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.
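For example, a typical invocation combining a couple of the optional flags above might look like this (the stage name is illustrative):

```
/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"
```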
kill

`kill` : Kill all running builds associated with pull request.
skip

`skip --comment COMMENT` : Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline` : Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.