[TRTLLM-7096][infra] Testing cache transmission functionality in Python by bo-nv · Pull Request #7025 · NVIDIA/TensorRT-LLM · GitHub

Conversation

@bo-nv
Collaborator

@bo-nv bo-nv commented Aug 19, 2025

Summary by CodeRabbit

  • Tests
    • Added comprehensive unit tests for kv-cache transceiver behavior in single-process scenarios.
    • Parameterized coverage across multiple numeric precisions (FP8, FP16, BF16) and attention modes (Default, MLA).
    • Validates parity between context-only and generation-only phases to ensure consistent cache handling.
    • Introduced reusable test helpers and a pytest fixture to streamline setup and improve reliability.
    • Improves regression coverage and confidence in cache consistency across configurations.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail-fast behavior on build/test/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
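
For example, typical invocations combining the options above might look like this (stage and GPU names are illustrative; substitute the ones relevant to your change):

/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"
/bot run --gpu-type "A30, H100_PCIe" --post-merge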

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can cause the top of tree to break.

@coderabbitai
Contributor

coderabbitai bot commented Aug 19, 2025

📝 Walkthrough

Walkthrough

Adds a new pytest test module that verifies kv-cache transceiver behavior in a single-process scenario across three dtype pairs (FP8, FP16, BF16) and two attention modes; includes helpers to create/seed KVCacheManager instances, a dtype fixture, and a parameterized test that runs context-only then generation-only requests and compares final buffers.

Changes

Cohort / File(s): Unit tests: kv-cache transceiver (tests/unittest/others/test_kv_cache_transceiver.py)
Summary of Changes: New test module. Adds create_kv_cache_manager(mapping, dtype) and fill_kv_cache_buffer(kv_cache_manager) helpers, a ctx_gen_kv_cache_dtype pytest fixture (validates dtype pairs), and test_kv_cache_transceiver_single_process parameterized over dtype pairs (FP8/FP8, FP16/FP16, BF16/BF16) and attention types (DEFAULT, MLA). Builds separate context and generation KVCacheManager instances, executes context-only then generation-only requests to trigger transceiver transfers, and asserts final kv-cache buffers match.
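
For orientation, here is a condensed sketch of the transfer sequence the test exercises, pieced together from the review snippets later in this thread (call and variable names come from those snippets; construction of the managers, transceivers, and requests is elided, and the merged file may differ in detail):

import torch

def run_ctx_to_gen_transfer(kv_cache_manager_ctx, kv_cache_manager_gen,
                            kv_cache_transceiver_ctx, kv_cache_transceiver_gen,
                            ctx_request, gen_request):
    # Context side: register the sequence, then send its kv-cache asynchronously.
    kv_cache_manager_ctx.impl.add_sequence(ctx_request.py_request_id,
                                           ctx_request.prompt_len, 1, ctx_request)
    kv_cache_transceiver_ctx.respond_and_send_async(ctx_request)

    # Generation side: register the matching sequence and request the cache.
    kv_cache_manager_gen.impl.add_sequence(gen_request.py_request_id,
                                           gen_request.prompt_len, 1, gen_request)
    kv_cache_transceiver_gen.request_and_receive_async(gen_request)

    # Check transfer status on both sides; per the snippets, the test treats a
    # zero/falsy return as success.
    assert not kv_cache_transceiver_ctx.check_context_transfer_status(1)
    assert not kv_cache_transceiver_gen.check_gen_transfer_status(1)

    # Parity check: the generation-side pool should now hold the context-side cache.
    assert torch.equal(kv_cache_manager_gen.get_buffers(0),
                       kv_cache_manager_ctx.get_buffers(0)), "different kv-cache values"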

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant PT as Pytest
  participant Ctx as KVCacheManager (Context)
  participant Gen as KVCacheManager (Generation)

  Note over PT: Parameterize over dtype pairs (FP8, FP16, BF16) and attention types (DEFAULT, MLA)

  PT->>Ctx: create_kv_cache_manager(mapping, dtype_ctx)
  PT->>Gen: create_kv_cache_manager(mapping, dtype_gen)
  PT->>Ctx: fill_kv_cache_buffer(Ctx)

  rect rgb(235,245,255)
    note right of Ctx: Context-only phase
    PT->>Ctx: configure & run context-only request
    Ctx-->>PT: updated kv-cache buffers
  end

  rect rgb(235,255,235)
    note right of Gen: Generation-only phase
    PT->>Gen: configure & run generation-only request
    Gen-->>PT: updated kv-cache buffers
  end

  PT->>PT: compare buffers
  Note over PT: Assert Ctx buffers == Gen buffers

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • Shixiaowei02
  • litaotju

Tip

🔌 Remote MCP (Model Context Protocol) integration is now available!

Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats.


📜 Recent review details

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 6fb4f74 and fda1432.

📒 Files selected for processing (1)
  • tests/unittest/others/test_kv_cache_transceiver.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/unittest/others/test_kv_cache_transceiver.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai or @coderabbitai title anywhere in the PR title to generate the title automatically.

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (8)
tests/unittest/_torch/test_cache_transceiver.py (8)

6-7: Prefer module-namespace import for consistency.

Follow the guideline to keep module namespaces in imports.

-from tensorrt_llm._torch.pyexecutor.kv_cache_transceiver import \
-    create_kv_cache_transceiver
+from tensorrt_llm._torch.pyexecutor import kv_cache_transceiver as kv_tx

Outside this hunk, update call sites to kv_tx.create_kv_cache_transceiver(...) (see a separate suggestion below).


61-66: Update call sites after switching to module-namespace import.

Use the aliased module to invoke the factory function.

-        self.kv_cache_transceiver_0 = create_kv_cache_transceiver(
+        self.kv_cache_transceiver_0 = kv_tx.create_kv_cache_transceiver(
             self.mapping, self.kv_cache_manager_0, AttentionTypeCpp.DEFAULT,
             cache_transceiver_config)
-        self.kv_cache_transceiver_1 = create_kv_cache_transceiver(
+        self.kv_cache_transceiver_1 = kv_tx.create_kv_cache_transceiver(
             self.mapping, self.kv_cache_manager_1, AttentionTypeCpp.DEFAULT,
             cache_transceiver_config)

20-36: Minor: use the DataType alias consistently.

You already assigned DataType = tensorrt_llm.bindings.DataType; use it here for readability.

-            dtype=tensorrt_llm.bindings.DataType.FLOAT,
+            dtype=DataType.FLOAT,

37-46: Remove unused helper.

_create_requests is not referenced and can be removed to reduce noise.

-    def _create_requests(self, request_ids):
-        requests = []
-        for request_id in request_ids:
-            requests.append(
-                LlmRequest(
-                    request_id=request_id,
-                    max_new_tokens=1,
-                ))
-        return requests

47-51: Delete commented-out stubs.

These TODO stubs add clutter without value.

-    #def _get_context_phase_params(self):
-    #return None
-
-    #def _prepare_

79-83: Async send: ensure readiness before consuming context-phase data.

respond_and_send_async suggests asynchronous behavior. If request.context_phase_params is populated only after the send progresses, consuming it immediately may be racy.

Consider gating on a status/ready check (or waiting API) before constructing the generation-only request, e.g., poll a readiness predicate with a reasonable timeout. If the API guarantees immediate population, please ignore.
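
A minimal polling helper along these lines could serve as the gate (sketch only; the lambda below uses context_phase_params as the readiness signal, which is an assumption; substitute whatever status/readiness predicate the transceiver API actually provides):

import time

def wait_until(predicate, timeout_s=10.0, interval_s=0.05):
    # Poll `predicate` until it returns True or the timeout expires.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval_s)
    return False

# Usage sketch: gate construction of the generation-only request on the
# context-phase params being populated.
assert wait_until(lambda: request.context_phase_params is not None), \
    "context phase params were not populated in time"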


98-101: Rename variables and replace prints with assertions.

Use meaningful names and assertions in tests. Avoid printing in unit tests.

-        wtf0 = self.kv_cache_manager_0.get_cache_indices(request)
-        wtf1 = self.kv_cache_manager_1.get_cache_indices(request_1)
-        print(wtf0)
-        print(wtf1)
+        cache_indices_0 = self.kv_cache_manager_0.get_cache_indices(request)
+        cache_indices_1 = self.kv_cache_manager_1.get_cache_indices(request_1)
+        self.assertIsNotNone(cache_indices_0)
+        self.assertIsNotNone(cache_indices_1)

68-78: Expose a public SamplingParams→SamplingConfig conversion

There isn’t a non-underscored method on SamplingParams for producing a tllme.SamplingConfig, so tests today must call the private _get_sampling_config(). To avoid future breakage if internals change, please:

  • In tensorrt_llm/sampling_params.py
    Add a public alias, e.g.:
    def get_sampling_config(self) -> tllme.SamplingConfig:
        return self._get_sampling_config()
  • Update tests (e.g. tests/unittest/_torch/test_cache_transceiver.py) to use get_sampling_config() instead of _get_sampling_config().
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between bff5fdf and d85c858.

📒 Files selected for processing (1)
  • tests/unittest/_torch/test_cache_transceiver.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tests/unittest/_torch/test_cache_transceiver.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tests/unittest/_torch/test_cache_transceiver.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tests/unittest/_torch/test_cache_transceiver.py (2)

53-67: Verify Mapping Usage for Multi-Rank Cache Transceiver
The existing test instantiates both transceivers on rank 0 (world_size=2, tp_size=2). In practice, to exercise inter-rank communication you would use rank 0 and rank 1. However, our multi‐GPU CLI test (tests/unittest/llmapi/test_llm_multi_gpu.py) already verifies Mapping(world_size=tp_size, tp_size, rank=…) across ranks 0 and 1. If the intent of test_cache_transceiver.py is to validate loopback on a single rank, keeping both at rank 0 is acceptable. Otherwise, change to use Mapping(..., rank=0) and Mapping(..., rank=1).

Action items:

  • If loopback behavior is intended (no inter-rank path), add a comment in test_cache_transceiver.py clarifying why both mappings use rank 0.
  • If inter-rank paths should be tested here, update:
    • Line 53: self.mapping_0 = Mapping(world_size=2, tp_size=2, rank=0)
    • Line 54: self.mapping_1 = Mapping(world_size=2, tp_size=2, rank=1)
    • Pass mapping_0/mapping_1 into their respective transceiver calls.

1-1: Add the NVIDIA copyright header.

Per repo guidelines, prepend the current-year NVIDIA copyright header to all source files.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
 import unittest
⛔ Skipped due to learnings
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

@bo-nv bo-nv force-pushed the user/bo/main/test-cache-in-python branch from d85c858 to 5f9f34a on August 19, 2025 07:15
@bo-nv bo-nv requested a review from Shunkangz on August 19, 2025 09:57
@bo-nv bo-nv force-pushed the user/bo/main/test-cache-in-python branch from 5f9f34a to df03fc7 on August 20, 2025 08:04
@bo-nv bo-nv changed the title from "[TRTLLM-7096][infra] DO NOT REVIEW" to "[TRTLLM-7096][infra] Testing cache transmission functionality in Python" on Aug 20, 2025
@bo-nv
Collaborator Author

bo-nv commented Aug 20, 2025

@Shunkangz please help review, thanks

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (12)
tests/unittest/_torch/test_kv_cache_transceiver.py (12)

7-12: Prefer module-qualified imports to comply with “maintain module namespace” guideline.

Avoid importing classes/functions directly; import the module and qualify at use sites.

Apply this diff:

-from tensorrt_llm._torch.pyexecutor.kv_cache_transceiver import \
-    create_kv_cache_transceiver
-from tensorrt_llm._torch.pyexecutor.llm_request import LlmRequest
-from tensorrt_llm._torch.pyexecutor.resource_manager import KVCacheManager
-from tensorrt_llm.mapping import Mapping
-from tensorrt_llm.sampling_params import SamplingParams
+import tensorrt_llm._torch.pyexecutor.kv_cache_transceiver as kv_cache_transceiver
+import tensorrt_llm._torch.pyexecutor.llm_request as llm_request
+import tensorrt_llm._torch.pyexecutor.resource_manager as resource_manager
+import tensorrt_llm.mapping as mapping_mod
+import tensorrt_llm.sampling_params as sampling_params_mod

Follow-up use-site changes are suggested below.


19-33: Qualify KVCacheManager with its module after import refactor.

Apply this diff:

-def create_kv_cache_manager(mapping, dtype):
-    return KVCacheManager(
+def create_kv_cache_manager(mapping, dtype):
+    return resource_manager.KVCacheManager(
         trtllm.KvCacheConfig(
             max_tokens=256,
             enable_block_reuse=False,
         ),
         tensorrt_llm.bindings.internal.batch_manager.CacheType.SELF,
         num_layers=1,
         num_kv_heads=1,
         head_dim=1,
         tokens_per_block=8,
         max_seq_len=256,
         max_batch_size=1,
         mapping=mapping,
         dtype=dtype)

45-52: Parameterization looks good; consider guarding FP8 if unsupported on the runner.

FP8 may not be available on all CI GPU SKUs. Consider a conditional skip for FP8.

Example (if you want to keep it self-contained):

 @pytest.mark.timeout(120)
 @pytest.mark.parametrize("kv_cache_dtype",
                          [DataType.FP8, DataType.HALF, DataType.BF16],
                          ids=["fp8", "fp16", "bf16"])
 @pytest.mark.parametrize("attention_type",
                          [AttentionTypeCpp.DEFAULT, AttentionTypeCpp.MLA],
                          ids=["default_attention", "mla_attention"])
 def test_kv_cache_transceiver_single_process(kv_cache_dtype, attention_type):
+    if not torch.cuda.is_available():
+        pytest.skip("CUDA is required for KV cache transceiver test")
+    if kv_cache_dtype == DataType.FP8:
+        # TODO: Replace with a real runtime capability check if available.
+        # FP8 typically requires Hopper or later (e.g., H100).
+        major, minor = torch.cuda.get_device_capability()
+        if major < 9:
+            pytest.skip("Skipping FP8 on non-Hopper GPU")

If the project exposes a proper “is_dtype_supported” check, prefer that over capability heuristics.


53-57: Add a CUDA availability guard at the start of the test.

This avoids failures on CPU-only runners.

Apply this diff:

 def test_kv_cache_transceiver_single_process(kv_cache_dtype, attention_type):
-    # Init kv_cache manager and cache transceiver
+    # Init kv_cache manager and cache transceiver
+    if not torch.cuda.is_available():
+        pytest.skip("CUDA is required for KV cache transceiver test")

54-56: Use module-qualified Mapping to match import refactor.

Apply this diff:

-    mapping = Mapping(world_size=1, rank=0)
+    mapping = mapping_mod.Mapping(world_size=1, rank=0)

58-61: Minor: Align buffer size with cache config to avoid confusion.

max_tokens_in_buffer is set to 512 while KvCacheConfig sets max_tokens=256. This may be intentional, but consider aligning the two or documenting the discrepancy.
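
One way to keep the two limits in sync (sketch; only the KvCacheConfig construction is shown, since the transceiver-config constructor is not part of this hunk):

KV_CACHE_MAX_TOKENS = 256

kv_cache_config = trtllm.KvCacheConfig(max_tokens=KV_CACHE_MAX_TOKENS,
                                       enable_block_reuse=False)
# ...and pass KV_CACHE_MAX_TOKENS (or a deliberate multiple of it, with a
# comment) as max_tokens_in_buffer when building the cache transceiver config.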


62-67: Qualify function name after import refactor; also consider parentheses for line continuation.

Avoid backslashes for line breaks. Use parentheses.

Apply this diff:

-    kv_cache_transceiver_0 = create_kv_cache_transceiver(
-        mapping, kv_cache_manager_0, attention_type, cache_transceiver_config)
+    kv_cache_transceiver_0 = kv_cache_transceiver.create_kv_cache_transceiver(
+        mapping, kv_cache_manager_0, attention_type, cache_transceiver_config)
 
-    kv_cache_transceiver_1 = create_kv_cache_transceiver(
-        mapping, kv_cache_manager_1, attention_type, cache_transceiver_config)
+    kv_cache_transceiver_1 = kv_cache_transceiver.create_kv_cache_transceiver(
+        mapping, kv_cache_manager_1, attention_type, cache_transceiver_config)

71-80: Use module-qualified SamplingParams and LlmRequest after import refactor.

Apply this diff:

-    sampling_params = SamplingParams()
-    ctx_request = LlmRequest(
+    sampling_params = sampling_params_mod.SamplingParams()
+    ctx_request = llm_request.LlmRequest(
         request_id=0,
         max_new_tokens=1,
         input_tokens=list(range(256)),
         sampling_config=tensorrt_llm.bindings.SamplingConfig(
             sampling_params._get_sampling_config()),
         is_streaming=False,
         llm_request_type=LlmRequestType.LLMREQUEST_TYPE_CONTEXT_ONLY)

87-96: Use module-qualified LlmRequest and unique request_id for the generation request.

Reusing request_id=0 for both context and generation may collide depending on internal tracking. Safer to use a distinct id.

Apply this diff:

-    gen_request = LlmRequest(
-        request_id=0,
+    gen_request = llm_request.LlmRequest(
+        request_id=1,
         max_new_tokens=1,
         input_tokens=list(range(256)),
         sampling_config=tensorrt_llm.bindings.SamplingConfig(
             sampling_params._get_sampling_config()),
         is_streaming=False,
         llm_request_type=LlmRequestType.LLMREQUEST_TYPE_GENERATION_ONLY,
         context_phase_params=ctx_request.context_phase_params)

97-101: Avoid internal .impl if possible; otherwise OK. Also, consider synchronizing before status checks on CUDA.

Apply this diff to add a CUDA sync (helps prevent race with async ops on some backends):

     kv_cache_manager_1.impl.add_sequence(gen_request.py_request_id,
                                          gen_request.prompt_len, 1, gen_request)
     # send gen request
     kv_cache_transceiver_1.request_and_receive_async(gen_request)
 
+    # Ensure any device-side work is completed before polling status.
+    if torch.cuda.is_available():
+        torch.cuda.synchronize()

If a public add_sequence exists, prefer it over .impl.add_sequence(...).


102-106: Prefer explicit assertions for transfer status (and verify return semantics).

The current code raises on truthy return; assert the expected value instead. Also confirm the meaning of the return value for check_*_transfer_status.

Apply this diff:

-    if kv_cache_transceiver_0.check_context_transfer_status(1):
-        raise Exception("Context transfer status is not 0")
-    if kv_cache_transceiver_1.check_gen_transfer_status(1):
-        raise Exception("Gen transfer status is not 0")
+    assert kv_cache_transceiver_0.check_context_transfer_status(1) == 0, \
+        "Context transfer status is not 0"
+    assert kv_cache_transceiver_1.check_gen_transfer_status(1) == 0, \
+        "Gen transfer status is not 0"

107-110: Use tolerance-based comparison for numeric buffers.

FP8/FP16/BF16 may incur rounding differences; torch.equal requires exact bitwise match. Use allclose on a common dtype for robustness.

Apply this diff:

-    assert torch.equal(
-        kv_cache_manager_1.get_buffers(0),
-        kv_cache_manager_0.get_buffers(0)), "different kv-cache values"
+    lhs = kv_cache_manager_1.get_buffers(0).to(torch.float32)
+    rhs = kv_cache_manager_0.get_buffers(0).to(torch.float32)
+    assert torch.allclose(lhs, rhs, rtol=1e-3, atol=1e-3), "different kv-cache values"
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 5f9f34a and df03fc7.

📒 Files selected for processing (1)
  • tests/unittest/_torch/test_kv_cache_transceiver.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tests/unittest/_torch/test_kv_cache_transceiver.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tests/unittest/_torch/test_kv_cache_transceiver.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tests/unittest/_torch/test_kv_cache_transceiver.py (1)

81-85: Avoid relying on internal .impl attribute unless there’s no supported public API.

Direct access to kv_cache_manager.impl.add_sequence(...) couples the test to internal implementation details and may be brittle.

If a public method exists (e.g., kv_cache_manager.add_sequence(...)), prefer that. If not, consider adding a thin public wrapper to exercise the intended API surface.
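
For illustration, such a thin wrapper might look like the following (hypothetical method; parameter names follow the .impl.add_sequence call sites quoted in this review, and the real class may already expose an equivalent):

    def add_sequence(self, request_id, prompt_len, beam_width, request):
        # Public pass-through so tests do not need to reach into .impl.
        return self.impl.add_sequence(request_id, prompt_len, beam_width, request)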

@bo-nv bo-nv force-pushed the user/bo/main/test-cache-in-python branch from df03fc7 to 4d44bab on August 21, 2025 04:39
@bo-nv
Collaborator Author

bo-nv commented Aug 21, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15990 [ run ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (6)
tests/unittest/others/test_kv_cache_transceiver.py (6)

3-4: Gracefully skip when TensorRT-LLM is unavailable.

Use pytest.importorskip before importing tensorrt_llm to avoid collection failures in environments without TRT-LLM.

 import pytest
 import torch
 
+pytest.importorskip("tensorrt_llm")
 import tensorrt_llm

7-13: Prefer module-namespace imports for internal packages.

Guidelines recommend keeping the module namespace (use module.symbol), which also avoids polluting the test’s global namespace. Optional, but improves clarity when skimming tests.

Example (not exhaustive):

-from tensorrt_llm._torch.pyexecutor.kv_cache_transceiver import \
-    create_kv_cache_transceiver
-from tensorrt_llm._torch.pyexecutor.llm_request import LlmRequest
-from tensorrt_llm._torch.pyexecutor.resource_manager import KVCacheManager
+from tensorrt_llm._torch.pyexecutor import kv_cache_transceiver as kv_tx
+from tensorrt_llm._torch.pyexecutor import llm_request as llmr
+from tensorrt_llm._torch.pyexecutor import resource_manager as rm

And then use:

  • kv_tx.create_kv_cache_transceiver(...)
  • llmr.LlmRequest(...)
  • rm.KVCacheManager(...)

19-33: Type hints and a brief docstring for the helper.

Minor polish: add typing to make test intent clearer and IDE-friendly.

-def create_kv_cache_manager(mapping, dtype):
+def create_kv_cache_manager(mapping: Mapping, dtype: DataType) -> KVCacheManager:
+    """Create a minimal single-layer KVCacheManager for testing."""
     return KVCacheManager(
         trtllm.KvCacheConfig(
             max_tokens=256,
             enable_block_reuse=False,
         ),
         tensorrt_llm.bindings.internal.batch_manager.CacheType.SELF,
         num_layers=1,
         num_kv_heads=1,
         head_dim=1,
         tokens_per_block=8,
         max_seq_len=256,
         max_batch_size=1,
         mapping=mapping,
         dtype=dtype)

45-55: Fixture indirection is unnecessary; parametrize directly over dtype pairs.

Current indirection via strings makes the matrix harder to extend. Consider parameterizing with tuples of DataType directly and dropping the fixture.

Example:

@pytest.mark.parametrize(
    "ctx_kv_cache_dtype,gen_kv_cache_dtype",
    [
        (DataType.FP8, DataType.FP8),
        (DataType.HALF, DataType.HALF),
        (DataType.BF16, DataType.BF16),
    ],
    ids=["ctx_fp8_gen_fp8", "ctx_fp16_gen_fp16", "ctx_bf16_gen_bf16"],
)
def test_kv_cache_transceiver_single_process(ctx_kv_cache_dtype, gen_kv_cache_dtype, attention_type):
    ...

84-85: Initialize the cache after sequence allocation, before sending.

Filling the buffer prior to add_sequence risks writing into memory not yet associated with the sequence (implementation-dependent). It is safer to fill after the sequence is added and before the send.

-    fill_kv_cache_buffer(kv_cache_manager_ctx)
+    # Fill after sequence allocation to ensure buffers belong to the sequence

@@
     kv_cache_manager_ctx.impl.add_sequence(ctx_request.py_request_id,
                                            ctx_request.prompt_len, 1,
                                            ctx_request)
+    # Fill KV cache with known data for transfer verification
+    fill_kv_cache_buffer(kv_cache_manager_ctx)
     # send ctx request
     kv_cache_transceiver_ctx.respond_and_send_async(ctx_request)

Also applies to: 97-101


88-116: Minor readability: avoid magic numbers and add inline comments for “1”.

The third argument "1" in add_sequence is unclear. Consider naming a local variable (e.g., num_sequences=1 or beam_width=1) or adding a brief comment.

-    kv_cache_manager_ctx.impl.add_sequence(ctx_request.py_request_id,
-                                           ctx_request.prompt_len, 1,
-                                           ctx_request)
+    # The third arg is <beam_width>=1
+    kv_cache_manager_ctx.impl.add_sequence(ctx_request.py_request_id,
+                                           ctx_request.prompt_len, 1,
+                                           ctx_request)

Repeat for the gen side.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between df03fc7 and 4d44bab.

📒 Files selected for processing (1)
  • tests/unittest/others/test_kv_cache_transceiver.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tests/unittest/others/test_kv_cache_transceiver.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tests/unittest/others/test_kv_cache_transceiver.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tests/unittest/others/test_kv_cache_transceiver.py (2)

57-66: Good parametrization coverage.

Covering MHA and MLA plus multiple dtypes is valuable. The 120s timeout is reasonable for CI.


97-99: The script above will dump:

  • The first 150 lines of the test (test_kv_cache_transceiver.py) so we can see how kv_cache_manager_ctx and its .impl.add_sequence calls are structured.
  • The public API on KVCacheManager (runtime/kv_cache_manager.py) around its add_sequence method.
  • The implementation in pools_kv_cache_manager.py to compare signatures.

Once we confirm that KVCacheManager exposes a public add_sequence(...) with a matching signature, we can update the tests to call kv_cache_manager_ctx.add_sequence(...) (and similarly for kv_cache_manager_gen) instead of reaching into .impl.

I’ll review the outputs to verify the correct signature and then draft the revised test snippets.

@tensorrt-cicd
Collaborator

PR_Github #15990 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12018 completed with status: 'FAILURE'

@Shunkangz Shunkangz force-pushed the user/bo/main/test-cache-in-python branch from 4d44bab to 6fb4f74 on August 21, 2025 10:29
@Shunkangz
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #16047 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16047 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12068 completed with status: 'FAILURE'

@bo-nv
Collaborator Author

bo-nv commented Aug 22, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16125 [ run ] triggered by Bot

bo-nv added 3 commits August 22, 2025 14:48
Signed-off-by: Bo Deng <deemod@nvidia.com>

fix

Signed-off-by: Bo Deng <deemod@nvidia.com>
Signed-off-by: Bo Deng <deemod@nvidia.com>
Signed-off-by: Bo Deng <deemod@nvidia.com>
@bo-nv bo-nv force-pushed the user/bo/main/test-cache-in-python branch from 6fb4f74 to fda1432 on August 22, 2025 06:48
@bo-nv
Collaborator Author

bo-nv commented Aug 22, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16141 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16125 [ run ] completed with state ABORTED

@bo-nv
Collaborator Author

bo-nv commented Aug 22, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16168 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16141 [ run ] completed with state ABORTED

@tensorrt-cicd
Collaborator

PR_Github #16168 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12167 completed with status: 'SUCCESS'
Pipeline passed with automatically retried tests. Check the rerun report for details.

Comment on lines +58 to +62
@pytest.mark.parametrize(
"ctx_gen_kv_cache_dtype",
["ctx_fp8_gen_fp8", "ctx_fp16_gen_fp16", "ctx_bf16_gen_bf16"],
ids=["ctx_fp8_gen_fp8", "ctx_fp16_gen_fp16", "ctx_bf16_gen_bf16"],
indirect=True)
Copy link
Collaborator


It's fine in this version. I can help to refactor this by parametrizing context and generation cache type with two separate inputs in the future PR.
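
A possible shape for that follow-up (sketch only; the DataType values mirror the existing fixture, and the test name and other parameters are kept as-is):

@pytest.mark.parametrize("ctx_kv_cache_dtype",
                         [DataType.FP8, DataType.HALF, DataType.BF16],
                         ids=["ctx_fp8", "ctx_fp16", "ctx_bf16"])
@pytest.mark.parametrize("gen_kv_cache_dtype",
                         [DataType.FP8, DataType.HALF, DataType.BF16],
                         ids=["gen_fp8", "gen_fp16", "gen_bf16"])
def test_kv_cache_transceiver_single_process(ctx_kv_cache_dtype,
                                             gen_kv_cache_dtype,
                                             attention_type):
    ...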

@bo-nv bo-nv enabled auto-merge (squash) August 25, 2025 01:47
@bo-nv bo-nv merged commit 6e13160 into NVIDIA:main Aug 25, 2025
5 checks passed