[TRTLLM-8189][chore] enhance GenerationExecutor with RPC (part1) #5543

Superjomn · 2025-06-27T09:00:51Z

This PR adds RPC support to the orchestrator to improve robustness and simplify the codebase.
The design is outlined in these slides.

Changes

1. Lightweight RPC module

Introduces a basic RPC implementation tailored to our current IPC use cases.
Supports three remote call modes:
- Synchronous call: rpc_client.some_func(args...).remote(rpc_args...)
- Asynchronous call: await rpc_client.some_async_func(args...).remote(rpc_args...)
- Async streaming call: async for x in rpc_client.some_async_iter_func(args...).remote(rpc_args...)
Designed to align with existing IPC logic for easier performance parity.
- Pure Python implementation, simple to optimize.
- Native pickle support ensures compatibility.
- Reuses existing components (e.g., ZeroMqQueue with HMAC).
- Built-in timeout and error propagation.
  - All remote calls return a result or raise RPCTimeout if they exceed the threshold.
Can be reused in other modules that require IPC.
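
The three call modes listed above can be illustrated with an in-process toy. This is only a sketch of the call shape, not the PR's implementation: `ToyRpcClient`, `_PendingCall`, and `Service` are hypothetical names, and no sockets, pickling, or timeouts are involved.

```python
import asyncio


class _PendingCall:
    """Holds a target function and its arguments until .remote() is invoked."""

    def __init__(self, func, args, kwargs):
        self._func = func
        self._args = args
        self._kwargs = kwargs

    def remote(self):
        # In this toy, "remote" simply invokes the target in-process. A plain
        # function returns its value, a coroutine function returns an awaitable,
        # and an async generator function returns an async iterator.
        return self._func(*self._args, **self._kwargs)


class ToyRpcClient:
    """Hypothetical stand-in for an RPC client (assumption, not the real class)."""

    def __init__(self, target):
        self._target = target

    def __getattr__(self, name):
        # Only called for attributes not found on the client itself.
        func = getattr(self._target, name)

        def bind(*args, **kwargs):
            return _PendingCall(func, args, kwargs)

        return bind


class Service:
    def add(self, a, b):
        return a + b

    async def add_async(self, a, b):
        await asyncio.sleep(0)
        return a + b

    async def count(self, n):
        for i in range(n):
            yield i


async def demo():
    client = ToyRpcClient(Service())
    sync_result = client.add(1, 2).remote()                    # synchronous call
    async_result = await client.add_async(3, 4).remote()       # asynchronous call
    streamed = [x async for x in client.count(3).remote()]     # async streaming call
    return sync_result, async_result, streamed


RESULTS = asyncio.run(demo())
```

A real client would serialize the method name and arguments, send them over the transport, and wait for the reply with a timeout. The toy only shows how one `.remote()` entry point can cover all three styles.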

2. RPC wrappers for GenerationExecutor

RpcWorker: Inherits from BaseWorker, exposing methods suitable for remote calls (e.g., fetch_responses_loop_async).
- Runs an MPI main_task, similar to GenerationExecutorWorker.worker_main, but initializes an RPCServer.
RpcProxy: Manages a single main_loop_thread for receiving responses from RpcWorker.

3. Coexistence with existing IPC

Adds orchestrator_type to LlmArgs. Default behavior remains unchanged.
- "RPC": Enables the new RPC path using MPI.
- "Ray": Activates the Ray-based path.
The RPC path will eventually replace the existing IPC once feature and performance parity is achieved.
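
The selection logic described above might look roughly like the following. This is a sketch under assumptions: `select_executor_path`, the returned labels, and the case-insensitive matching are all hypothetical and are not taken from the PR.

```python
from typing import Optional


def select_executor_path(orchestrator_type: Optional[str] = None) -> str:
    """Return an illustrative label for the executor path to use."""
    if orchestrator_type is None:
        # Default behavior stays on the existing IPC path.
        return "legacy-ipc"
    # Case-insensitive matching is an assumption made for this sketch.
    paths = {"rpc": "rpc-over-mpi", "ray": "ray"}
    try:
        return paths[orchestrator_type.lower()]
    except KeyError:
        raise ValueError(
            f"Unsupported orchestrator_type: {orchestrator_type!r}") from None
```

Keeping `None` as the default means existing callers stay on the legacy path until they opt in.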

Roadmap

- (This PR): RPC + LLM integration (single-GPU and multi-GPU)
- Add full functionality (e.g., fetch stats, KV events); enable in more tests
- Align performance with existing IPC
- Deprecate and remove the legacy IPC path

Control flow

RPC framework

The whole RPCProxy dataflow

The new RPC path coexists with the IPC path

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

Summary by CodeRabbit

New Features
- Introduced a flexible and scalable RPC framework enabling both synchronous and asynchronous remote procedure calls.
- Added new exception handling for RPC errors and timeouts.
- Provided dynamic client interfaces for seamless remote method invocation.
- Added a non-blocking asynchronous queue retrieval method with timeout support.
- Added an RPC proxy executor for managing remote generation workers via MPI and RPC communication.
- Introduced an RPC worker implementation enabling remote generation execution with graceful shutdown.
Refactor
- Modularized worker functionality by separating core worker logic into a new base class, improving clarity and extensibility.
- Simplified the worker implementation by removing complex engine and adapter management, streamlining request handling.
- Enhanced thread management to allow external control of thread stop events.
Tests
- Added comprehensive unit tests for the new RPC framework, including performance benchmarks, error handling, and server-client interactions.
- Introduced tests validating core worker functionality, including engine creation, request submission, and response handling.
- Added tests for RPC worker operations, verifying request submission, response retrieval, and clean shutdown.

coderabbitai · 2025-07-29T10:59:52Z

📝 Walkthrough

Walkthrough

The changes introduce a new asynchronous RPC framework using ZeroMQ, including both server and client implementations, and add corresponding unit tests. The worker implementation is refactored by extracting a new WorkerBase class to manage core worker responsibilities and engine interaction, while simplifying GenerationExecutorWorker to subclass WorkerBase with minimal logic. Additional minor improvements are made to thread management utilities and IPC queue handling. A new RPC proxy executor and RPC worker module are added to support remote execution via MPI and RPC.

Changes

| Cohort / File(s) | Change Summary |
| --- | --- |
| RPC Framework Implementation `tensorrt_llm/executor/rpc.py` | Introduces a complete asynchronous and synchronous RPC framework using ZeroMQ, including custom exceptions, request/response structures, and `RPCServer`/`RPCClient` classes with multiple invocation styles and concurrency support. |
| RPC Proxy Executor `tensorrt_llm/executor/rpc_proxy.py` | Adds `GenerationExecutorRpcProxy` class to manage remote generation workers via MPI sessions and RPC communication. Handles worker launching, remote engine creation, request submission, response awaiting, and periodic main loop for response and stats handling. |
| RPC Worker Module `tensorrt_llm/executor/rpc_worker.py` | Adds `RpcWorker` subclass of `WorkerBase` with shutdown event and a `main_task` static method to run an RPC server exposing the worker's API, coordinating lifecycle between worker and RPC service. |
| Worker Refactor and Base Class `tensorrt_llm/executor/worker_base.py` | Adds `WorkerBase` class managing engine initialization, request submission, adapter loading, response handling, and shutdown. Implements `AwaitResponseHelper` for asynchronous response processing supporting different handler modes, integrating LoRA and prompt adapters, multimodal inputs, and backend-specific logic. |
| Worker Simplification `tensorrt_llm/executor/worker.py` | Refactors `GenerationExecutorWorker` to subclass `WorkerBase` only, removing all engine creation, adapter management, request submission, response handling, and threading helper classes. Simplifies constructor and shutdown logic, removing MPI rank/device setup and backend-specific details. |
| IPC Queue Enhancement `tensorrt_llm/executor/ipc.py` | Adds `get_async_noblock` method to `ZeroMqQueue`, providing a non-blocking asynchronous queue get operation with a timeout. |
| Thread Management Utility `tensorrt_llm/llmapi/utils.py` | Updates `ManagedThread` to accept an optional external `stop_event` in the constructor, allowing for shared or externally managed stop signaling. |
| RPC Unit Tests `tests/unittest/executor/test_rpc.py` | Adds a comprehensive test suite for the new RPC framework, covering server/client lifecycle, synchronous and asynchronous calls, error propagation, fire-and-forget calls, performance, and benchmarking under varied configurations. |
| WorkerBase Unit Tests `tests/unittest/executor/test_worker_base.py` | Adds unit tests for `WorkerBase` class covering engine creation, request submission, and response awaiting, including utility for executor configuration setup. |
| RPC Worker Unit Tests `tests/unittest/executor/test_rpc_worker.py` | Adds tests for the RPC worker system including process pool execution of `RpcWorker`, RPC client interactions, request submission, response fetching, and shutdown handling. |

Sequence Diagram(s)

sequenceDiagram
    participant Client as RPCClient
    participant Server as RPCServer
    participant Worker as WorkerThread

    Client->>Server: Send RPC request (ZeroMQ)
    Server->>Worker: Dispatch request to worker queue
    Worker->>Worker: Execute method (possibly async/threaded)
    Worker-->>Server: Place response in response queue
    Server->>Client: Send response (ZeroMQ)
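
The dispatch flow in the diagram can be approximated with standard-library queues. This is a minimal sketch, assuming a single worker thread and in-memory queues in place of ZeroMQ sockets; `run_toy_server` and the `(request_id, method, args)` message shape are hypothetical.

```python
import queue
import threading


def run_toy_server(handlers, requests):
    """Dispatch (request_id, method, args) tuples to a worker and collect replies."""
    request_q = queue.Queue()
    response_q = queue.Queue()

    def worker():
        while True:
            item = request_q.get()
            if item is None:  # sentinel: stop the worker
                break
            request_id, method, args = item
            try:
                response_q.put((request_id, handlers[method](*args), None))
            except Exception as exc:  # forward any handler failure to the caller
                response_q.put((request_id, None, exc))

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    for request in requests:
        request_q.put(request)
    request_q.put(None)
    thread.join()

    # The worker has exited, so the response queue is no longer being written to.
    replies = {}
    while not response_q.empty():
        request_id, result, error = response_q.get()
        replies[request_id] = (result, error)
    return replies


REPLIES = run_toy_server(
    {"add": lambda a, b: a + b, "fail": lambda: 1 / 0},
    [(1, "add", (2, 3)), (2, "fail", ())],
)
```

Tagging each reply with its request id lets the caller match responses to requests even when they complete out of order.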

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

shaharmor98
byshiue
nv-guomingz


coderabbitai

Actionable comments posted: 4

🔭 Outside diff range comments (1)

tensorrt_llm/executor/worker.py (1)

684-832: Significant code duplication in _enqueue_request.

There's substantial duplication between WorkerBase._enqueue_request (lines 190-341) and GenerationExecutorWorker._enqueue_request (lines 684-832). The main differences are:

LoRA adapter caching optimization in GenerationExecutorWorker
Multimodal parameter handling differences
Some parameter differences for PyTorch backend

Consider refactoring to reduce duplication:

import copy

class WorkerBase:
    def _enqueue_request(self, request: "GenerationRequest") -> int:
        # Common validation
        assert request.id is not None

        # Allow subclasses to customize LoRA loading
        lora_config = self._prepare_lora_config(request)

        # Common token preparation
        prompt_token_ids = copy.deepcopy(request.prompt_token_ids)
        prompt_tuning_config = self._prepare_prompt_tuning(request, prompt_token_ids)

        # Allow subclasses to customize multimodal handling
        multimodal_params = self._prepare_multimodal_params(request)

        # Rest of common logic...

    def _prepare_lora_config(self, request):
        """Override in subclasses for custom LoRA handling."""
        # Default implementation

    def _prepare_multimodal_params(self, request):
        """Override in subclasses for custom multimodal handling."""
        # Default implementation

🧹 Nitpick comments (3)

tests/unittest/executor/test_rpc.py (1)

22-22: Consider making sleep times configurable or adaptive.

Fixed sleep times can cause flaky tests on slower systems or waste time on faster systems.

Consider using a more robust approach:

import time

def wait_for_server_ready(server, timeout=5.0, check_interval=0.1):
    """Wait for server to be ready with timeout."""
    # A monotonic clock keeps the timeout immune to wall-clock adjustments.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # Try to connect or check server status
            if hasattr(server, 'is_ready') and server.is_ready():
                return True
        except Exception:
            # Treat a failed probe as "not ready yet" and retry.
            pass
        time.sleep(check_interval)
    raise TimeoutError("Server did not become ready in time")

Also applies to: 37-37, 51-51, 68-68, 85-85, 99-99, 124-124, 127-127, 146-146, 180-180

tensorrt_llm/executor/rpc.py (2)

109-127: Consider thread-safe shutdown.

The shutdown method modifies shared state without synchronization. While the GIL provides some protection, explicit synchronization would be safer.

Consider using a lock for thread-safe shutdown:

 def __init__(self, ...):
+    self._shutdown_lock = threading.Lock()
     ...

 def shutdown(self):
     """Internal method to trigger server shutdown."""
+    with self._shutdown_lock:
+        if self._client_socket is None:
+            return  # Already shut down
         logger.debug(
             "RPC Server shutdown signal received. Terminating server...")
         ...

39-40: Minor: Fix docstring formatting.

Single-line docstrings should fit on one line per PEP 257.

-    """
-    An RPC Server that listens for requests and executes them concurrently.
-    """
+    """An RPC Server that listens for requests and executes them concurrently."""

-    """
-    An RPC Client that connects to the RPCServer.
-    """
+    """An RPC Client that connects to the RPCServer."""

Also applies to: 235-236

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f1086e7 and 5d2561f.

📒 Files selected for processing (5)

tensorrt_llm/executor/ipc.py (2 hunks)
tensorrt_llm/executor/rpc.py (1 hunks)
tensorrt_llm/executor/worker.py (7 hunks)
tensorrt_llm/llmapi/utils.py (1 hunks)
tests/unittest/executor/test_rpc.py (1 hunks)

🧰 Additional context used

📓 Path-based instructions (2)

**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a class in the constructor in Python.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

tensorrt_llm/llmapi/utils.py
tensorrt_llm/executor/ipc.py
tests/unittest/executor/test_rpc.py
tensorrt_llm/executor/rpc.py
tensorrt_llm/executor/worker.py

**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

tensorrt_llm/llmapi/utils.py
tensorrt_llm/executor/ipc.py
tests/unittest/executor/test_rpc.py
tensorrt_llm/executor/rpc.py
tensorrt_llm/executor/worker.py

🧠 Learnings (2)

tests/unittest/executor/test_rpc.py (1)

Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

tensorrt_llm/executor/worker.py (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

🪛 Ruff (0.12.2)

tensorrt_llm/executor/rpc.py

39-40: One-line docstring should fit on one line

Reformat to one line

(D200)

233-234: One-line docstring should fit on one line

Reformat to one line

(D200)

241-245: 1 blank line required between summary line and description

(D205)

395-397: 1 blank line required between summary line and description

(D205)

406-406: First line should end with a period, question mark, or exclamation point

Add closing punctuation

(D415)

410-410: First line should end with a period, question mark, or exclamation point

Add closing punctuation

(D415)

417-417: First line should end with a period, question mark, or exclamation point

Add closing punctuation

(D415)

tensorrt_llm/executor/worker.py

49-53: 1 blank line required between summary line and description

(D205)

96-96: First line should end with a period, question mark, or exclamation point

Add closing punctuation

(D415)

132-132: First line should end with a period, question mark, or exclamation point

Add closing punctuation

(D415)

347-347: First line should end with a period, question mark, or exclamation point

Add closing punctuation

(D415)

353-355: 1 blank line required between summary line and description

(D205)

382-382: First line should end with a period, question mark, or exclamation point

Add closing punctuation

(D415)

387-387: First line should end with a period, question mark, or exclamation point

Add closing punctuation

(D415)

392-392: First line should end with a period, question mark, or exclamation point

Add closing punctuation

(D415)

461-461: First line should end with a period, question mark, or exclamation point

Add closing punctuation

(D415)

509-512: 1 blank line required between summary line and description

(D205)

🔇 Additional comments (9)

tensorrt_llm/llmapi/utils.py (1)

246-253: LGTM! Good enhancement for thread lifecycle management.

The addition of the optional stop_event parameter provides flexibility for external thread control while maintaining backward compatibility. This is particularly useful for coordinating shutdown across multiple threads using a shared event.
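
To illustrate the shared-event pattern mentioned above, here is a small standard-library sketch. `StoppableThread` is a hypothetical stand-in and does not reproduce `ManagedThread`; it only shows how one externally owned `threading.Event` can stop several threads.

```python
import threading
from typing import Callable, Optional


class StoppableThread(threading.Thread):
    """Hypothetical thread that repeats `task` until its stop event is set."""

    def __init__(self, task: Callable[[], None],
                 stop_event: Optional[threading.Event] = None):
        super().__init__(daemon=True)
        self._task = task
        # Fall back to a private event when the caller does not supply one.
        self.stop_event = stop_event if stop_event is not None else threading.Event()

    def run(self):
        while not self.stop_event.is_set():
            self._task()
            self.stop_event.wait(0.01)  # short pause between iterations


shared_stop = threading.Event()
counts = {"a": 0, "b": 0}


def bump(key):
    def task():
        counts[key] += 1
    return task


threads = [StoppableThread(bump("a"), shared_stop),
           StoppableThread(bump("b"), shared_stop)]
for t in threads:
    t.start()
shared_stop.set()  # one signal stops both threads
for t in threads:
    t.join(timeout=1.0)
ALL_STOPPED = not any(t.is_alive() for t in threads)
```

Defaulting to a private event keeps existing call sites working, which matches the backward-compatibility point made in the comment.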

tensorrt_llm/executor/ipc.py (1)

1-1: LGTM! Useful addition for non-blocking async operations.

The new get_async_noblock method provides a clean way to perform async receives with timeout, which is essential for the RPC framework's non-blocking communication patterns.

Also applies to: 200-202
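
The idea of an async receive bounded by a timeout can be sketched with `asyncio` alone. This is an illustration under assumptions, not the `get_async_noblock` implementation: it uses an `asyncio.Queue` instead of a ZeroMQ socket, and `get_with_timeout` is a hypothetical helper.

```python
import asyncio


async def get_with_timeout(q: asyncio.Queue, timeout: float):
    """Return the next item, or raise asyncio.TimeoutError if none arrives in time."""
    return await asyncio.wait_for(q.get(), timeout=timeout)


async def demo():
    q = asyncio.Queue()
    try:
        await get_with_timeout(q, timeout=0.05)  # queue is empty, so this times out
        timed_out = False
    except asyncio.TimeoutError:
        timed_out = True
    await q.put("response")
    item = await get_with_timeout(q, timeout=0.05)
    return timed_out, item


TIMED_OUT, ITEM = asyncio.run(demo())
```

Because the wait is bounded, a polling loop built on such a call can check a shutdown flag between attempts instead of blocking indefinitely.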

tests/unittest/executor/test_rpc.py (1)

168-192: Good test coverage with parameterization!

The benchmark test effectively covers different configurations (async/sync, IPC/TCP) with proper assertions. The performance measurement is useful for regression testing.

tensorrt_llm/executor/rpc.py (2)

280-306: Good error handling in response reader!

The response reader properly handles cancellation, propagates exceptions to pending futures, and cleans up when done. The error propagation pattern is well implemented.
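
The pattern described here, resolving pending futures and failing the remainder when the reader stops, can be sketched as follows. This is a simplified illustration with hypothetical names (`read_responses`, the `(request_id, result, error)` message shape, and the use of `ConnectionError`); the code in `rpc.py` may differ.

```python
import asyncio


async def read_responses(inbox: asyncio.Queue, pending: dict):
    """Resolve pending futures from (request_id, result, error) messages."""
    try:
        while True:
            request_id, result, error = await inbox.get()
            future = pending.pop(request_id, None)
            if future is None or future.done():
                continue
            if error is not None:
                future.set_exception(error)
            else:
                future.set_result(result)
    finally:
        # On cancellation or failure, fail whatever is still waiting so
        # callers do not hang.
        for future in pending.values():
            if not future.done():
                future.set_exception(ConnectionError("reader stopped"))
        pending.clear()


async def demo():
    loop = asyncio.get_running_loop()
    inbox, pending = asyncio.Queue(), {}
    ok, bad, orphan = (loop.create_future() for _ in range(3))
    pending.update({1: ok, 2: bad, 3: orphan})
    reader = asyncio.create_task(read_responses(inbox, pending))
    await inbox.put((1, "done", None))
    await inbox.put((2, None, ValueError("remote failure")))
    value = await ok
    try:
        await bad
        bad_error = None
    except ValueError as exc:
        bad_error = str(exc)
    reader.cancel()
    try:
        await reader
    except asyncio.CancelledError:
        pass
    return value, bad_error, type(orphan.exception()).__name__


OUTCOME = asyncio.run(demo())
```

The `finally` block is what keeps a caller that is awaiting an unanswered request from waiting forever after the reader exits.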

403-431: Clever use of dynamic proxy pattern!

The __getattr__ implementation with MethodProxy provides an elegant API that supports multiple calling styles (sync, async, future). This makes the client very user-friendly.

tensorrt_llm/executor/worker.py (4)

48-54: Excellent refactoring with WorkerBase extraction!

The introduction of WorkerBase provides a clean separation between core worker logic and transport/threading concerns. This modular design will make it easier to implement different worker types (e.g., RPC-based workers) in the future.

571-584: Good defensive programming with queue checks!

The additional checks for queue availability and methods prevent crashes when queues are not properly initialized. The early return pattern is clean.

844-869: Good override of submit method.

The override properly calls the parent class logic while passing self as the executor, maintaining the correct inheritance chain.

893-893: Clean shutdown delegation.

Using _basic_shutdown() from the base class for engine shutdown while handling thread management in the subclass is a good separation of concerns.


coderabbitai

Actionable comments posted: 2

🔭 Outside diff range comments (1)

tensorrt_llm/executor/worker.py (1)

118-126: Add validation for client_id in return_queue.

While the method now checks result_queue, it still assumes client_id exists in the _results dictionary when it accesses it on line 125.

 def return_queue(self, client_id: int):
     """ If a centralized result queue is registered (used for communication with the proxy)
         send the message there.
         Otherwise, push the result directly in the GenerationResult queue.
     """
     if self.result_queue is not None:
         return self.result_queue
+    if client_id not in self._results:
+        raise KeyError(f"No result found for client_id {client_id}")
     return self._results[client_id].queue

♻️ Duplicate comments (2)

tensorrt_llm/executor/rpc.py (2)

84-88: Add validation for socket initialization.

The address property assumes _client_socket is not None, but this could be accessed before bind() is called.

263-274: Potential race condition in client cleanup.

The __del__ method calls close() which cancels _reader_task, but the task might still be executing when the client is garbage collected.

🧹 Nitpick comments (2)

tensorrt_llm/executor/rpc.py (1)
38-42: Fix docstring formatting for consistency.

Several docstrings should be reformatted according to the project's style guide.
 class RPCServer:
-    """
-    An RPC Server that listens for requests and executes them concurrently.
-    """
+    """An RPC Server that listens for requests and executes them concurrently."""
Also apply similar fixes to:

- Lines 235-238: RPCClient docstring
- Lines 245-250: __init__ docstring
- Lines 344-359: call_async docstring
- Lines 405-407: __getattr__ docstring
tensorrt_llm/executor/worker_base.py (1)
211-233: Fix long lines for better readability.

Several lines exceed the 120-character limit, making the code harder to read.

For example, line 211 can be split:
-        # NOTE: Since, we only support PyTorch backend for multimodal, we will send multimodal_data through the 'py_multimodal_data' field
+        # NOTE: Since we only support PyTorch backend for multimodal, we will send multimodal_data 
+        # through the 'py_multimodal_data' field
Similar fixes should be applied to lines 233, 239, and 297.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5d2561f and 6c709ce.

📒 Files selected for processing (6)

tensorrt_llm/executor/ipc.py (2 hunks)
tensorrt_llm/executor/rpc.py (1 hunks)
tensorrt_llm/executor/worker.py (3 hunks)
tensorrt_llm/executor/worker_base.py (1 hunks)
tensorrt_llm/llmapi/utils.py (1 hunks)
tests/unittest/executor/test_rpc.py (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (3)

tensorrt_llm/llmapi/utils.py
tensorrt_llm/executor/ipc.py
tests/unittest/executor/test_rpc.py

🧰 Additional context used

📓 Path-based instructions (2)

**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without reflection.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

tensorrt_llm/executor/worker.py
tensorrt_llm/executor/worker_base.py
tensorrt_llm/executor/rpc.py

**/*.{cpp,h,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

tensorrt_llm/executor/worker.py
tensorrt_llm/executor/worker_base.py
tensorrt_llm/executor/rpc.py

🧠 Learnings (3)

tensorrt_llm/executor/worker.py (2)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

tensorrt_llm/executor/worker_base.py (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

tensorrt_llm/executor/rpc.py (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

🪛 Ruff (0.12.2)

tensorrt_llm/executor/worker_base.py

164-164: First line should end with a period, question mark, or exclamation point

Add closing punctuation

(D415)

211-211: Line too long (138 > 120)

(E501)

233-233: Line too long (153 > 120)

(E501)

239-239: Line too long (132 > 120)

(E501)

297-297: Line too long (165 > 120)

(E501)

393-393: Undefined name GenerationExecutorWorker

(F821)

531-532: One-line docstring should fit on one line

Reformat to one line

(D200)

tensorrt_llm/executor/rpc.py

39-40: One-line docstring should fit on one line

Reformat to one line

(D200)

233-234: One-line docstring should fit on one line

Reformat to one line

(D200)

241-245: 1 blank line required between summary line and description

(D205)

395-397: 1 blank line required between summary line and description

(D205)

406-406: First line should end with a period, question mark, or exclamation point

Add closing punctuation

(D415)

410-410: First line should end with a period, question mark, or exclamation point

Add closing punctuation

(D415)

417-417: First line should end with a period, question mark, or exclamation point

Add closing punctuation

(D415)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pre-commit Check

tensorrt_llm/executor/rpc.py

tensorrt_llm/executor/worker_base.py

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (2)

tensorrt_llm/executor/worker.py (1)
124-131: Add validation for client_id existence.

The return_queue method could raise a KeyError if client_id doesn't exist in _results.

Add validation:
 def return_queue(self, client_id: int):
     """ If a centralized result queue is registered (used for communication with the proxy)
         send the message there.
         Otherwise, push the result directly in the GenerationResult queue.
     """
     if self.result_queue is not None:
         return self.result_queue
+    if client_id not in self._results:
+        raise KeyError(f"No result found for client_id {client_id}")
     return self._results[client_id].queue
tensorrt_llm/executor/worker_base.py (1)
395-395: Fix undefined name in type hint.

The type hint references GenerationExecutorWorker which is not imported in this file.
-    def __init__(self, worker: "GenerationExecutorWorker"):
+    def __init__(self, worker: "WorkerBase"):
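
If the annotation is meant to keep naming the concrete worker class, an alternative to renaming it is to import that class for type checkers only. The sketch below is an illustration, not the change proposed above: `decimal.Decimal` is a stand-in for the real class, and `ResponseHelper` is a hypothetical name.

```python
import typing

if typing.TYPE_CHECKING:
    # Only static type checkers evaluate this branch, so the import cannot
    # introduce a runtime import cycle. `decimal` is a stand-in module here.
    import decimal


class ResponseHelper:
    """Hypothetical helper that keeps a reference to its worker."""

    def __init__(self, worker: "decimal.Decimal"):
        # The quoted annotation is stored as a plain string and is never
        # evaluated at runtime.
        self.worker = worker


helper = ResponseHelper(worker=object())
```

Because the annotation stays a string, the module imports cleanly even though `decimal` is never loaded at runtime.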

🧹 Nitpick comments (2)

tensorrt_llm/executor/worker_base.py (2)
69-70: Address TODO: Consider using weakref for circular reference prevention.

The TODO comment suggests making the worker reference a weakref to prevent potential circular references.

Would you like me to implement the weakref pattern to prevent potential memory leaks from circular references between WorkerBase and AwaitResponseHelper?
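
A minimal sketch of the weakref pattern the TODO refers to, assuming the helper only needs to reach the worker while the worker is alive. `Worker` and `ResponseHelper` are hypothetical stand-ins for WorkerBase and AwaitResponseHelper, not the actual classes.

```python
import weakref


class ResponseHelper:
    """Hypothetical stand-in for AwaitResponseHelper."""

    def __init__(self, worker):
        # A weak reference does not keep the worker alive, so the
        # worker <-> helper pair never forms a strong reference cycle.
        self._worker_ref = weakref.ref(worker)

    @property
    def worker(self):
        worker = self._worker_ref()
        if worker is None:
            raise ReferenceError("worker has already been garbage collected")
        return worker


class Worker:
    """Hypothetical stand-in for WorkerBase."""

    def __init__(self):
        # The worker owns the helper; the helper only points back weakly.
        self.helper = ResponseHelper(self)


worker = Worker()
assert worker.helper.worker is worker
```

The property raises a clear error if the helper outlives its worker, instead of failing later on a `None` attribute access.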

533-534: Minor: Reformat docstring to single line.

The docstring can fit on a single line for better readability.
-    """Compute logprob and prompt logprob and clear out logits if applicable.
-    """
+    """Compute logprob and prompt logprob and clear out logits if applicable."""

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6c709ce and cbfa09e.

📒 Files selected for processing (2)

tensorrt_llm/executor/worker.py (3 hunks)
tensorrt_llm/executor/worker_base.py (1 hunks)

🧰 Additional context used

📓 Path-based instructions (2)

**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without reflection.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

tensorrt_llm/executor/worker.py
tensorrt_llm/executor/worker_base.py

**/*.{cpp,h,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

tensorrt_llm/executor/worker.py
tensorrt_llm/executor/worker_base.py

🧠 Learnings (2)

tensorrt_llm/executor/worker.py (2)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

tensorrt_llm/executor/worker_base.py (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

🪛 Ruff (0.12.2)

tensorrt_llm/executor/worker_base.py

166-166: First line should end with a period, question mark, or exclamation point

Add closing punctuation

(D415)

213-213: Line too long (138 > 120)

(E501)

235-235: Line too long (153 > 120)

(E501)

241-241: Line too long (132 > 120)

(E501)

299-299: Line too long (165 > 120)

(E501)

395-395: Undefined name GenerationExecutorWorker

(F821)

533-534: One-line docstring should fit on one line

Reformat to one line

(D200)

🔇 Additional comments (4)

tensorrt_llm/executor/worker.py (2)

42-75: Clean refactoring of the constructor.

The constructor has been properly simplified to delegate initialization to WorkerBase and maintains the necessary initialization sequence.

257-279: Proper delegation to parent shutdown.

The shutdown method correctly maintains thread management while delegating core shutdown logic to WorkerBase.

tensorrt_llm/executor/worker_base.py (2)

339-342: Helpful error message for common MPI usage mistake.

The error message provides clear guidance on how to fix the common mistake of calling generate from non-rank-0 processes.

403-430: Clean handler pattern implementation.

The use of enum-based handler kinds with pattern matching provides a clean and extensible design for different response handling modes.
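
One caveat: the `match` statement requires Python 3.10, while the coding guidelines quoted in this review target Python 3.8+. If older interpreters matter, the same enum-based dispatch can be written with a lookup table. This is a sketch under that assumption; `HandlerKind`, `ResponseDispatcher`, and the two handler names are hypothetical and not taken from worker_base.py.

```python
import enum


class HandlerKind(enum.Enum):
    """Hypothetical handler kinds for the two response modes."""
    SINGLE_PROCESS = enum.auto()
    IPC_BATCHED = enum.auto()


class ResponseDispatcher:
    """Hypothetical dispatcher that routes responses by handler kind."""

    def __init__(self):
        self.delivered = []
        # Map each kind to a bound method; adding a mode is one new entry.
        self._handlers = {
            HandlerKind.SINGLE_PROCESS: self._deliver_one_by_one,
            HandlerKind.IPC_BATCHED: self._deliver_as_batch,
        }

    def handle(self, kind, responses):
        self._handlers[kind](responses)

    def _deliver_one_by_one(self, responses):
        # Each response is forwarded individually.
        for response in responses:
            self.delivered.append(response)

    def _deliver_as_batch(self, responses):
        # All responses are forwarded as a single batch.
        self.delivered.append(list(responses))
```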

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

tensorrt_llm/executor/worker.py (1)

116-123: Missing client_id validation in return_queue.

The return_queue method assumes client_id exists in _results without validation, which could cause a KeyError.

 def return_queue(self, client_id: int):
     """ If a centralized result queue is registered (used for communication with the proxy)
         send the message there.
         Otherwise, push the result directly in the GenerationResult queue.
     """
     if self.result_queue is not None:
         return self.result_queue
+    if client_id not in self._results:
+        raise KeyError(f"No result found for client_id {client_id}")
     return self._results[client_id].queue

🧹 Nitpick comments (7)

tests/unittest/executor/test_worker_base.py (3)
24-24: Remove unused variable assignment.

The worker variable is assigned but never used in the test. Since the test only verifies that the WorkerBase can be instantiated, you can simplify it.
 def test_create_engine(self):
-    with WorkerBase(engine=model_path) as worker:
+    with WorkerBase(engine=model_path):
         pass
43-46: Consider making the test more deterministic.

The test uses a fixed loop count and only checks if the queue is non-empty. Consider making it more specific about expected outcomes.
-        for i in range(10):
+        # Poll until a response arrives or the iteration budget runs out
+        max_iterations = 20
+        for _ in range(max_iterations):
             worker.await_responses()
+            if result_queue.qsize() > 0:
+                break
 
         assert result_queue.qsize() > 0
+        # Optionally, verify the response content
+        response = result_queue.get()
+        assert response is not None
48-67: Remove unused helper method.

The _create_executor_config method is defined but never used in the tests. Consider removing it to keep the test file clean.

If this method is intended for future use, consider adding a test that uses it or document why it's needed.
tensorrt_llm/executor/worker_base.py (4)
81-137: Comprehensive engine creation with multiple backend support.

The create_engine method properly handles different backend types and MPI configuration. Consider breaking this into smaller methods in future refactoring for better maintainability.

234-234: Consider breaking long lines for better readability.

Several lines exceed the 120 character limit. While not critical, consider breaking them up for better readability.

For example, line 234 could be split:
-        # NOTE: Since, we only support PyTorch backend for multimodal, we will send multimodal_data through the 'py_multimodal_data' field
+        # NOTE: Since, we only support PyTorch backend for multimodal, we will send
+        # multimodal_data through the 'py_multimodal_data' field
Also applies to: 256-256, 262-262, 320-320, 398-398

187-187: Add closing punctuation to docstring.

The docstring should end with proper punctuation.
-    """Returns True if the adapter was loaded by this call, False if it was already loaded"""
+    """Returns True if the adapter was loaded by this call, False if it was already loaded."""
571-572: Reformat one-line docstring.

The docstring should fit on a single line per PEP 257.
-    """Compute logprob and prompt logprob and clear out logits if applicable.
-    """
+    """Compute logprob and prompt logprob and clear out logits if applicable."""

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between cbfa09e and a61f1ca.

📒 Files selected for processing (3)

tensorrt_llm/executor/worker.py (4 hunks)
tensorrt_llm/executor/worker_base.py (1 hunks)
tests/unittest/executor/test_worker_base.py (1 hunks)

🧰 Additional context used

📓 Path-based instructions (2)

**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Files:

tests/unittest/executor/test_worker_base.py
tensorrt_llm/executor/worker.py
tensorrt_llm/executor/worker_base.py

**/*.{cpp,h,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Files:

tests/unittest/executor/test_worker_base.py
tensorrt_llm/executor/worker.py
tensorrt_llm/executor/worker_base.py

🧠 Learnings (3)

tests/unittest/executor/test_worker_base.py (2)

Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

tensorrt_llm/executor/worker.py (2)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

tensorrt_llm/executor/worker_base.py (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

🪛 Ruff (0.12.2)

tests/unittest/executor/test_worker_base.py

24-24: Local variable worker is assigned to but never used

Remove assignment to unused variable worker

(F841)

tensorrt_llm/executor/worker_base.py

187-187: First line should end with a period, question mark, or exclamation point

Add closing punctuation

(D415)

234-234: Line too long (138 > 120)

(E501)

256-256: Line too long (153 > 120)

(E501)

262-262: Line too long (132 > 120)

(E501)

320-320: Line too long (165 > 120)

(E501)

398-398: Line too long (121 > 120)

(E501)

571-572: One-line docstring should fit on one line

Reformat to one line

(D200)

🔇 Additional comments (5)

tensorrt_llm/executor/worker.py (2)

37-73: Clean refactoring to inherit from WorkerBase.

The refactoring properly delegates core functionality to the WorkerBase parent class while maintaining worker-specific thread management. Good separation of concerns.

112-114: Proper delegation to parent class submit method.

The submit method correctly ensures threads are started before delegating to the parent's implementation.

tensorrt_llm/executor/worker_base.py (3)

39-80: Well-structured base class initialization.

The WorkerBase class is well-designed with clear separation of responsibilities and proper initialization of core components. Good use of weakref to avoid circular references.

208-218: Known race condition in LoRA adapter cache optimization.

This code maintains the existing LoRA cache optimization logic with its known race condition limitation, as documented in previous reviews. This is acceptable as a more comprehensive solution would require significant architectural changes.

427-461: Well-designed response handling with multiple modes.

The AwaitResponseHelper class provides a clean abstraction for different response handling modes (single process vs IPC batched). The type hint is correct.

tests/unittest/executor/test_worker_base.py

coderabbitai

Actionable comments posted: 7

🧹 Nitpick comments (3)

tensorrt_llm/executor/rpc_proxy.py (3)
33-42: Fix docstring formatting

The docstring doesn't follow proper formatting guidelines as indicated by static analysis.
-        """
+        """Initialize the RPC proxy executor.
+
         Args:
             worker_kwargs: kwargs for the rpc worker
             model_world_size: the world size of the model
             mpi_session: the mpi session to use
             postproc_worker_config: the postproc worker config
             is_llm_executor: whether this is an llm executor
             garbage_collection_gen0_threshold: the garbage collection gen0 threshold
-            clock_unit: the unit of the clock, 1 means 1 second
+            clock_unit: the unit of the clock, 1 means 1 second.
         """
89-90: Incomplete implementations with TODOs

Several methods have TODO comments or raise NotImplementedError, indicating incomplete functionality:

Line 89: get_stats_remote() has TODO

Line 131: handle_stats() raises NotImplementedError

Line 141: create_engine_remote() has TODO

Would you like me to help implement these methods or create issues to track their completion?

Also applies to: 130-131, 141-141

101-129: Add type annotations for better code clarity

The method's return annotation is bool, which the suggestion below changes to None, and the nested function could benefit from type hints.
-    def handle_responses(self, responses: list[GenerationResult]) -> bool:
+    def handle_responses(self, responses: list[GenerationResult]) -> None:
         async_queues = []
         event_loop = None

-        def process_res(res):
+        def process_res(res: GenerationResult) -> None:

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a61f1ca and 33c3dbb.

📒 Files selected for processing (2)

tensorrt_llm/executor/rpc_proxy.py (1 hunks)
tensorrt_llm/executor/rpc_worker.py (1 hunks)

🧰 Additional context used

📓 Path-based instructions (2)

**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Files:

tensorrt_llm/executor/rpc_proxy.py
tensorrt_llm/executor/rpc_worker.py

**/*.{cpp,h,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Files:

tensorrt_llm/executor/rpc_proxy.py
tensorrt_llm/executor/rpc_worker.py

🧠 Learnings (1)

tensorrt_llm/executor/rpc_worker.py (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

🪛 Ruff (0.12.2)

tensorrt_llm/executor/rpc_proxy.py

33-41: 1 blank line required between summary line and description

(D205)

33-41: First line should end with a period, question mark, or exclamation point

Add closing punctuation

(D415)

78-79: One-line docstring should fit on one line

Reformat to one line

(D200)

🔇 Additional comments (1)

tensorrt_llm/executor/rpc_worker.py (1)

14-30: Well-designed RPC worker implementation

The RpcWorker class cleanly extends WorkerBase with minimal RPC-specific functionality. The shutdown event coordination pattern is appropriate for managing the lifecycle between the worker and RPC service.

tensorrt_llm/executor/rpc_proxy.py

tensorrt_llm/executor/rpc_worker.py

coderabbitai

Actionable comments posted: 8

🔭 Outside diff range comments (1)

tests/unittest/executor/test_rpc_worker.py (1)
97-100: Use unittest test runner for proper test execution

Replace the manual test execution with unittest's test runner for better test discovery and reporting.
 if __name__ == '__main__':
-    worker = TestRpcWorker()
-    worker.test_main()
+    unittest.main()

♻️ Duplicate comments (9)

tests/unittest/executor/test_rpc.py (1)

17-17: Use unique IPC paths to avoid test conflicts

Hard-coded IPC paths can cause conflicts when tests run in parallel.

Also applies to: 35-35, 49-49, 66-66, 83-83, 97-97, 122-122, 145-145, 176-176
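
One way to get a fresh address per test is sketched below. It assumes the tests accept a ZeroMQ-style `ipc://` address string; `make_unique_ipc_addr` is a hypothetical helper, not something that exists in the test suite.

```python
import os
import tempfile
import uuid


def make_unique_ipc_addr(prefix: str = "rpc_test") -> str:
    """Return an ipc:// address that other tests are very unlikely to share."""
    # The pid separates parallel test processes; the uuid separates tests
    # running inside the same process.
    name = f"{prefix}_{os.getpid()}_{uuid.uuid4().hex}"
    return "ipc://" + os.path.join(tempfile.gettempdir(), name)
```

Each test would call the helper once and pass the same address to both its server and its client. Unix socket paths have a length limit, so the prefix should stay short.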
tensorrt_llm/executor/rpc_proxy.py (4)
1-1: Missing NVIDIA copyright header

All TensorRT-LLM source files should contain an NVIDIA copyright header.

43-46: Store worker_kwargs and clock_unit as instance variables

The parameters are used later but not stored as instance variables.
         GenerationExecutorRpcProxy.INSTANCE_COUNTER += 1
         self.rpc_addr = self.gen_uniq_rpc_addr()
         self.rpc_client = RPCClient(self.rpc_addr)
+        self.worker_kwargs = worker_kwargs
+        self.clock_unit = clock_unit
72-77: Undefined name 'rpc_worker_main' and AttributeError for worker_kwargs

The function rpc_worker_main is not imported or defined, and self.worker_kwargs is not defined.

Additionally, you need to import the missing function:
+from .rpc_worker import rpc_worker_main
145-158: Method should return the created MPI session

The method assigns to self.mpi_session but doesn't return it, yet it's called with assignment.
     def _create_mpi_session(self, model_world_size: int,
                             mpi_session: Optional[MpiSession]):
         mpi_process_pre_spawned: bool = get_spawn_proxy_process_env()
         if mpi_session is None:
             if mpi_process_pre_spawned:
                 print_colored_debug('create comm session ...\n', "yellow")
                 self.mpi_session = create_mpi_comm_session(model_world_size)
             else:
                 print_colored_debug('create pool session ...\n', "yellow")
                 self.mpi_session = MpiPoolSession(n_workers=model_world_size)
         else:
             print_colored_debug('using external mpi session ...\n', "yellow")
             self.mpi_session = mpi_session
+        return self.mpi_session
tensorrt_llm/executor/rpc.py (3)

85-88: Add validation for socket initialization

The address property assumes _client_socket is not None.

284-295: Potential race condition in client cleanup

The __del__ method calls close() which cancels _reader_task, but the task might still be executing.
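
A common way to close this kind of gap is to cancel the task and then await it, so that its cleanup has finished before close returns. The sketch below assumes an explicit async close path is available (this cannot be done from `__del__`); `cancel_and_wait` and `demo` are hypothetical names.

```python
import asyncio
import contextlib


async def cancel_and_wait(task: asyncio.Task) -> None:
    """Cancel a task and wait until it has finished unwinding."""
    task.cancel()
    # Awaiting lets the task run its `finally` blocks. The resulting
    # CancelledError is expected here and therefore suppressed.
    with contextlib.suppress(asyncio.CancelledError):
        await task


async def demo() -> list:
    events = []

    async def reader():
        try:
            await asyncio.sleep(3600)  # stands in for a blocking receive
        finally:
            events.append("reader cleaned up")

    task = asyncio.create_task(reader())
    await asyncio.sleep(0)  # give the reader a chance to start
    await cancel_and_wait(task)
    events.append("closed")
    return events
```

In `demo`, the reader's cleanup is recorded before "closed", which is the ordering a safe close needs.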

384-387: Handle nested event loops properly

The _call_sync method uses asyncio.run() which will fail if called from within an existing event loop.
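
One possible workaround is sketched here: detect a running loop and, if there is one, run the coroutine on a fresh loop in a helper thread. This is an illustration rather than the fix adopted in rpc.py, and `run_coroutine_blocking` is a hypothetical helper. It only suits coroutines that are not bound to objects owned by the caller's loop, and it blocks the calling loop while waiting.

```python
import asyncio
import concurrent.futures


def run_coroutine_blocking(coro):
    """Run `coro` to completion from synchronous code."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run() is safe.
        return asyncio.run(coro)
    # A loop is already running in this thread, where asyncio.run() would
    # raise RuntimeError. Use a fresh loop in a helper thread instead.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()


async def add(a, b):
    await asyncio.sleep(0)
    return a + b


async def caller_inside_loop():
    # Synchronous helper invoked while an event loop is already running.
    return run_coroutine_blocking(add(2, 3))
```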

tensorrt_llm/executor/worker.py (1)

116-124: Add validation for client_id in return_queue

The method assumes client_id exists in _results.

🧹 Nitpick comments (6)

tests/unittest/executor/test_rpc_worker.py (1)

63-64: Address the known shutdown hang issue

The comments indicate a known issue where responses must be fetched before shutdown to avoid hanging. This should be properly addressed or tracked.

This known issue could lead to test flakiness and production problems. Would you like me to:

Open an issue to track this problem?

Implement a proper solution that ensures all responses are fetched before shutdown?

Add a timeout mechanism to prevent indefinite hanging?

Also applies to: 80-81

tests/unittest/executor/test_rpc.py (1)

231-232: Make timeout error assertion more specific

The assertion for timeout error message should be more specific to ensure the correct error type is raised.

-            assert "timed out" in str(
-                exc_info.value), f"Timeout message not found: {exc_info.value}"
+            assert "timed out" in str(exc_info.value) or "timeout" in str(exc_info.value).lower(), \
+                f"Timeout message not found in error: {exc_info.value}"
+            # Also verify it's an RPCError or RPCTimeout
+            assert isinstance(exc_info.value, RPCError), \
+                f"Expected RPCError, got {type(exc_info.value)}"

tensorrt_llm/executor/rpc_proxy.py (1)

78-93: Improve timing logic clarity in main loop

The modulo operations with clock counter seem arbitrary. Consider using more descriptive timing mechanisms.

     def main_loop_task(self):
         """
         Main loop of the proxy, it will invoke the actions periodically.
         """
-        clock = 0
+        last_stats_time = time.time()
+        stats_interval = 10 * self.clock_unit  # Fetch stats every 10 clock units
+        
         while not self._shutdown_event.is_set():
-            if clock % 1 == 0:
-                responses = self.await_responses_remote()
-                self.handle_responses(responses)
-            if clock % 10 == 0:
+            # Always fetch responses
+            responses = self.await_responses_remote()
+            self.handle_responses(responses)
+            
+            # Fetch stats periodically
+            current_time = time.time()
+            if current_time - last_stats_time >= stats_interval:
                 stats = self.get_stats_remote()  # TODO
                 self.handle_stats(stats)
+                last_stats_time = current_time

-            clock += 1
             time.sleep(self.clock_unit)
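
A variant of the same idea, sketched with `time.monotonic()` so that system clock adjustments cannot skew the interval. `IntervalTimer` is a hypothetical helper, and the injectable `clock` argument exists only to make the logic testable without sleeping.

```python
import time


class IntervalTimer:
    """Report when at least `interval_s` seconds have passed since the last hit."""

    def __init__(self, interval_s: float, clock=time.monotonic):
        self._interval_s = interval_s
        self._clock = clock
        self._last = clock()

    def due(self) -> bool:
        now = self._clock()
        if now - self._last >= self._interval_s:
            # Restart the interval from the moment it fired.
            self._last = now
            return True
        return False
```

In the loop above, the stats branch would become `if stats_timer.due():`, with `stats_timer = IntervalTimer(10 * self.clock_unit)` created before the loop.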

tensorrt_llm/executor/rpc.py (1)

179-202: Simplify duplicated executor logic

The async and sync execution branches have nearly identical code that can be consolidated.

                 try:
-                    if self._executor is not None:
-                        # Dispatch to worker thread and await result with timeout
-                        loop = asyncio.get_running_loop()
-
-                        # Create a wrapper function to handle keyword arguments
-                        def call_with_kwargs():
-                            return self._functions[req.method_name](
-                                *req.args, **req.kwargs)
-
-                        result = await asyncio.wait_for(loop.run_in_executor(
-                            self._executor, call_with_kwargs),
-                                                        timeout=req.timeout)
-                    else:
-                        # For synchronous execution, we need to run in executor to support timeout
-                        loop = asyncio.get_running_loop()
-
-                        # Create a wrapper function to handle keyword arguments
-                        def call_with_kwargs():
-                            return self._functions[req.method_name](
-                                *req.args, **req.kwargs)
-
-                        result = await asyncio.wait_for(loop.run_in_executor(
-                            None, call_with_kwargs),
-                                                        timeout=req.timeout)
+                    loop = asyncio.get_running_loop()
+                    
+                    # Create a wrapper function to handle keyword arguments
+                    def call_with_kwargs():
+                        return self._functions[req.method_name](
+                            *req.args, **req.kwargs)
+                    
+                    # self._executor may be None, which selects the loop's default thread pool
+                    executor = self._executor
+                    result = await asyncio.wait_for(
+                        loop.run_in_executor(executor, call_with_kwargs),
+                        timeout=req.timeout
+                    )
                     response = RPCResponse(req.request_id, 'OK', result)
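The consolidation works because `loop.run_in_executor` accepts `None` as its executor argument and then uses the event loop's default thread pool. The following is a self-contained sketch of that pattern; `call_with_timeout`, `add`, `slow`, and `demo` are hypothetical names for illustration and are not part of `rpc.py`.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


async def call_with_timeout(func, args=(), kwargs=None, executor=None, timeout=None):
    """Run a blocking callable in a thread pool and bound the wait by `timeout`.

    Passing executor=None selects the event loop's default thread pool.
    """
    kwargs = kwargs or {}
    loop = asyncio.get_running_loop()

    # run_in_executor only forwards positional arguments, so wrap the call.
    def call_with_kwargs():
        return func(*args, **kwargs)

    return await asyncio.wait_for(
        loop.run_in_executor(executor, call_with_kwargs), timeout=timeout)


def add(a, b=0):
    return a + b


def slow():
    time.sleep(0.2)
    return "done"


async def demo():
    results = []
    # Default thread pool (executor=None).
    results.append(await call_with_timeout(add, (1,), {"b": 2}))
    # Explicitly configured pool.
    with ThreadPoolExecutor(max_workers=1) as pool:
        results.append(await call_with_timeout(add, (3,), {"b": 4}, executor=pool))
    # A call that exceeds its timeout.
    try:
        await call_with_timeout(slow, timeout=0.01)
    except asyncio.TimeoutError:
        results.append("timeout")
    return results
```

Note that the timeout only stops the caller from waiting: a function already running in a worker thread is not interrupted and continues to completion in the background.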

tensorrt_llm/executor/worker_base.py (2)

192-193: Fix docstring formatting.

The docstring should end with a period according to Python docstring conventions.

-    """Returns True if the adapter was loaded by this call, False if it was already loaded"""
+    """Returns True if the adapter was loaded by this call, False if it was already loaded."""

632-636: Fix docstring formatting.

The one-line docstring should fit on a single line.

-def _get_logprobs(worker,
-                  response: tllm.Response,
-                  is_pytorch_backend=False) -> Optional[LogProbsResult]:
-    """Compute logprob and prompt logprob and clear out logits if applicable.
-    """
+def _get_logprobs(worker,
+                  response: tllm.Response,
+                  is_pytorch_backend=False) -> Optional[LogProbsResult]:
+    """Compute logprob and prompt logprob and clear out logits if applicable."""

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 33c3dbb and ddb12ed.

📒 Files selected for processing (10)

tensorrt_llm/executor/ipc.py (2 hunks)
tensorrt_llm/executor/rpc.py (1 hunks)
tensorrt_llm/executor/rpc_proxy.py (1 hunks)
tensorrt_llm/executor/rpc_worker.py (1 hunks)
tensorrt_llm/executor/worker.py (6 hunks)
tensorrt_llm/executor/worker_base.py (1 hunks)
tensorrt_llm/llmapi/utils.py (1 hunks)
tests/unittest/executor/test_rpc.py (1 hunks)
tests/unittest/executor/test_rpc_worker.py (1 hunks)
tests/unittest/executor/test_worker_base.py (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (4)

tensorrt_llm/executor/ipc.py
tensorrt_llm/llmapi/utils.py
tensorrt_llm/executor/rpc_worker.py
tests/unittest/executor/test_worker_base.py

🧰 Additional context used

📓 Path-based instructions (2)

**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a class in the constructor in Python.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
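As an illustration of the last two guidelines, here is a small sketch with a hypothetical helper, `to_token_ids`, that is not taken from the TensorRT-LLM code base. The `try` body contains only the operation that may fail, the `except` names a single error type, and the main logic sits in the `else` block.

```python
def to_token_ids(value):
    """Accept an int or an iterable of ints and return a list of ints."""
    try:
        # Only the type probe lives in the try body.
        iterator = iter(value)
    except TypeError:
        # Not iterable: treat the value as a single token id.
        return [int(value)]
    else:
        # Iterable: convert each element outside the try body.
        return [int(item) for item in iterator]
```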

Files:

tests/unittest/executor/test_rpc_worker.py
tensorrt_llm/executor/rpc_proxy.py
tensorrt_llm/executor/rpc.py
tensorrt_llm/executor/worker_base.py
tests/unittest/executor/test_rpc.py
tensorrt_llm/executor/worker.py

**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

tests/unittest/executor/test_rpc_worker.py
tensorrt_llm/executor/rpc_proxy.py
tensorrt_llm/executor/rpc.py
tensorrt_llm/executor/worker_base.py
tests/unittest/executor/test_rpc.py
tensorrt_llm/executor/worker.py

🧠 Learnings (6)

📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...

Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

tests/unittest/executor/test_rpc_worker.py
tests/unittest/executor/test_rpc.py

📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...

Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

tests/unittest/executor/test_rpc_worker.py
tensorrt_llm/executor/rpc_proxy.py
tensorrt_llm/executor/rpc.py
tensorrt_llm/executor/worker_base.py
tensorrt_llm/executor/worker.py

📚 Learning: applies to **/*.{cpp,h,hpp,cc,cxx,cu,py} : all tensorrt-llm open source software code should contain...

Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-01T07:34:42.704Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx,cu,py} : All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Applied to files:

tensorrt_llm/executor/rpc_proxy.py

📚 Learning: applies to **/*.py : the code developed for tensorrt-llm should conform to python 3.8+....

Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-01T07:34:42.704Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.

Applied to files:

tensorrt_llm/executor/rpc_proxy.py

📚 Learning: applies to **/*.{h,hpp} : use a preprocessor guard in header files. the guard name must have prefix ...

Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-01T07:34:42.704Z
Learning: Applies to **/*.{h,hpp} : Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.

Applied to files:

tensorrt_llm/executor/rpc_proxy.py

📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...

Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

tensorrt_llm/executor/worker.py

🪛 Ruff (0.12.2)

tensorrt_llm/executor/rpc_proxy.py

32-40: 1 blank line required between summary line and description

(D205)

32-40: First line should end with a period, question mark, or exclamation point

Add closing punctuation

(D415)

72-72: Undefined name rpc_worker_main

(F821)

77-78: One-line docstring should fit on one line

Reformat to one line

(D200)

tensorrt_llm/executor/rpc.py

40-41: One-line docstring should fit on one line

Reformat to one line

(D200)

254-255: One-line docstring should fit on one line

Reformat to one line

(D200)

262-266: 1 blank line required between summary line and description

(D205)

440-442: 1 blank line required between summary line and description

(D205)

451-451: First line should end with a period, question mark, or exclamation point

Add closing punctuation

(D415)

455-455: First line should end with a period, question mark, or exclamation point

Add closing punctuation

(D415)

460-460: First line should end with a period, question mark, or exclamation point

Add closing punctuation

(D415)

tensorrt_llm/executor/worker_base.py

21-21: Redefinition of unused logger from line 12

(F811)

192-192: First line should end with a period, question mark, or exclamation point

Add closing punctuation

(D415)

239-239: Line too long (138 > 120)

(E501)

261-261: Line too long (153 > 120)

(E501)

267-267: Line too long (132 > 120)

(E501)

325-325: Line too long (165 > 120)

(E501)

453-453: Line too long (121 > 120)

(E501)

634-635: One-line docstring should fit on one line

Reformat to one line

(D200)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pre-commit Check

🔇 Additional comments (8)

tensorrt_llm/executor/worker.py (1)

234-261: Well-implemented shutdown logic

The shutdown method properly handles thread cleanup with appropriate checks for thread states and includes error handling before final shutdown.

tensorrt_llm/executor/worker_base.py (7)

44-85: LGTM!

The WorkerBase class initialization is well-structured with appropriate instance variable initialization and proper use of weakref to prevent circular references.

86-142: LGTM!

The setup_engine method properly handles device initialization, MPI communication setup, and supports multiple backend configurations.

213-226: Known LoRA adapter cache optimization limitation.

Based on previous learnings, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() has a known race condition issue that requires a comprehensive solution beyond simple error handling.

372-403: LGTM!

The submit method properly enforces rank restrictions and manages the request lifecycle correctly.

411-414: LGTM!

Simple delegation method with appropriate debug logging.

438-452: LGTM!

The stats serialization logic properly handles both iteration and request statistics.

453-462: LGTM!

The queue setter methods properly enforce mutual exclusivity between result queue and postprocessing queues.

Superjomn · 2025-10-02T09:56:11Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-10-02T10:01:38Z

PR_Github #20523 [ run ] triggered by Bot

tensorrt-cicd · 2025-10-02T10:01:39Z

PR_Github #20521 [ run ] completed with state ABORTED
LLM/main/L0_MergeRequest_PR #15479 (Blue Ocean) completed with status: ABORTED

tensorrt-cicd · 2025-10-02T13:11:22Z

PR_Github #20523 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15481 completed with status: 'FAILURE'

Superjomn · 2025-10-02T15:58:02Z

/bot run

tensorrt-cicd · 2025-10-02T16:03:18Z

PR_Github #20542 [ run ] triggered by Bot

Superjomn · 2025-10-02T16:04:42Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-10-02T16:10:17Z

PR_Github #20543 [ run ] triggered by Bot

tensorrt-cicd · 2025-10-02T16:10:21Z

PR_Github #20542 [ run ] completed with state ABORTED
LLM/main/L0_MergeRequest_PR #15499 (Blue Ocean) completed with status: ABORTED

tensorrt-cicd · 2025-10-02T23:49:12Z

PR_Github #20543 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15500 completed with status: 'FAILURE'

Superjomn · 2025-10-03T03:27:26Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-10-03T03:33:24Z

PR_Github #20566 [ run ] triggered by Bot

tensorrt-cicd · 2025-10-03T09:18:28Z

PR_Github #20566 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15523 completed with status: 'FAILURE'

Superjomn · 2025-10-03T14:46:30Z

/bot run

tensorrt-cicd · 2025-10-03T14:52:04Z

PR_Github #20593 [ run ] triggered by Bot

tensorrt-cicd · 2025-10-03T16:57:12Z

PR_Github #20593 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15546 completed with status: 'FAILURE'

Superjomn · 2025-10-04T07:59:04Z

/bot run

tensorrt-cicd · 2025-10-04T08:04:16Z

PR_Github #20628 [ run ] triggered by Bot

tensorrt-cicd · 2025-10-04T14:23:43Z

PR_Github #20628 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15575 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

Superjomn marked this pull request as draft June 27, 2025 09:00

Superjomn force-pushed the rpc branch 2 times, most recently from 81d985a to ec65fac Compare July 1, 2025 10:08

Superjomn force-pushed the rpc branch from 8b17e74 to 85307ad Compare July 10, 2025 08:18

Superjomn force-pushed the rpc branch from 85307ad to 5d2561f Compare July 29, 2025 10:59

coderabbitai bot reviewed Jul 29, 2025

tensorrt_llm/executor/rpc.py (outdated)

tensorrt_llm/executor/rpc.py (outdated)

tensorrt_llm/executor/worker.py (outdated)

tests/unittest/executor/test_rpc.py (outdated)

Superjomn force-pushed the rpc branch from 5d2561f to 6c709ce Compare July 30, 2025 07:13

coderabbitai bot requested review from amitz-nv, litaotju, nv-guomingz and shaharmor98 July 30, 2025 07:13

coderabbitai bot reviewed Jul 30, 2025

tensorrt_llm/executor/rpc.py (outdated)

tensorrt_llm/executor/worker_base.py (outdated)

coderabbitai bot reviewed Jul 30, 2025

Superjomn force-pushed the rpc branch from cbfa09e to a61f1ca Compare July 30, 2025 10:21

coderabbitai bot reviewed Jul 30, 2025

tests/unittest/executor/test_worker_base.py (outdated)

coderabbitai bot requested review from DomBrown and tijyojwad July 30, 2025 11:01

coderabbitai bot reviewed Jul 30, 2025

Superjomn force-pushed the rpc branch from 33c3dbb to ddb12ed Compare August 1, 2025 07:38

coderabbitai bot reviewed Aug 1, 2025

Superjomn force-pushed the rpc branch 2 times, most recently from b53ed9f to 2a80571 Compare August 20, 2025 12:54

Superjomn removed request for DomBrown, amitz-nv, litaotju, nv-guomingz, shaharmor98 and tijyojwad August 21, 2025 11:42

Superjomn mentioned this pull request Sep 19, 2025

[TRTLLM-8188][chore] refactor GenerationExecutorWorker with WorkerBase for better code reusing #7840

Merged

1 task

Superjomn force-pushed the rpc branch from d659629 to cd7274b Compare October 2, 2025 09:55

Superjomn force-pushed the rpc branch from cd7274b to 397c889 Compare October 2, 2025 16:04

Superjomn force-pushed the rpc branch 2 times, most recently from 00a1419 to 1d97a59 Compare October 3, 2025 03:27

add rpc test list

f13a34e

Signed-off-by: chunweiy <chunweiy@nvidia.com> Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

Superjomn force-pushed the rpc branch from 1d97a59 to f13a34e Compare October 3, 2025 14:46

Superjomn enabled auto-merge (squash) October 4, 2025 12:36

Superjomn requested a review from a team October 4, 2025 12:42

Superjomn requested a review from lucaslie October 5, 2025 02:21

juney-nvidia approved these changes Oct 5, 2025

Superjomn merged commit fb51de6 into NVIDIA:main Oct 5, 2025
5 checks passed
