[TRTLLM-8189][chore] enhance GenerationExecutor with RPC (part1) #5543
81d985a to
ec65fac
Compare
📝 Walkthrough
The changes introduce a new asynchronous RPC framework using ZeroMQ, including both server and client implementations, and add corresponding unit tests. The worker implementation is refactored by extracting a new
Changes
Sequence Diagram(s)
sequenceDiagram
participant Client as RPCClient
participant Server as RPCServer
participant Worker as WorkerThread
Client->>Server: Send RPC request (ZeroMQ)
Server->>Worker: Dispatch request to worker queue
Worker->>Worker: Execute method (possibly async/threaded)
Worker-->>Server: Place response in response queue
Server->>Client: Send response (ZeroMQ)
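The request flow in the diagram above can be sketched with the standard library alone. This is a minimal illustration, not the PR's implementation: it swaps ZeroMQ sockets for in-process `queue.Queue` objects, and every name in it (`METHODS`, `worker_loop`, `call`) is hypothetical.

```python
import queue
import threading


def add(a, b):
    """Example method exposed by the hypothetical server."""
    return a + b


# Hypothetical method table; a real server would look methods up on an instance.
METHODS = {"add": add}


def worker_loop(requests, responses):
    """Pull (request_id, method, args) tuples and push (request_id, result, error)."""
    while True:
        item = requests.get()
        if item is None:  # sentinel: stop the worker
            break
        request_id, method_name, args = item
        try:
            result = METHODS[method_name](*args)
        except Exception as exc:  # broad on purpose: any failure goes back to the caller
            responses.put((request_id, None, exc))
        else:
            responses.put((request_id, result, None))


def call(requests, responses, request_id, method_name, *args):
    """Client side: enqueue one request and block for its response."""
    requests.put((request_id, method_name, args))
    response_id, result, error = responses.get(timeout=5.0)
    if response_id != request_id:
        raise RuntimeError("response does not match the request")
    if error is not None:
        raise error
    return result


request_queue = queue.Queue()
response_queue = queue.Queue()
worker = threading.Thread(
    target=worker_loop, args=(request_queue, response_queue), daemon=True)
worker.start()

total = call(request_queue, response_queue, 1, "add", 2, 3)
request_queue.put(None)  # ask the worker to stop
worker.join(timeout=5.0)
```

Tagging each message with a request id is what lets a real client match responses to pending calls when several requests are in flight at once.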
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 4
🔭 Outside diff range comments (1)
tensorrt_llm/executor/worker.py (1)
684-832: Significant code duplication in _enqueue_request.
There's substantial duplication between WorkerBase._enqueue_request (lines 190-341) and GenerationExecutorWorker._enqueue_request (lines 684-832). The main differences are:
- LoRA adapter caching optimization in GenerationExecutorWorker
- Multimodal parameter handling differences
- Some parameter differences for PyTorch backend
Consider refactoring to reduce duplication:
class WorkerBase:

    def _enqueue_request(self, request: GenerationRequest) -> int:
        # Common validation
        assert request.id is not None

        # Allow subclasses to customize LoRA loading
        lora_config = self._prepare_lora_config(request)

        # Common token preparation
        prompt_token_ids = copy.deepcopy(request.prompt_token_ids)
        prompt_tuning_config = self._prepare_prompt_tuning(request, prompt_token_ids)

        # Allow subclasses to customize multimodal handling
        multimodal_params = self._prepare_multimodal_params(request)

        # Rest of common logic...

    def _prepare_lora_config(self, request):
        """Override in subclasses for custom LoRA handling."""
        # Default implementation

    def _prepare_multimodal_params(self, request):
        """Override in subclasses for custom multimodal handling."""
        # Default implementation
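The hook-based refactoring suggested above is an instance of the template-method pattern. The following runnable sketch shows the shape of it under loud assumptions: `BaseWorker`, `CachingWorker`, and the dictionary-based request are hypothetical stand-ins, not TensorRT-LLM types, and the adapter bookkeeping ignores the known race condition in the real LoRA cache logic.

```python
import copy


class BaseWorker:
    """Hypothetical base class holding the shared enqueue flow."""

    def enqueue(self, request):
        # Common validation shared by every subclass.
        if request.get("id") is None:
            raise ValueError("request must carry an id")
        tokens = copy.deepcopy(request["prompt_token_ids"])
        return {
            "id": request["id"],
            "tokens": tokens,
            "lora": self._prepare_lora_config(request),
            "multimodal": self._prepare_multimodal_params(request),
        }

    def _prepare_lora_config(self, request):
        """Default hook: pass the LoRA settings through unchanged."""
        return request.get("lora")

    def _prepare_multimodal_params(self, request):
        """Default hook: no multimodal handling."""
        return None


class CachingWorker(BaseWorker):
    """Hypothetical subclass that avoids re-sending adapters it has seen."""

    def __init__(self):
        self._seen_adapters = set()

    def _prepare_lora_config(self, request):
        lora = request.get("lora")
        if lora is None:
            return None
        if lora["adapter_id"] in self._seen_adapters:
            # Already seen: send only the id, not the weights.
            return {"adapter_id": lora["adapter_id"], "weights": None}
        self._seen_adapters.add(lora["adapter_id"])
        return lora


example_request = {
    "id": 7,
    "prompt_token_ids": [1, 2, 3],
    "lora": {"adapter_id": 42, "weights": "w"},
}
caching_worker = CachingWorker()
first = caching_worker.enqueue(example_request)
second = caching_worker.enqueue(example_request)
```

Only the hook differs between the two classes; the validation and token copying live in one place, which is the duplication the comment asks to remove.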
🧹 Nitpick comments (3)
tests/unittest/executor/test_rpc.py (1)
22-22: Consider making sleep times configurable or adaptive.
Fixed sleep times can cause flaky tests on slower systems or waste time on faster systems.
Consider using a more robust approach:
import time

def wait_for_server_ready(server, timeout=5.0, check_interval=0.1):
    """Wait for server to be ready with timeout."""
    start_time = time.time()
    while time.time() - start_time < timeout:
        try:
            # Try to connect or check server status
            if hasattr(server, 'is_ready') and server.is_ready():
                return True
            time.sleep(check_interval)
        except Exception:
            time.sleep(check_interval)
    raise TimeoutError("Server did not become ready in time")
Also applies to: 37-37, 51-51, 68-68, 85-85, 99-99, 124-124, 127-127, 146-146, 180-180
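A more general variant of the helper above polls an arbitrary predicate, which lets the same function replace every fixed sleep in a test file. This is a sketch under assumptions: `wait_until` and `FakeServer` are hypothetical names, and it uses `time.monotonic()` so the deadline is not affected by wall-clock adjustments.

```python
import time


def wait_until(predicate, timeout=5.0, check_interval=0.01):
    """Poll predicate() until it returns True or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(check_interval)
    raise TimeoutError("condition was not met in time")


class FakeServer:
    """Hypothetical stand-in that becomes ready after a few polls."""

    def __init__(self, polls_until_ready):
        self._remaining = polls_until_ready

    def is_ready(self):
        self._remaining -= 1
        return self._remaining <= 0


became_ready = wait_until(FakeServer(3).is_ready, timeout=1.0)
```

A fast machine returns as soon as the condition holds, while a slow one gets the full timeout, so the test neither wastes time nor flakes.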
tensorrt_llm/executor/rpc.py (2)
109-127: Consider thread-safe shutdown.
The shutdown method modifies shared state without synchronization. While the GIL provides some protection, explicit synchronization would be safer.
Consider using a lock for thread-safe shutdown:
 def __init__(self, ...):
+    self._shutdown_lock = threading.Lock()
     ...

 def shutdown(self):
     """Internal method to trigger server shutdown."""
+    with self._shutdown_lock:
+        if self._client_socket is None:
+            return  # Already shut down
     logger.debug(
         "RPC Server shutdown signal received. Terminating server...")
     ...
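The lock-guarded shutdown suggested above can be checked in isolation. In this hypothetical sketch the whole cleanup runs inside the lock, and a boolean flag (an assumption, standing in for the `_client_socket is None` check) makes repeated calls harmless. `Service` and `cleanup_count` exist only for illustration.

```python
import threading


class Service:
    """Hypothetical service with an idempotent, lock-guarded shutdown."""

    def __init__(self):
        self._shutdown_lock = threading.Lock()
        self._is_shut_down = False
        self.cleanup_count = 0

    def shutdown(self):
        with self._shutdown_lock:
            if self._is_shut_down:
                return  # Another caller already cleaned up.
            self._is_shut_down = True
            # Stands in for closing sockets and joining threads.
            self.cleanup_count += 1


service = Service()
threads = [threading.Thread(target=service.shutdown) for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

However many threads race into `shutdown()`, the cleanup body executes once, because the check and the state change happen under the same lock.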
39-40: Minor: Fix docstring formatting.
Single-line docstrings should fit on one line per PEP 257.
-    """
-    An RPC Server that listens for requests and executes them concurrently.
-    """
+    """An RPC Server that listens for requests and executes them concurrently."""

-    """
-    An RPC Client that connects to the RPCServer.
-    """
+    """An RPC Client that connects to the RPCServer."""
Also applies to: 235-236
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- tensorrt_llm/executor/ipc.py (2 hunks)
- tensorrt_llm/executor/rpc.py (1 hunks)
- tensorrt_llm/executor/worker.py (7 hunks)
- tensorrt_llm/llmapi/utils.py (1 hunks)
- tests/unittest/executor/test_rpc.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a class in the constructor in Python.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
- tensorrt_llm/llmapi/utils.py
- tensorrt_llm/executor/ipc.py
- tests/unittest/executor/test_rpc.py
- tensorrt_llm/executor/rpc.py
- tensorrt_llm/executor/worker.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
- tensorrt_llm/llmapi/utils.py
- tensorrt_llm/executor/ipc.py
- tests/unittest/executor/test_rpc.py
- tensorrt_llm/executor/rpc.py
- tensorrt_llm/executor/worker.py
🧠 Learnings (2)
tests/unittest/executor/test_rpc.py (1)
Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
tensorrt_llm/executor/worker.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Ruff (0.12.2)
tensorrt_llm/executor/rpc.py
39-40: One-line docstring should fit on one line
Reformat to one line
(D200)
233-234: One-line docstring should fit on one line
Reformat to one line
(D200)
241-245: 1 blank line required between summary line and description
(D205)
395-397: 1 blank line required between summary line and description
(D205)
406-406: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
410-410: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
417-417: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
tensorrt_llm/executor/worker.py
49-53: 1 blank line required between summary line and description
(D205)
96-96: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
132-132: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
347-347: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
353-355: 1 blank line required between summary line and description
(D205)
382-382: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
387-387: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
392-392: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
461-461: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
509-512: 1 blank line required between summary line and description
(D205)
🔇 Additional comments (9)
tensorrt_llm/llmapi/utils.py (1)
246-253: LGTM! Good enhancement for thread lifecycle management.
The addition of the optional stop_event parameter provides flexibility for external thread control while maintaining backward compatibility. This is particularly useful for coordinating shutdown across multiple threads using a shared event.
tensorrt_llm/executor/ipc.py (1)
1-1: LGTM! Useful addition for non-blocking async operations.
The new get_async_noblock method provides a clean way to perform async receives with timeout, which is essential for the RPC framework's non-blocking communication patterns.
Also applies to: 200-202
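An async receive with a timeout, as described for get_async_noblock above, is commonly built on `asyncio.wait_for`. The sketch below is an assumption about the general technique, not the method's actual body: it uses an `asyncio.Queue` in place of a ZeroMQ socket, and `get_with_timeout` is a hypothetical name.

```python
import asyncio


async def get_with_timeout(message_queue, timeout):
    """Return the next item, or None if nothing arrives within timeout seconds."""
    try:
        return await asyncio.wait_for(message_queue.get(), timeout=timeout)
    except asyncio.TimeoutError:
        return None


async def demo():
    message_queue = asyncio.Queue()
    # Nothing has been sent yet, so this call times out.
    empty_result = await get_with_timeout(message_queue, timeout=0.05)
    await message_queue.put("ping")
    # An item is waiting, so this call returns immediately.
    ready_result = await get_with_timeout(message_queue, timeout=0.05)
    return empty_result, ready_result


empty_result, ready_result = asyncio.run(demo())
```

Returning a sentinel on timeout lets a reader loop wake up periodically and check a shutdown flag instead of blocking forever on an idle socket.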
tests/unittest/executor/test_rpc.py (1)
168-192: Good test coverage with parameterization!
The benchmark test effectively covers different configurations (async/sync, IPC/TCP) with proper assertions. The performance measurement is useful for regression testing.
tensorrt_llm/executor/rpc.py (2)
280-306: Good error handling in response reader!
The response reader properly handles cancellation, propagates exceptions to pending futures, and cleans up when done. The error propagation pattern is well implemented.
403-431: Clever use of dynamic proxy pattern!
The __getattr__ implementation with MethodProxy provides an elegant API that supports multiple calling styles (sync, async, future). This makes the client very user-friendly.
tensorrt_llm/executor/worker.py (4)
48-54: Excellent refactoring with WorkerBase extraction!
The introduction of WorkerBase provides a clean separation between core worker logic and transport/threading concerns. This modular design will make it easier to implement different worker types (e.g., RPC-based workers) in the future.
571-584: Good defensive programming with queue checks!
The additional checks for queue availability and methods prevent crashes when queues are not properly initialized. The early return pattern is clean.
844-869: Good override of submit method.
The override properly calls the parent class logic while passing self as the executor, maintaining the correct inheritance chain.
893-893: Clean shutdown delegation.
Using _basic_shutdown() from the base class for engine shutdown while handling thread management in the subclass is a good separation of concerns.
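The dynamic proxy pattern praised above for rpc.py can be illustrated with a small stand-alone sketch. Everything here is hypothetical: the `Client` and `MethodProxy` classes only mirror the idea, `call_future` is an invented method name, and `local_send` replaces the socket transport with a local lookup table.

```python
from concurrent.futures import ThreadPoolExecutor


class MethodProxy:
    """Hypothetical proxy binding a method name to a transport function."""

    def __init__(self, send, pool, name):
        self._send = send
        self._pool = pool
        self._name = name

    def __call__(self, *args, **kwargs):
        """Synchronous style: block until the result is available."""
        return self._send(self._name, args, kwargs)

    def call_future(self, *args, **kwargs):
        """Future style: return a concurrent.futures.Future immediately."""
        return self._pool.submit(self._send, self._name, args, kwargs)


class Client:
    """Hypothetical client that turns attribute access into remote calls."""

    def __init__(self, send):
        self._send = send
        self._pool = ThreadPoolExecutor(max_workers=2)

    def __getattr__(self, name):
        # Only invoked when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        return MethodProxy(self._send, self._pool, name)

    def close(self):
        self._pool.shutdown(wait=True)


def local_send(name, args, kwargs):
    """Stand-in transport: dispatch to a local table instead of a socket."""
    table = {"add": lambda a, b: a + b, "echo": lambda text: text}
    return table[name](*args, **kwargs)


client = Client(local_send)
sync_result = client.add(1, 2)
future_result = client.echo.call_future("hi").result(timeout=5.0)
client.close()
```

The underscore guard in `__getattr__` matters: without it, a lookup of a private attribute that is not yet set would silently produce a proxy instead of an `AttributeError`.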
Actionable comments posted: 2
🔭 Outside diff range comments (1)
tensorrt_llm/executor/worker.py (1)
118-126: Add validation for client_id in return_queue.
While the method now checks result_queue, it still assumes client_id exists in the _results dictionary when accessing it on line 125.
 def return_queue(self, client_id: int):
     """
     If a centralized result queue is registered (used for communication with the proxy)
     send the message there.
     Otherwise, push the result directly in the GenerationResult queue.
     """
     if self.result_queue is not None:
         return self.result_queue
+    if client_id not in self._results:
+        raise KeyError(f"No result found for client_id {client_id}")
     return self._results[client_id].queue
♻️ Duplicate comments (2)
tensorrt_llm/executor/rpc.py (2)
84-88: Add validation for socket initialization.
The address property assumes _client_socket is not None, but this could be accessed before bind() is called.
263-274: Potential race condition in client cleanup.
The __del__ method calls close(), which cancels _reader_task, but the task might still be executing when the client is garbage collected.
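The first of the two concerns above, reading the address before bind(), can be addressed with a guarded property. This is a minimal sketch with hypothetical names (`Endpoint`, `_socket_address`); the exception type is an assumption, not what the PR uses.

```python
class Endpoint:
    """Hypothetical endpoint whose address is only known after bind()."""

    def __init__(self):
        self._socket_address = None

    def bind(self, address):
        self._socket_address = address

    @property
    def address(self):
        if self._socket_address is None:
            raise RuntimeError("address is unavailable before bind() is called")
        return self._socket_address


endpoint = Endpoint()
try:
    endpoint.address
    failed_before_bind = False
except RuntimeError:
    failed_before_bind = True

endpoint.bind("ipc:///tmp/example")  # illustrative address only
bound_address = endpoint.address
```

Failing with an explicit message is easier to diagnose than the `AttributeError` on `None` that the unguarded property would produce.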
🧹 Nitpick comments (2)
tensorrt_llm/executor/rpc.py (1)
38-42: Fix docstring formatting for consistency.
Several docstrings should be reformatted according to the project's style guide.
 class RPCServer:
-    """
-    An RPC Server that listens for requests and executes them concurrently.
-    """
+    """An RPC Server that listens for requests and executes them concurrently."""
Also apply similar fixes to:
- Line 235-238: RPCClient docstring
- Line 245-250: __init__ docstring
- Line 344-359: call_async docstring
- Line 405-407: __getattr__ docstring
tensorrt_llm/executor/worker_base.py (1)
211-233: Fix long lines for better readability.
Several lines exceed the 120 character limit, making the code harder to read.
For example, line 211 can be split:
- # NOTE: Since, we only support PyTorch backend for multimodal, we will send multimodal_data through the 'py_multimodal_data' field
+ # NOTE: Since we only support PyTorch backend for multimodal, we will send multimodal_data
+ # through the 'py_multimodal_data' field
Similar fixes should be applied to lines 233, 239, and 297.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- tensorrt_llm/executor/ipc.py (2 hunks)
- tensorrt_llm/executor/rpc.py (1 hunks)
- tensorrt_llm/executor/worker.py (3 hunks)
- tensorrt_llm/executor/worker_base.py (1 hunks)
- tensorrt_llm/llmapi/utils.py (1 hunks)
- tests/unittest/executor/test_rpc.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- tensorrt_llm/llmapi/utils.py
- tensorrt_llm/executor/ipc.py
- tests/unittest/executor/test_rpc.py
🧠 Learnings (3)
tensorrt_llm/executor/worker.py (2)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
🪛 Ruff (0.12.2)
tensorrt_llm/executor/worker_base.py
164-164: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
211-211: Line too long (138 > 120)
(E501)
233-233: Line too long (153 > 120)
(E501)
239-239: Line too long (132 > 120)
(E501)
297-297: Line too long (165 > 120)
(E501)
393-393: Undefined name GenerationExecutorWorker
(F821)
531-532: One-line docstring should fit on one line
Reformat to one line
(D200)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
Actionable comments posted: 0
♻️ Duplicate comments (2)
tensorrt_llm/executor/worker.py (1)
124-131: Add validation for client_id existence.
The return_queue method could raise a KeyError if client_id doesn't exist in _results.
Add validation:
 def return_queue(self, client_id: int):
     """
     If a centralized result queue is registered (used for communication with the proxy)
     send the message there.
     Otherwise, push the result directly in the GenerationResult queue.
     """
     if self.result_queue is not None:
         return self.result_queue
+    if client_id not in self._results:
+        raise KeyError(f"No result found for client_id {client_id}")
     return self._results[client_id].queue
tensorrt_llm/executor/worker_base.py (1)
395-395: Fix undefined name in type hint.
The type hint references GenerationExecutorWorker, which is not imported in this file.
-    def __init__(self, worker: "GenerationExecutorWorker"):
+    def __init__(self, worker: "WorkerBase"):
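A quoted annotation is a forward reference: it is stored as a string and only resolved when something asks for it, which is why the name must still exist in the module. The sketch below shows this with hypothetical classes (`ResponseHelper`, `Worker`) and the standard `typing.get_type_hints` function.

```python
import typing


class ResponseHelper:
    """Hypothetical helper that refers to its owner by a forward reference."""

    def __init__(self, worker: "Worker"):
        # The quoted name is resolved lazily, so Worker may be defined later
        # in this module, but it does have to be defined somewhere in it.
        self.worker = worker


class Worker:
    """Hypothetical owner class, defined after the helper that names it."""

    def __init__(self):
        self.helper = ResponseHelper(self)


# Resolution succeeds because Worker exists in this module's namespace.
hints = typing.get_type_hints(ResponseHelper.__init__)
```

If the quoted name pointed at a class that is neither defined nor imported here, `typing.get_type_hints` would raise `NameError`, which is the runtime counterpart of the F821 warning.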
🧹 Nitpick comments (2)
tensorrt_llm/executor/worker_base.py (2)
69-70: Address TODO: Consider using weakref for circular reference prevention.
The TODO comment suggests making the worker reference a weakref to prevent potential circular references.
Would you like me to implement the weakref pattern to prevent potential memory leaks from circular references between WorkerBase and AwaitResponseHelper?
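The weakref pattern mentioned above could look roughly like the following. It is a sketch with hypothetical classes (`Owner`, `Helper`), not the WorkerBase code, and the immediate reclamation it demonstrates relies on CPython's reference counting.

```python
import gc
import weakref


class Helper:
    """Hypothetical helper holding only a weak reference to its owner."""

    def __init__(self, owner):
        self._owner_ref = weakref.ref(owner)

    @property
    def owner(self):
        target = self._owner_ref()
        if target is None:
            raise ReferenceError("owner has already been garbage collected")
        return target


class Owner:
    """Hypothetical owner: Owner -> Helper is strong, Helper -> Owner is weak."""

    def __init__(self):
        self.helper = Helper(self)


owner = Owner()
helper = owner.helper
owner_was_reachable = helper.owner is owner

del owner      # drop the only strong reference to the owner
gc.collect()   # not needed on CPython, kept for other interpreters
owner_ref_is_dead = helper._owner_ref() is None
```

Because the back-reference is weak, no reference cycle forms, so the owner is released as soon as its last strong reference goes away.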
533-534: Minor: Reformat docstring to single line.
The docstring can fit on a single line for better readability.
-    """Compute logprob and prompt logprob and clear out logits if applicable.
-    """
+    """Compute logprob and prompt logprob and clear out logits if applicable."""
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- tensorrt_llm/executor/worker.py (3 hunks)
- tensorrt_llm/executor/worker_base.py (1 hunks)
🪛 Ruff (0.12.2)
tensorrt_llm/executor/worker_base.py
166-166: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
213-213: Line too long (138 > 120)
(E501)
235-235: Line too long (153 > 120)
(E501)
241-241: Line too long (132 > 120)
(E501)
299-299: Line too long (165 > 120)
(E501)
395-395: Undefined name GenerationExecutorWorker
(F821)
533-534: One-line docstring should fit on one line
Reformat to one line
(D200)
🔇 Additional comments (4)
tensorrt_llm/executor/worker.py (2)
42-75: Clean refactoring of the constructor.
The constructor has been properly simplified to delegate initialization to WorkerBase and maintains the necessary initialization sequence.
257-279: Proper delegation to parent shutdown.
The shutdown method correctly maintains thread management while delegating core shutdown logic to WorkerBase.
tensorrt_llm/executor/worker_base.py (2)
339-342: Helpful error message for common MPI usage mistake.
The error message provides clear guidance on how to fix the common mistake of calling generate from non-rank-0 processes.
403-430: Clean handler pattern implementation.
The use of enum-based handler kinds with pattern matching provides a clean and extensible design for different response handling modes.
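An enum-keyed handler selection like the one described above can be sketched as follows. Two assumptions apply: the mode names and handler functions are invented for illustration, and the sketch uses a dictionary lookup rather than a `match` statement so that it also runs on Python 3.8, which the coding guidelines name as the minimum version.

```python
import enum


class HandlerKind(enum.Enum):
    """Hypothetical response-handling modes."""

    SINGLE = enum.auto()
    BATCHED = enum.auto()


def handle_single(responses):
    """Deliver each response on its own."""
    return [[response] for response in responses]


def handle_batched(responses):
    """Deliver all responses as one batch."""
    return [list(responses)]


HANDLERS = {
    HandlerKind.SINGLE: handle_single,
    HandlerKind.BATCHED: handle_batched,
}


def dispatch(kind, responses):
    """Pick the handler registered for kind and apply it."""
    try:
        handler = HANDLERS[kind]
    except KeyError:
        raise ValueError(f"unsupported handler kind: {kind!r}") from None
    return handler(responses)


single_output = dispatch(HandlerKind.SINGLE, ["a", "b"])
batched_output = dispatch(HandlerKind.BATCHED, ["a", "b"])
```

Adding a mode then means adding one enum member and one table entry, without touching the dispatch function.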
Actionable comments posted: 1
♻️ Duplicate comments (1)
tensorrt_llm/executor/worker.py (1)
116-123: Missing None check for queue access.
The return_queue method assumes client_id exists in _results without validation, which could cause a KeyError.
 def return_queue(self, client_id: int):
     """
     If a centralized result queue is registered (used for communication with the proxy)
     send the message there.
     Otherwise, push the result directly in the GenerationResult queue.
     """
     if self.result_queue is not None:
         return self.result_queue
+    if client_id not in self._results:
+        raise KeyError(f"No result found for client_id {client_id}")
     return self._results[client_id].queue
🧹 Nitpick comments (7)
tests/unittest/executor/test_worker_base.py (3)
24-24: Remove unused variable assignment.
The worker variable is assigned but never used in the test. Since the test only verifies that the WorkerBase can be instantiated, you can simplify it.
 def test_create_engine(self):
-    with WorkerBase(engine=model_path) as worker:
+    with WorkerBase(engine=model_path):
         pass
43-46: Consider making the test more deterministic.
The test uses a fixed loop count and only checks if the queue is non-empty. Consider making it more specific about expected outcomes.
-    for i in range(10):
+    # Wait for responses until we get a final response or timeout
+    max_iterations = 20
+    for i in range(max_iterations):
         worker.await_responses()
+        if result_queue.qsize() > 0:
+            break
     assert result_queue.qsize() > 0
+    # Optionally, verify the response content
+    response = result_queue.get()
+    assert response is not None
48-67: Remove unused helper method.
The _create_executor_config method is defined but never used in the tests. Consider removing it to keep the test file clean.
If this method is intended for future use, consider adding a test that uses it or document why it's needed.
tensorrt_llm/executor/worker_base.py (4)
81-137: Comprehensive engine creation with multiple backend support.
The create_engine method properly handles different backend types and MPI configuration. Consider breaking this into smaller methods in future refactoring for better maintainability.
234-234: Consider breaking long lines for better readability.
Several lines exceed the 120 character limit. While not critical, consider breaking them up for better readability.
For example, line 234 could be split:
- # NOTE: Since, we only support PyTorch backend for multimodal, we will send multimodal_data through the 'py_multimodal_data' field
+ # NOTE: Since, we only support PyTorch backend for multimodal, we will send
+ # multimodal_data through the 'py_multimodal_data' field
Also applies to: 256-256, 262-262, 320-320, 398-398
187-187: Add closing punctuation to docstring.
The docstring should end with proper punctuation.
-    """Returns True if the adapter was loaded by this call, False if it was already loaded"""
+    """Returns True if the adapter was loaded by this call, False if it was already loaded."""
571-572: Reformat one-line docstring.
The docstring should fit on a single line per PEP 257.
-    """Compute logprob and prompt logprob and clear out logits if applicable.
-    """
+    """Compute logprob and prompt logprob and clear out logits if applicable."""
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- tensorrt_llm/executor/worker.py (4 hunks)
- tensorrt_llm/executor/worker_base.py (1 hunks)
- tests/unittest/executor/test_worker_base.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without reflection.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
- tests/unittest/executor/test_worker_base.py
- tensorrt_llm/executor/worker.py
- tensorrt_llm/executor/worker_base.py
**/*.{cpp,h,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
- tests/unittest/executor/test_worker_base.py
- tensorrt_llm/executor/worker.py
- tensorrt_llm/executor/worker_base.py
🧠 Learnings (3)
tests/unittest/executor/test_worker_base.py (2)
Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/executor/worker.py (2)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
🪛 Ruff (0.12.2)
tests/unittest/executor/test_worker_base.py
24-24: Local variable worker is assigned to but never used
Remove assignment to unused variable worker
(F841)
tensorrt_llm/executor/worker_base.py
187-187: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
234-234: Line too long (138 > 120)
(E501)
256-256: Line too long (153 > 120)
(E501)
262-262: Line too long (132 > 120)
(E501)
320-320: Line too long (165 > 120)
(E501)
398-398: Line too long (121 > 120)
(E501)
571-572: One-line docstring should fit on one line
Reformat to one line
(D200)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (5)
tensorrt_llm/executor/worker.py (2)
37-73: Clean refactoring to inherit from WorkerBase. The refactoring properly delegates core functionality to the WorkerBase parent class while maintaining worker-specific thread management. Good separation of concerns.
112-114: Proper delegation to parent class submit method. The submit method correctly ensures threads are started before delegating to the parent's implementation.
tensorrt_llm/executor/worker_base.py (3)
39-80: Well-structured base class initialization. The WorkerBase class is well-designed with clear separation of responsibilities and proper initialization of core components. Good use of weakref to avoid circular references.
208-218: Known race condition in LoRA adapter cache optimization. This code maintains the existing LoRA cache optimization logic with its known race condition limitation, as documented in previous reviews. This is acceptable as a more comprehensive solution would require significant architectural changes.
427-461: Well-designed response handling with multiple modes. The AwaitResponseHelper class provides a clean abstraction for different response handling modes (single process vs IPC batched). The type hint is correct.
Actionable comments posted: 7
🧹 Nitpick comments (3)
tensorrt_llm/executor/rpc_proxy.py (3)
33-42: Fix docstring formatting. The docstring doesn't follow proper formatting guidelines as indicated by static analysis.
-        """
+        """Initialize the RPC proxy executor.
+
         Args:
             worker_kwargs: kwargs for the rpc worker
             model_world_size: the world size of the model
             mpi_session: the mpi session to use
             postproc_worker_config: the postproc worker config
             is_llm_executor: whether this is an llm executor
             garbage_collection_gen0_threshold: the garbage collection gen0 threshold
-            clock_unit: the unit of the clock, 1 means 1 second
+            clock_unit: the unit of the clock, 1 means 1 second.
         """
89-90: Incomplete implementations with TODOs. Several methods have TODO comments or raise NotImplementedError, indicating incomplete functionality:
- Line 89: `get_stats_remote()` has TODO
- Line 131: `handle_stats()` raises NotImplementedError
- Line 141: `create_engine_remote()` has TODO
Would you like me to help implement these methods or create issues to track their completion?
Also applies to: 130-131, 141-141
101-129: Add type annotations for better code clarity. The method lacks return type annotation and the nested function could benefit from type hints.
-    def handle_responses(self, responses: list[GenerationResult]) -> bool:
+    def handle_responses(self, responses: list[GenerationResult]) -> None:
         async_queues = []
         event_loop = None

-        def process_res(res):
+        def process_res(res: GenerationResult) -> None:
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- tensorrt_llm/executor/rpc_proxy.py (1 hunks)
- tensorrt_llm/executor/rpc_worker.py (1 hunks)
🪛 Ruff (0.12.2)
tensorrt_llm/executor/rpc_proxy.py
33-41: 1 blank line required between summary line and description
(D205)
33-41: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
78-79: One-line docstring should fit on one line
Reformat to one line
(D200)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/executor/rpc_worker.py (1)
14-30: Well-designed RPC worker implementation. The `RpcWorker` class cleanly extends `WorkerBase` with minimal RPC-specific functionality. The shutdown event coordination pattern is appropriate for managing the lifecycle between the worker and RPC service.
Actionable comments posted: 8
🔭 Outside diff range comments (1)
tests/unittest/executor/test_rpc_worker.py (1)
97-100: Use unittest test runner for proper test execution. Replace the manual test execution with unittest's test runner for better test discovery and reporting.
 if __name__ == '__main__':
-    worker = TestRpcWorker()
-    worker.test_main()
+    unittest.main()
♻️ Duplicate comments (9)
tests/unittest/executor/test_rpc.py (1)
17-17: Use unique IPC paths to avoid test conflicts. Hard-coded IPC paths can cause conflicts when tests run in parallel.
Also applies to: 35-35, 49-49, 66-66, 83-83, 97-97, 122-122, 145-145, 176-176
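One way to address the hard-coded IPC path concern is to derive a fresh address per test. The following is a minimal sketch, not code from this PR: the helper name `make_unique_ipc_addr` and the `rpc_test` prefix are hypothetical, and it assumes the tests accept any ZeroMQ-style `ipc://` address.

```python
import os
import tempfile
import uuid


def make_unique_ipc_addr(prefix: str = "rpc_test") -> str:
    """Return an ipc:// address that is unique per call (hypothetical helper)."""
    # The PID plus a random uuid4 keeps addresses distinct across tests
    # and across processes that run the same test in parallel.
    socket_name = f"{prefix}_{os.getpid()}_{uuid.uuid4().hex}"
    return "ipc://" + os.path.join(tempfile.gettempdir(), socket_name)


addr_a = make_unique_ipc_addr()
addr_b = make_unique_ipc_addr()
```

Unix domain socket paths have a platform-dependent length limit, so a short prefix and a short temporary directory are safer than deeply nested paths.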
tensorrt_llm/executor/rpc_proxy.py (4)
1-1: Missing NVIDIA copyright header. All TensorRT-LLM source files should contain an NVIDIA copyright header.
43-46: Store worker_kwargs and clock_unit as instance variables. The parameters are used later but not stored as instance variables.
         GenerationExecutorRpcProxy.INSTANCE_COUNTER += 1
         self.rpc_addr = self.gen_uniq_rpc_addr()
         self.rpc_client = RPCClient(self.rpc_addr)
+        self.worker_kwargs = worker_kwargs
+        self.clock_unit = clock_unit
72-77: Undefined name 'rpc_worker_main' and AttributeError for worker_kwargs. The function `rpc_worker_main` is not imported or defined, and `self.worker_kwargs` is not defined.
Additionally, you need to import the missing function:
+from .rpc_worker import rpc_worker_main
145-158: Method should return the created MPI session. The method assigns to `self.mpi_session` but doesn't return it, yet it's called with assignment.
     def _create_mpi_session(self, model_world_size: int, mpi_session: Optional[MpiSession]):
         mpi_process_pre_spawned: bool = get_spawn_proxy_process_env()
         if mpi_session is None:
             if mpi_process_pre_spawned:
                 print_colored_debug('create comm session ...\n', "yellow")
                 self.mpi_session = create_mpi_comm_session(model_world_size)
             else:
                 print_colored_debug('create pool session ...\n', "yellow")
                 self.mpi_session = MpiPoolSession(n_workers=model_world_size)
         else:
             print_colored_debug('using external mpi session ...\n', "yellow")
             self.mpi_session = mpi_session
+        return self.mpi_session
tensorrt_llm/executor/rpc.py (3)
85-88: Add validation for socket initialization. The address property assumes `_client_socket` is not None.
284-295: Potential race condition in client cleanup. The `__del__` method calls `close()` which cancels `_reader_task`, but the task might still be executing.
384-387: Handle nested event loops properly. The `_call_sync` method uses `asyncio.run()` which will fail if called from within an existing event loop.
tensorrt_llm/executor/worker.py (1)
116-124: Add validation for client_id in return_queue. The method assumes `client_id` exists in `_results`.
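The nested event loop comment on `_call_sync` above can be illustrated with a small standalone sketch. This is not the fix adopted in the PR: `run_coro_sync` is a hypothetical helper, and it assumes that blocking the caller until the coroutine finishes is acceptable.

```python
import asyncio
import concurrent.futures


def run_coro_sync(coro):
    """Run `coro` to completion from sync code, with or without a running loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run() is safe.
        return asyncio.run(coro)
    # A loop is already running here and asyncio.run() would raise RuntimeError,
    # so run the coroutine on a fresh loop in a helper thread and block on it.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()


async def add(a, b):
    await asyncio.sleep(0)
    return a + b


async def caller_inside_loop():
    # Calls the sync helper while an event loop is running in this thread.
    return run_coro_sync(add(2, 3))


outside_result = run_coro_sync(add(1, 2))
inside_result = asyncio.run(caller_inside_loop())
```

The second branch blocks the outer event loop while it waits, so it only suits short calls; an async caller is usually better served by awaiting the coroutine directly.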
🧹 Nitpick comments (6)
tests/unittest/executor/test_rpc_worker.py (1)
63-64: Address the known shutdown hang issue. The comments indicate a known issue where responses must be fetched before shutdown to avoid hanging. This should be properly addressed or tracked.
This known issue could lead to test flakiness and production problems. Would you like me to:
- Open an issue to track this problem?
- Implement a proper solution that ensures all responses are fetched before shutdown?
- Add a timeout mechanism to prevent indefinite hanging?
Also applies to: 80-81
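The timeout option from the list above could look roughly like the sketch below. It is an illustration under assumptions, not the worker's actual shutdown path: `join_with_timeout`, `worker_loop`, and `stop_event` are hypothetical names, and the worker is modelled as a plain polling thread.

```python
import threading
import time


def join_with_timeout(thread: threading.Thread, timeout_s: float) -> bool:
    """Join `thread`; return False instead of blocking forever if it does not exit."""
    thread.join(timeout=timeout_s)
    return not thread.is_alive()


stop_event = threading.Event()


def worker_loop():
    # Stand-in for a worker loop that polls until shutdown is requested.
    while not stop_event.is_set():
        time.sleep(0.01)


worker_thread = threading.Thread(target=worker_loop, daemon=True)
worker_thread.start()

stop_event.set()  # request shutdown
exited_cleanly = join_with_timeout(worker_thread, timeout_s=2.0)
```

A caller can log or raise when the helper returns False, which turns an indefinite hang into a visible test failure.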
tests/unittest/executor/test_rpc.py (1)
231-232: Make timeout error assertion more specific. The assertion for timeout error message should be more specific to ensure the correct error type is raised.
-        assert "timed out" in str(
-            exc_info.value), f"Timeout message not found: {exc_info.value}"
+        assert "timed out" in str(exc_info.value) or "timeout" in str(exc_info.value).lower(), \
+            f"Timeout message not found in error: {exc_info.value}"
+        # Also verify it's an RPCError or RPCTimeout
+        assert isinstance(exc_info.value, RPCError), \
+            f"Expected RPCError, got {type(exc_info.value)}"
tensorrt_llm/executor/rpc_proxy.py (1)
78-93: Improve timing logic clarity in main loop. The modulo operations with clock counter seem arbitrary. Consider using more descriptive timing mechanisms.
     def main_loop_task(self):
         """
         Main loop of the proxy, it will invoke the actions periodically.
         """
-        clock = 0
+        last_stats_time = time.time()
+        stats_interval = 10 * self.clock_unit  # Fetch stats every 10 clock units
+
         while not self._shutdown_event.is_set():
-            if clock % 1 == 0:
-                responses = self.await_responses_remote()
-                self.handle_responses(responses)
-            if clock % 10 == 0:
+            # Always fetch responses
+            responses = self.await_responses_remote()
+            self.handle_responses(responses)
+
+            # Fetch stats periodically
+            current_time = time.time()
+            if current_time - last_stats_time >= stats_interval:
                 stats = self.get_stats_remote()  # TODO
                 self.handle_stats(stats)
+                last_stats_time = current_time
-            clock += 1
             time.sleep(self.clock_unit)
tensorrt_llm/executor/rpc.py (1)
179-202: Simplify duplicated executor logic. The async and sync execution branches have nearly identical code that can be consolidated.
         try:
-            if self._executor is not None:
-                # Dispatch to worker thread and await result with timeout
-                loop = asyncio.get_running_loop()
-
-                # Create a wrapper function to handle keyword arguments
-                def call_with_kwargs():
-                    return self._functions[req.method_name](
-                        *req.args, **req.kwargs)
-
-                result = await asyncio.wait_for(loop.run_in_executor(
-                    self._executor, call_with_kwargs),
-                                                timeout=req.timeout)
-            else:
-                # For synchronous execution, we need to run in executor to support timeout
-                loop = asyncio.get_running_loop()
-
-                # Create a wrapper function to handle keyword arguments
-                def call_with_kwargs():
-                    return self._functions[req.method_name](
-                        *req.args, **req.kwargs)
-
-                result = await asyncio.wait_for(loop.run_in_executor(
-                    None, call_with_kwargs),
-                                                timeout=req.timeout)
+            loop = asyncio.get_running_loop()
+
+            # Create a wrapper function to handle keyword arguments
+            def call_with_kwargs():
+                return self._functions[req.method_name](
+                    *req.args, **req.kwargs)
+
+            # Use the configured executor or None for default thread pool
+            executor = self._executor if self._executor is not None else None
+            result = await asyncio.wait_for(
+                loop.run_in_executor(executor, call_with_kwargs),
+                timeout=req.timeout
+            )
             response = RPCResponse(req.request_id, 'OK', result)
tensorrt_llm/executor/worker_base.py (2)
192-193: Fix docstring formatting. The docstring should end with a period according to Python docstring conventions.
-    """Returns True if the adapter was loaded by this call, False if it was already loaded"""
+    """Returns True if the adapter was loaded by this call, False if it was already loaded."""
632-636: Fix docstring formatting. The one-line docstring should fit on a single line.
-def _get_logprobs(worker,
-                  response: tllm.Response,
-                  is_pytorch_backend=False) -> Optional[LogProbsResult]:
-    """Compute logprob and prompt logprob and clear out logits if applicable.
-    """
+def _get_logprobs(worker,
+                  response: tllm.Response,
+                  is_pytorch_backend=False) -> Optional[LogProbsResult]:
+    """Compute logprob and prompt logprob and clear out logits if applicable."""
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- tensorrt_llm/executor/ipc.py (2 hunks)
- tensorrt_llm/executor/rpc.py (1 hunks)
- tensorrt_llm/executor/rpc_proxy.py (1 hunks)
- tensorrt_llm/executor/rpc_worker.py (1 hunks)
- tensorrt_llm/executor/worker.py (6 hunks)
- tensorrt_llm/executor/worker_base.py (1 hunks)
- tensorrt_llm/llmapi/utils.py (1 hunks)
- tests/unittest/executor/test_rpc.py (1 hunks)
- tests/unittest/executor/test_rpc_worker.py (1 hunks)
- tests/unittest/executor/test_worker_base.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- tensorrt_llm/executor/ipc.py
- tensorrt_llm/llmapi/utils.py
- tensorrt_llm/executor/rpc_worker.py
- tests/unittest/executor/test_worker_base.py
🧠 Learnings (6)
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
- tests/unittest/executor/test_rpc_worker.py
- tests/unittest/executor/test_rpc.py
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
- tests/unittest/executor/test_rpc_worker.py
- tensorrt_llm/executor/rpc_proxy.py
- tensorrt_llm/executor/rpc.py
- tensorrt_llm/executor/worker_base.py
- tensorrt_llm/executor/worker.py
📚 Learning: applies to **/*.{cpp,h,hpp,cc,cxx,cu,py} : all tensorrt-llm open source software code should contain...
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-01T07:34:42.704Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx,cu,py} : All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Applied to files:
tensorrt_llm/executor/rpc_proxy.py
📚 Learning: applies to **/*.py : the code developed for tensorrt-llm should conform to python 3.8+....
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-01T07:34:42.704Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.
Applied to files:
tensorrt_llm/executor/rpc_proxy.py
📚 Learning: applies to **/*.{h,hpp} : use a preprocessor guard in header files. the guard name must have prefix ...
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-01T07:34:42.704Z
Learning: Applies to **/*.{h,hpp} : Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.
Applied to files:
tensorrt_llm/executor/rpc_proxy.py
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Applied to files:
tensorrt_llm/executor/worker.py
🪛 Ruff (0.12.2)
tensorrt_llm/executor/rpc_proxy.py
32-40: 1 blank line required between summary line and description
(D205)
32-40: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
72-72: Undefined name rpc_worker_main
(F821)
77-78: One-line docstring should fit on one line
Reformat to one line
(D200)
tensorrt_llm/executor/rpc.py
40-41: One-line docstring should fit on one line
Reformat to one line
(D200)
254-255: One-line docstring should fit on one line
Reformat to one line
(D200)
262-266: 1 blank line required between summary line and description
(D205)
440-442: 1 blank line required between summary line and description
(D205)
451-451: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
455-455: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
460-460: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
tensorrt_llm/executor/worker_base.py
21-21: Redefinition of unused logger from line 12
(F811)
192-192: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
239-239: Line too long (138 > 120)
(E501)
261-261: Line too long (153 > 120)
(E501)
267-267: Line too long (132 > 120)
(E501)
325-325: Line too long (165 > 120)
(E501)
453-453: Line too long (121 > 120)
(E501)
634-635: One-line docstring should fit on one line
Reformat to one line
(D200)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (8)
tensorrt_llm/executor/worker.py (1)
234-261: Well-implemented shutdown logic. The shutdown method properly handles thread cleanup with appropriate checks for thread states and includes error handling before final shutdown.
tensorrt_llm/executor/worker_base.py (7)
44-85: LGTM! The `WorkerBase` class initialization is well-structured with appropriate instance variable initialization and proper use of weakref to prevent circular references.
86-142: LGTM! The `setup_engine` method properly handles device initialization, MPI communication setup, and supports multiple backend configurations.
213-226: Known LoRA adapter cache optimization limitation. Based on previous learnings, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` has a known race condition issue that requires a comprehensive solution beyond simple error handling.
372-403: LGTM! The `submit` method properly enforces rank restrictions and manages the request lifecycle correctly.
411-414: LGTM! Simple delegation method with appropriate debug logging.
438-452: LGTM! The stats serialization logic properly handles both iteration and request statistics.
453-462: LGTM! The queue setter methods properly enforce mutual exclusivity between result queue and postprocessing queues.
b53ed9f to 2a80571
/bot run --disable-fail-fast
PR_Github #20523 [ run ] triggered by Bot
PR_Github #20521 [ run ] completed with state
PR_Github #20523 [ run ] completed with state
/bot run
PR_Github #20542 [ run ] triggered by Bot
/bot run --disable-fail-fast
PR_Github #20543 [ run ] triggered by Bot
PR_Github #20542 [ run ] completed with state
PR_Github #20543 [ run ] completed with state
00a1419 to 1d97a59
/bot run --disable-fail-fast
PR_Github #20566 [ run ] triggered by Bot
PR_Github #20566 [ run ] completed with state
Signed-off-by: chunweiy <chunweiy@nvidia.com>
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
/bot run
PR_Github #20593 [ run ] triggered by Bot
PR_Github #20593 [ run ] completed with state
/bot run
PR_Github #20628 [ run ] triggered by Bot
PR_Github #20628 [ run ] completed with state
This PR adds RPC support to the orchestrator to improve robustness and simplify the codebase.
The design is outlined in these slides.
Changes
1. Lightweight RPC module
- Introduces a basic RPC implementation tailored to our current IPC use cases.
- Supports three remote call modes:
  - `rpc_client.some_func(args...).remote(rpc_args...)`
  - `await rpc_client.some_async_func(args...).remote(rpc_args...)`
  - `async for x in rpc_client.some_async_iter_func(args...).remote(rpc_args...)`
- Designed to align with existing IPC logic for easier performance parity.
- Pure Python implementation, simple to optimize.
- Native `pickle` support ensures compatibility.
- Reuses existing components (e.g., `ZeroIpcQueue` with HMAC).
- Built-in timeout and error propagation.
  - `RPCTimeout` if they exceed the threshold.
- Can be reused in other modules that require IPC.
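To make the `client.some_func(args...).remote(rpc_args...)` call shape concrete, here is a toy in-process sketch of the synchronous mode only. It is not the PR's `RPCClient`: `ToyClient`, `_PendingCall`, and `ToyRPCTimeout` are hypothetical names, a thread pool stands in for the ZeroMQ transport, and the async and async-iterator modes are not modelled.

```python
import concurrent.futures
import time


class ToyRPCTimeout(Exception):
    """Hypothetical stand-in for a timeout error; not the PR's RPCTimeout."""


class _PendingCall:
    """Holds a function and its arguments until .remote() is invoked."""

    def __init__(self, func, args, kwargs, pool):
        self._func, self._args, self._kwargs, self._pool = func, args, kwargs, pool

    def remote(self, timeout=None):
        # Run the call on a worker thread and wait up to `timeout` seconds.
        # Exceptions raised by the function propagate through future.result().
        future = self._pool.submit(self._func, *self._args, **self._kwargs)
        try:
            return future.result(timeout=timeout)
        except concurrent.futures.TimeoutError:
            raise ToyRPCTimeout(f"call exceeded {timeout}s") from None


class ToyClient:
    """Maps attribute access to registered functions (toy, in-process only)."""

    def __init__(self, functions):
        self._functions = dict(functions)
        self._pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)

    def __getattr__(self, name):
        try:
            func = self._functions[name]
        except KeyError:
            raise AttributeError(name) from None
        return lambda *args, **kwargs: _PendingCall(func, args, kwargs, self._pool)


client = ToyClient({"add": lambda a, b: a + b, "slow": lambda: time.sleep(0.5)})
result = client.add(1, 2).remote()
try:
    client.slow().remote(timeout=0.05)
    timed_out = False
except ToyRPCTimeout:
    timed_out = True
```

Splitting the call into a builder step and a `.remote()` step keeps function arguments separate from transport options such as the timeout, which is the property the three call modes above share.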
2. RPC wrappers for `GenerationExecutor`
- `RpcWorker`: Inherits from `BaseWorker`, exposing methods suitable for remote calls (e.g., `fetch_responses_loop_async`).
  - `main_task`, similar to `GenerationExecutorWorker.worker_main`, but initializes an `RPCServer`.
- `RpcProxy`: Manages a single `main_loop_thread` for receiving responses from `RpcWorker`.
3. Coexistence with existing IPC
- Adds `orchestrator_type` to `LlmArgs`. Default behavior remains unchanged.
  - `"RPC"`: Enables the new RPC path using MPI.
  - `"Ray"`: Activates the Ray-based path.
- The RPC path will eventually replace the existing IPC once feature and performance parity is achieved.
Roadmap
Control flow
RPC framework
The whole RPCProxy dataflow
The new RPC path coexists with the IPC path
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message.
See details below for each supported subcommand.
run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]
Launch build/test pipelines. All previously running jobs will be killed.
--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.
--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.