[None][opt] Balance the request based on number of tokens in AttentionDP #7183
Conversation
📝 Walkthrough

`ExecutorRequestQueue.fetch_new_requests` signature changed to accept a list of `LlmRequest` (`activate_requests`). Attention-DP paths now compute per-rank active token counts, allgather both the request and token arrays, and propagate token-aware scheduling and balancing through the fetch, schedule, and balance functions. Tests updated to exercise the token-aware logic.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Caller
    participant ExecQueue as ExecutorRequestQueue
    participant Dist as Dist(tp_allgather)
    participant Scheduler as Attention-DP Scheduler
    participant Balancer
    Caller->>ExecQueue: fetch_new_requests(activate_requests)
    ExecQueue->>ExecQueue: compute local_num_tokens = sum(req.input_token_ids)
    ExecQueue->>Dist: allgather([len(activate_requests), local_num_tokens])
    Dist-->>ExecQueue: all_ranks_num_active_requests, all_ranks_num_active_tokens
    ExecQueue->>Scheduler: _schedule_attention_dp_requests(activate_requests, all_ranks_num_active_requests, all_ranks_num_active_tokens)
    Scheduler->>Balancer: _balance_requests_across_ranks(requests, capacities, all_ranks_num_active_requests, all_ranks_num_active_tokens)
    Balancer-->>Scheduler: assignments (token-aware)
    Scheduler-->>ExecQueue: activation plan
    ExecQueue-->>Caller: activated requests
    note right of Balancer: New: heap init & updates use per-rank token counts
```
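For orientation, here is a minimal sketch of the gather step shown in the diagram. It is not the production code: `gather_active_counts` is a hypothetical helper name, and it assumes `tp_allgather` returns one entry per TP rank and that each request exposes `input_token_ids`.

```python
def gather_active_counts(dist, activate_requests):
    # Local load: number of active requests plus their total prompt-token count.
    local_num_tokens = sum(len(req.input_token_ids) for req in activate_requests)
    # A single allgather carries both numbers; each rank contributes one pair.
    gathered = dist.tp_allgather([len(activate_requests), local_num_tokens])
    all_ranks_num_active_requests = [pair[0] for pair in gathered]
    all_ranks_num_active_tokens = [pair[1] for pair in gathered]
    return all_ranks_num_active_requests, all_ranks_num_active_tokens
```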
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
/bot run --disable-fail-fast

PR_Github #16275 [ run ] triggered by Bot
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (1)
516-530: Ensure empty check before heap operations.

The heap operations could fail if `all_ranks_new_requests_heap` is empty after filtering.

```diff
     heapq.heapify(all_ranks_new_requests_heap)

     # Sort by token count (descending) for better load balancing
     new_requests = sorted(
         new_requests,
         key=lambda x: len(getattr(x.request, 'input_token_ids', []))
         if x.request else 0,
         reverse=True)

     # Distribute requests across ranks
     for req_item in new_requests:
-
+        if not all_ranks_new_requests_heap:
+            break
         val = heapq.heappop(all_ranks_new_requests_heap)
         token_count = len(
             getattr(req_item.request,
                     'input_token_ids', [])) if req_item.request else 0
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (8 hunks)
- tests/unittest/_torch/executor/test_executor_request_queue.py (20 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in `__init__`
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
- tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
- tests/unittest/_torch/executor/test_executor_request_queue.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
- tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
- tests/unittest/_torch/executor/test_executor_request_queue.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (17)
tests/unittest/_torch/executor/test_executor_request_queue.py (11)
456-459: LGTM! Token tracking fixture added correctly.

The new `all_ranks_num_active_tokens` fixture properly complements the existing `all_ranks_num_active_requests` fixture and supports the token-aware scheduling feature being tested.

484-500: Test parameter updated correctly for token-aware scheduling.

The test now properly passes the `all_ranks_num_active_tokens` parameter to `_schedule_attention_dp_requests`, aligning with the production code's new token-aware scheduling logic.

662-673: Test correctly validates token parameter propagation.

The test properly verifies that `_balance_requests_across_ranks` is called with both the request counts and token counts parameters.

869-962: Test cases comprehensively cover token-based distribution scenarios.

The parameterized test cases thoroughly validate the token-aware distribution logic across various scenarios including balanced distribution, capacity limits, targeted vs. relaxed requests, and mixed cases.

995-1001: Good integration test enhancement for token tracking.

The test correctly creates mock token counts and passes them through the scheduling workflow, ensuring proper integration of the token-aware scheduling feature.

1011-1028: Comprehensive test for empty requests scenario.

The test properly validates that `_balance_requests_across_ranks` handles empty request lists correctly while maintaining the per-rank structure.

1030-1053: Test validates single request assignment to lowest token count rank.

The test correctly verifies that a single request is assigned to the rank with the lowest active token count (rank 2), demonstrating proper heap-based load balancing.

1055-1098: Well-structured test for multi-request load balancing.

The test correctly validates that multiple requests are distributed based on token counts and current rank loads. The expected distribution pattern follows the heap-based algorithm correctly.

1100-1130: Good test for capacity limit enforcement.

The test properly validates that the balancing algorithm respects the `expected_num_active_requests` capacity limit for each rank.

1132-1175: Test validates heap ordering correctly.

The test properly verifies that requests are distributed based on heap ordering (lowest active count first) when token counts are equal.

1177-1217: Excellent test for token count sorting behavior.

The test correctly validates that requests are sorted by token count in descending order before distribution, ensuring proper load balancing across ranks.
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (6)
331-342: LGTM! Token-aware scheduling implementation looks correct.

The implementation properly:
- Collects token counts from active requests
- Uses `tp_allgather` to distribute both request counts and token counts across ranks
- Unpacks and stores the gathered data correctly

355-357: Good integration of token counts into scheduling.

The token counts are properly passed to `_schedule_attention_dp_requests` for token-aware load balancing.

373-376: Method signature correctly updated for token-aware scheduling.

The additional `all_ranks_num_active_tokens` parameter enables token-based load balancing in the scheduling logic.

421-423: Token counts properly integrated into balancing logic.

The token counts are correctly passed to `_balance_requests_across_ranks` for improved load distribution.

478-482: Method signature updated for token-aware balancing.

The additional `all_ranks_num_active_tokens` parameter enables the balancing algorithm to consider token loads when distributing requests.

490-491: Good initialization of heap with token counts.

Using actual token counts instead of 0 for initial heap values provides better load balancing from the start.
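To make the heap behavior concrete, here is a minimal sketch of the token-aware balancing these comments describe. `balance_by_tokens` is a hypothetical standalone helper, not the method in `executor_request_queue.py`, and it assumes each request exposes `input_token_ids`.

```python
import heapq

def balance_by_tokens(new_requests, all_ranks_num_active_tokens,
                      all_ranks_num_active_requests,
                      expected_num_active_requests):
    """Sketch: assign new requests to ranks, keeping per-rank token loads even."""
    assignments = {rank: [] for rank in range(len(all_ranks_num_active_tokens))}
    # Heap entries: (token_load, request_count, rank); ranks at capacity are excluded.
    heap = [
        (tokens, count, rank)
        for rank, (tokens, count) in enumerate(
            zip(all_ranks_num_active_tokens, all_ranks_num_active_requests))
        if count < expected_num_active_requests
    ]
    heapq.heapify(heap)
    # Largest requests first so big prompts land on the least-loaded ranks.
    for req in sorted(new_requests,
                      key=lambda r: len(r.input_token_ids),
                      reverse=True):
        if not heap:
            break
        tokens, count, rank = heapq.heappop(heap)
        assignments[rank].append(req)
        tokens += len(req.input_token_ids)
        count += 1
        if count < expected_num_active_requests:
            heapq.heappush(heap, (tokens, count, rank))
    return assignments
```

Seeding the heap with real token counts rather than zeros means an already busy rank only receives new work once lighter ranks have caught up.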
PR_Github #16275 [ run ] completed with state

/bot run --disable-fail-fast

Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>

/bot run --disable-fail-fast

PR_Github #16501 [ run ] triggered by Bot

PR_Github #16502 [ run ] triggered by Bot
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (2)
392-393: Sorting contradicts intent; non-relax requests should be prioritized.

The comment says "Prioritize the requests that are not in relax mode," but `reverse=True` places relaxed (True) before non-relaxed (False). Drop `reverse=True` (or invert the key).

```diff
-    new_requests = sorted(new_requests, key=get_relax_value, reverse=True)
+    new_requests = sorted(new_requests, key=get_relax_value)
```

401-407: Token-aware balancing should account for targeted assignments too.

When a request is scheduled to its `target_dp_rank`, you increment `all_ranks_num_active_requests` but not `all_ranks_num_active_tokens`. This skews the subsequent heap-based token balancing.

```diff
     if target_dp_rank is not None and all_ranks_num_active_requests[
             target_dp_rank] < self.max_num_active_requests:
         all_ranks_num_active_requests[target_dp_rank] += 1
         scheduled = True
         all_ranks_new_requests[target_dp_rank].append(req_item)
+        # Keep token load in sync for fair balancing later
+        token_count = 0
+        if req_item.request is not None:
+            token_count = len(getattr(req_item.request, "input_token_ids", []))
+        all_ranks_num_active_tokens[target_dp_rank] += token_count
```
♻️ Duplicate comments (3)
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (1)
305-312: Public API change: document and align naming.

The `fetch_new_requests` signature now accepts a `List[LlmRequest]`. Please:

- Document this breaking change in release notes/migration guide, and update the method docstring to reflect the new parameter.
- Prefer the parameter name `active_requests` (not `activate_requests`) for clarity/consistency across the codebase.

Suggested rename (non-functional, improves readability):

```diff
-    def fetch_new_requests(
-            self, activate_requests: List[LlmRequest]) -> List[LlmRequest]:
+    def fetch_new_requests(
+            self, active_requests: List[LlmRequest]) -> List[LlmRequest]:
@@
-        if self.enable_attention_dp:
-            return self._fetch_new_requests_attention_dp(activate_requests)
+        if self.enable_attention_dp:
+            return self._fetch_new_requests_attention_dp(active_requests)
-        else:
-            return self._fetch_new_requests_attention_tp(len(activate_requests))
+        else:
+            return self._fetch_new_requests_attention_tp(len(active_requests))
```

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
1187-1189: Call site updated to pass active requests list — good.

Internal usage now matches the new API and unblocks ADP token balancing. Ensure any other call sites are updated.

Run to verify no remaining int-based calls:

```bash
#!/bin/bash
# Find fetch_new_requests call sites that are not passing a list named active_requests
rg -nP '\.fetch_new_requests\((?!\s*self\.active_requests\s*\))' --type py -g '!**/test/**'
```

tests/unittest/_torch/executor/test_executor_request_queue.py (1)
642-644: Fix the outdated arithmetic comment to match the implementation.

Use the exact formula used in code: ceil-div via `(x + tp_size - 1) // tp_size`. Prior review flagged this.

```diff
-    # 2 + 1 + 3 + 0 = 6, 6 + 2 = 8, (8 + 3) // 4 = 2, max(2, 2, 1, 3, 0) = 3
-    # expected_num_active_requests = max((6 + 2 + 3) // 4, 3) = max(2, 3) = 3
+    # 2 + 1 + 3 + 0 = 6, 6 + 2 = 8; tp_size = 4
+    # expected_num_active_requests = max(((6 + 2) + 4 - 1) // 4, max(2, 1, 3, 0))
+    #                              = max(11 // 4 = 2, 3) = 3
```
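As a worked check of that formula, here is a small hypothetical helper mirroring the corrected comment (the real computation lives inside the executor queue, not in a free function):

```python
def expected_num_active_requests(all_ranks_num_active_requests, num_new_requests, tp_size):
    # Ceil-divide the total load over the ranks, but never drop below the busiest rank.
    total = sum(all_ranks_num_active_requests) + num_new_requests
    return max((total + tp_size - 1) // tp_size,
               max(all_ranks_num_active_requests))

# Values from the comment above: ranks hold [2, 1, 3, 0] requests, 2 new arrive, tp_size = 4.
assert expected_num_active_requests([2, 1, 3, 0], 2, 4) == 3  # max(11 // 4 = 2, 3) = 3
```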
🧹 Nitpick comments (6)
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (2)
1-2: Add NVIDIA copyright header.

Per repository guidelines, prepend the NVIDIA copyright header to all .py sources.

Apply at file top:

```diff
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
```
476-534: Heap-based balancing: edge-case safety and maintainability nits.
- If every rank is already at expected_num_active_requests, the filtered heap is empty; current logic only executes when new_requests is truthy, but an explicit early return after filtering would be clearer.
- Micro-nit: use clearer tuple fields or a dataclass to self-document the heap item.
Example early guard:
```diff
     all_ranks_new_requests_heap = [
         val for val in all_ranks_new_requests_heap
         if val.num_requests < self.expected_num_active_requests
     ]
+    if not all_ranks_new_requests_heap:
+        return all_ranks_new_requests
```

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1-2: Add NVIDIA copyright header.

Please prepend the NVIDIA header to comply with coding guidelines.

```diff
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
```

tests/unittest/_torch/executor/test_executor_request_queue.py (3)

1-2: Add NVIDIA copyright header in tests.

Tests are also Python sources; add the required header.

```diff
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
```

661-665: Remove no-op statement in the test.

The standalone `all_ranks_new_requests[0]` has no effect; drop it to keep tests clean.

```diff
-    all_ranks_new_requests[0]
```

1003-1004: Avoid printing from tests.

Printing can add noise to CI logs. Prefer assertions only.

```diff
-    print("all_ranks_new_requests:", all_ranks_new_requests)
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (8 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (1 hunks)
- tests/unittest/_torch/executor/test_executor_request_queue.py (20 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in `__init__`
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tests/unittest/_torch/executor/test_executor_request_queue.py
- tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tests/unittest/_torch/executor/test_executor_request_queue.py
- tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
🧬 Code graph analysis (2)
tests/unittest/_torch/executor/test_executor_request_queue.py (1)
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (3)
- _schedule_attention_dp_requests (372-424)
- _balance_requests_across_ranks (477-535)
- RequestQueueItem (24-38)
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (5)
tensorrt_llm/_torch/pyexecutor/llm_request.py (3)
- LlmRequest (282-424)
- append (79-98)
- append (125-142)

tests/unittest/_torch/executor/test_executor_request_queue.py (2)
- all_ranks_num_active_requests (452-453)
- all_ranks_num_active_tokens (457-458)

tensorrt_llm/serve/router.py (2)
- num_active_tokens (139-140)
- num_active_requests (142-143)

tensorrt_llm/_torch/distributed/communicator.py (1)
- tp_allgather (138-139)

tensorrt_llm/mapping.py (1)
- tp_rank (338-339)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tests/unittest/_torch/executor/test_executor_request_queue.py (2)
456-459: Fixture for per-rank token counts looks good.

This aligns tests with the new token-aware balancing path.

1011-1218: Strong coverage for heap/token-based balancing.

The new tests exercise empty, single, multiple, capacity-limit, heap-ordering, and token-sort scenarios. Nice.
PR_Github #16502 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #16543 [ run ] triggered by Bot
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/unittest/_torch/executor/test_executor_request_queue.py (1)
657-663: Patch stub should preserve return type: use dict[int, list[RequestQueueItem]]

Mocking `_balance_requests_across_ranks` to return `{0: req1}` violates the function's contract and may hide type errors. Return per-rank lists instead.

Apply:

```diff
-    mock_balance.return_value = {0: req1}
+    mock_balance.return_value = {0: [req1], 1: [], 2: [], 3: []}
```

Also applies to: 671-673
♻️ Duplicate comments (1)
tests/unittest/_torch/executor/test_executor_request_queue.py (1)
642-644: Fix the explanatory comment to match the formula (duplicate of prior feedback)

The second line is correct now, but the first line still shows an extra "2" inside the max(...) tuple. Suggest tightening the two-line explanation to mirror the code's ceil-div and max operations precisely.

Apply:

```diff
-    # 2 + 1 + 3 + 0 = 6, 6 + 2 = 8, (8 + 3) // 4 = 2, max(2, 2, 1, 3, 0) = 3
-    # expected_num_active_requests = max((6 + 2 + 3) // 4, 3) = max(2, 3) = 3
+    # total_active = 2 + 1 + 3 + 0 = 6
+    # num_new = 2, tp_size = 4 -> ceil((6 + 2) / 4) = (8 + 4 - 1) // 4 = 11 // 4 = 2
+    # max_active_per_rank = max(2, 1, 3, 0) = 3
+    # expected_num_active_requests = max(2, 3) = 3
```
🧹 Nitpick comments (2)
tests/unittest/_torch/executor/test_executor_request_queue.py (2)
1004-1004: Remove debug print in tests

Avoid stdout noise in test runs; assertions already validate behavior.

Apply:

```diff
-    print("all_ranks_new_requests:", all_ranks_new_requests)
```
1100-1130: Capacity limit test is valuable, but it validates 'expected' capacity, not max per-rank capacity

This test ensures ranks don't exceed `expected_num_active_requests`. Add a complementary test to ensure scheduling never exceeds `max_num_active_requests` when `expected_num_active_requests > max_num_active_requests`.

Proposed additional test:

```python
def test_balance_requests_respects_max_capacity_even_if_expected_higher(attention_dp_queue):
    # Simulate expected > max per-rank capacity
    attention_dp_queue.max_num_active_requests = 2
    attention_dp_queue.expected_num_active_requests = 5  # deliberately higher

    all_ranks_num_active_requests = [2, 0, 0, 0]  # rank 0 already at max
    all_ranks_num_active_tokens = [0, 0, 0, 0]

    # 3 new relaxed requests (no target)
    reqs = []
    for i in range(3):
        r = RequestQueueItem(i, create_mock_request_with_py_schedule_params(attention_dp_rank=None))
        r.request.input_token_ids = [1, 2, 3]
        reqs.append(r)

    # Balance
    all_ranks_new_requests = {0: [], 1: [], 2: [], 3: []}
    result = attention_dp_queue._balance_requests_across_ranks(
        reqs, all_ranks_new_requests, all_ranks_num_active_requests,
        all_ranks_num_active_tokens
    )

    # Rank 0 must stay at capacity; others can receive at most 2 total per rank
    assert len(result[0]) == 0
    assert len(result[1]) <= 2
    assert len(result[2]) <= 2
    assert len(result[3]) <= 2
```

Note: This may currently fail if `_balance_requests_across_ranks` does not enforce `max_num_active_requests`. If so, production code should cap per-rank assignments using `max_num_active_requests` while balancing.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (8 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (1 hunks)
- tests/unittest/_torch/executor/test_executor_request_queue.py (20 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in `__init__`
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tests/unittest/_torch/executor/test_executor_request_queue.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tests/unittest/_torch/executor/test_executor_request_queue.py
🧬 Code graph analysis (1)
tests/unittest/_torch/executor/test_executor_request_queue.py (1)
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (3)
- _schedule_attention_dp_requests (372-424)
- _balance_requests_across_ranks (477-535)
- RequestQueueItem (24-38)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (10)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
1188-1189: fetch_new_requests signature and usage verified

- The method `fetch_new_requests(self, activate_requests: List[LlmRequest]) -> List[LlmRequest]` is defined in tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (around line 305).
- The only call site in tensorrt_llm/_torch/pyexecutor/py_executor.py (lines 1187–1188) passes `self.active_requests` directly.
- No instances of `fetch_new_requests(len(...))` remain in the codebase.

All call sites and the definition align with the updated queue API and correctly pass the full request list.
tests/unittest/_torch/executor/test_executor_request_queue.py (9)
456-459: LGTM: token-count fixture introduced

The `all_ranks_num_active_tokens` fixture establishes the per-rank token surface needed for token-aware balancing. Naming and values are clear.

483-501: Good propagation: tests now pass tokens into `_schedule_attention_dp_requests`

Adding the fourth argument (`all_ranks_num_active_tokens`) validates the new scheduling contract. This ensures tests cover token-driven behavior rather than only request counts.

510-527: Broad coverage for token-aware paths

Multiple tests were updated to pass `all_ranks_num_active_tokens` into scheduling. This is the right breadth to catch regressions. Nice work.

Also applies to: 536-552, 559-576, 579-606, 614-620, 624-640, 676-694, 696-724, 726-751

995-1002: Seeding deterministic token loads in integration helper is solid

Creating a simple increasing token baseline per rank makes the distribution predictable while remaining realistic. Consider parameterizing this in future if more scenarios are needed.

1011-1028: _balance_requests_across_ranks: empty input test is accurate and necessary

Covers the no-op path and verifies structure preservation across ranks.

1030-1053: Single-request placement test looks good

Validates that the lowest-load rank is selected under the heap policy.

1055-1098: Multiple-requests distribution: expectations match heap + token-desc ordering

The step-by-step expectation aligns with "largest request to lowest token rank first." Good guidance in comments and correct assertions.

1132-1175: Heap ordering test is precise

Confirms tiebreakers favor ranks with fewer active requests (then rank id), which matches the namedtuple/heap semantics.

1177-1218: Token-desc sorting test validates the key heuristic

Ensures requests are sorted by token count before distribution, then assigned to the least-loaded ranks; assertions are spot on.
PR_Github #16543 [ run ] completed with state
LGTM
```diff
 @nvtx_range("_fetch_new_requests")
-def fetch_new_requests(self, num_active_requests: int) -> List[LlmRequest]:
+def fetch_new_requests(
+        self, activate_requests: List[LlmRequest]) -> List[LlmRequest]:
```
What does activate_requests mean? Should that be active_requests?
Summary by CodeRabbit
New Features
Refactor
Tests
Description
Test Coverage
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message.

See details below for each supported subcommand.

run

`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill

`kill`

Kill all running builds associated with pull request.

skip

`skip --comment COMMENT`

Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.