[None][opt] Balance the request based on number of tokens in AttentionDP by Shunkangz · Pull Request #7183 · NVIDIA/TensorRT-LLM · GitHub

Conversation

@Shunkangz
Collaborator

@Shunkangz Shunkangz commented Aug 23, 2025

Summary by CodeRabbit

  • New Features

    • Token-aware request scheduling for attention data-parallel execution, balancing by both per-rank active token counts and request counts to improve fairness, throughput, and latency under uneven inputs.
  • Refactor

    • Public API updated to accept a list of active requests instead of a numeric count; callers must pass request lists accordingly.
  • Tests

    • Expanded unit and integration tests and fixtures to validate token-based balancing, capacity limits, heap ordering, and end-to-end scheduling.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option is always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline, ensuring that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to the L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
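For example, to launch the pipelines without fail-fast and restrict test stages to specific GPU types, a typical invocation combining the flags documented above is:

/bot run --disable-fail-fast --gpu-type "A30, H100_PCIe"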

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since skipping tests without careful validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since reusing a pipeline without careful validation can break the top of tree.

@Shunkangz Shunkangz requested a review from a team as a code owner August 23, 2025 14:15
@Shunkangz Shunkangz requested a review from HuiGao-NV August 23, 2025 14:15
@coderabbitai
Contributor

coderabbitai bot commented Aug 23, 2025

📝 Walkthrough


ExecutorRequestQueue.fetch_new_requests signature changed to accept a list of LlmRequest (activate_requests). Attention-DP paths now compute per-rank active token counts, allgather both request and token arrays, and propagate token-aware scheduling and balancing through fetch, schedule, and balance functions. Tests updated to exercise token-aware logic.
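As a rough illustration of the new fetch path, the sketch below shows how the per-rank loads could be gathered. Names such as tp_allgather and input_token_ids follow the walkthrough, while the standalone helper and the exact return shape of the allgather are assumptions, not the PR's literal code.

from typing import List, Sequence, Tuple

def gather_per_rank_loads(active_requests: Sequence,
                          tp_allgather) -> Tuple[List[int], List[int]]:
    """Sketch: collect (request count, token count) pairs from every attention-DP rank."""
    local_num_requests = len(active_requests)
    # Token load of this rank: total prompt length of its active requests.
    local_num_tokens = sum(len(r.input_token_ids) for r in active_requests)
    # Each rank contributes one [requests, tokens] pair; allgather returns one pair per rank.
    gathered = tp_allgather([local_num_requests, local_num_tokens])
    all_ranks_num_active_requests = [pair[0] for pair in gathered]
    all_ranks_num_active_tokens = [pair[1] for pair in gathered]
    return all_ranks_num_active_requests, all_ranks_num_active_tokens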

Changes

Cohort / File(s) Summary of edits
ExecutorRequestQueue (production)
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
Changed fetch_new_requests to accept activate_requests: List[LlmRequest]. Attention-DP fetch computes local token totals, uses dist.tp_allgather to produce all_ranks_num_active_requests and all_ranks_num_active_tokens. Updated _schedule_attention_dp_requests and _balance_requests_across_ranks signatures and logic to consume both arrays; heap init and updates now use per-rank token counts and update both request and token loads when assigning requests.
PyExecutor callsite (production)
tensorrt_llm/_torch/pyexecutor/py_executor.py
Updated _fetch_and_activate_new_requests to pass self.active_requests (list) to ExecutorRequestQueue.fetch_new_requests instead of len(self.active_requests).
Unit tests (token-aware scheduling)
tests/unittest/_torch/executor/test_executor_request_queue.py
Added all_ranks_num_active_tokens fixture. Updated multiple tests to accept and propagate token arrays into scheduling/balancing calls. Added unit tests for _balance_requests_across_ranks (empty, single, multiple, capacity, heap ordering, token-count sorting) and adjusted integration-style tests to initialize and pass per-rank token counts.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Caller
  participant ExecQueue as ExecutorRequestQueue
  participant Dist as Dist(tp_allgather)
  participant Scheduler as Attention-DP Scheduler
  participant Balancer

  Caller->>ExecQueue: fetch_new_requests(activate_requests)
  ExecQueue->>ExecQueue: compute local_num_tokens = sum(len(req.input_token_ids) for req in activate_requests)
  ExecQueue->>Dist: allgather([len(activate_requests), local_num_tokens])
  Dist-->>ExecQueue: all_ranks_num_active_requests, all_ranks_num_active_tokens
  ExecQueue->>Scheduler: _schedule_attention_dp_requests(activate_requests, all_ranks_num_active_requests, all_ranks_num_active_tokens)
  Scheduler->>Balancer: _balance_requests_across_ranks(requests, capacities, all_ranks_num_active_requests, all_ranks_num_active_tokens)
  Balancer-->>Scheduler: assignments (token-aware)
  Scheduler-->>ExecQueue: activation plan
  ExecQueue-->>Caller: activated requests
  note right of Balancer: New: heap init & updates use per-rank token counts
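To make the Balancer step concrete, here is a self-contained sketch of greedy, heap-based token balancing. It omits the capacity filtering and targeted-rank handling that the PR's _balance_requests_across_ranks performs, and the heap-entry layout is an assumption.

import heapq
from typing import Dict, List, Sequence

def balance_by_tokens(request_token_counts: Sequence[int],
                      all_ranks_num_active_tokens: Sequence[int],
                      all_ranks_num_active_requests: Sequence[int]) -> Dict[int, List[int]]:
    """Assign each new request (by index) to the rank with the lowest token load."""
    # Heap entries are (token_load, request_load, rank); tuple order is the tie-breaker.
    heap = [(tok, req, rank) for rank, (tok, req) in
            enumerate(zip(all_ranks_num_active_tokens, all_ranks_num_active_requests))]
    heapq.heapify(heap)
    assignments: Dict[int, List[int]] = {rank: [] for rank in range(len(heap))}
    # Place the largest requests first so they land on the emptiest ranks.
    for idx in sorted(range(len(request_token_counts)),
                      key=lambda i: request_token_counts[i], reverse=True):
        tok, req, rank = heapq.heappop(heap)
        assignments[rank].append(idx)
        heapq.heappush(heap, (tok + request_token_counts[idx], req + 1, rank))
    return assignments

# Example: balance_by_tokens([80, 20], [50, 60, 200], [1, 1, 2]) -> {0: [0], 1: [1], 2: []}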

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • HuiGao-NV
  • QiJune
  • pcastonguay
  • jinyangyuan-nvidia


@Shunkangz
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #16275 [ run ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (1)

516-530: Ensure empty check before heap operations.

The heap operations could fail if all_ranks_new_requests_heap is empty after filtering.

             heapq.heapify(all_ranks_new_requests_heap)
 
             # Sort by token count (descending) for better load balancing
             new_requests = sorted(
                 new_requests,
                 key=lambda x: len(getattr(x.request, 'input_token_ids', []))
                 if x.request else 0,
                 reverse=True)
 
             # Distribute requests across ranks
             for req_item in new_requests:
-
+                if not all_ranks_new_requests_heap:
+                    break
                 val = heapq.heappop(all_ranks_new_requests_heap)
                 token_count = len(
                     getattr(req_item.request, 'input_token_ids',
                             [])) if req_item.request else 0
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 3d54a1a and 1ce4a9a.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (8 hunks)
  • tests/unittest/_torch/executor/test_executor_request_queue.py (20 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
  • tests/unittest/_torch/executor/test_executor_request_queue.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
  • tests/unittest/_torch/executor/test_executor_request_queue.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (17)
tests/unittest/_torch/executor/test_executor_request_queue.py (11)

456-459: LGTM! Token tracking fixture added correctly.

The new all_ranks_num_active_tokens fixture properly complements the existing all_ranks_num_active_requests fixture and supports the token-aware scheduling feature being tested.
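A plausible shape for such a fixture, with purely illustrative values (the actual fixture in the test file may use different numbers):

import pytest

@pytest.fixture
def all_ranks_num_active_tokens():
    # One token total per attention-DP rank; four ranks assumed here.
    return [128, 64, 256, 0]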


484-500: Test parameter updated correctly for token-aware scheduling.

The test now properly passes the all_ranks_num_active_tokens parameter to _schedule_attention_dp_requests, aligning with the production code's new token-aware scheduling logic.


662-673: Test correctly validates token parameter propagation.

The test properly verifies that _balance_requests_across_ranks is called with both the request counts and token counts parameters.


869-962: Test cases comprehensively cover token-based distribution scenarios.

The parameterized test cases thoroughly validate the token-aware distribution logic across various scenarios including balanced distribution, capacity limits, targeted vs. relaxed requests, and mixed cases.


995-1001: Good integration test enhancement for token tracking.

The test correctly creates mock token counts and passes them through the scheduling workflow, ensuring proper integration of the token-aware scheduling feature.


1011-1028: Comprehensive test for empty requests scenario.

The test properly validates that _balance_requests_across_ranks handles empty request lists correctly while maintaining the per-rank structure.


1030-1053: Test validates single request assignment to lowest token count rank.

The test correctly verifies that a single request is assigned to the rank with the lowest active token count (rank 2), demonstrating proper heap-based load balancing.


1055-1098: Well-structured test for multi-request load balancing.

The test correctly validates that multiple requests are distributed based on token counts and current rank loads. The expected distribution pattern follows the heap-based algorithm correctly.


1100-1130: Good test for capacity limit enforcement.

The test properly validates that the balancing algorithm respects the expected_num_active_requests capacity limit for each rank.


1132-1175: Test validates heap ordering correctly.

The test properly verifies that requests are distributed based on heap ordering (lowest active count first) when token counts are equal.


1177-1217: Excellent test for token count sorting behavior.

The test correctly validates that requests are sorted by token count in descending order before distribution, ensuring proper load balancing across ranks.

tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (6)

331-342: LGTM! Token-aware scheduling implementation looks correct.

The implementation properly:

  1. Collects token counts from active requests
  2. Uses tp_allgather to distribute both request counts and token counts across ranks
  3. Unpacks and stores the gathered data correctly

355-357: Good integration of token counts into scheduling.

The token counts are properly passed to _schedule_attention_dp_requests for token-aware load balancing.


373-376: Method signature correctly updated for token-aware scheduling.

The additional all_ranks_num_active_tokens parameter enables token-based load balancing in the scheduling logic.


421-423: Token counts properly integrated into balancing logic.

The token counts are correctly passed to _balance_requests_across_ranks for improved load distribution.


478-482: Method signature updated for token-aware balancing.

The additional all_ranks_num_active_tokens parameter enables the balancing algorithm to consider token loads when distributing requests.


490-491: Good initialization of heap with token counts.

Using actual token counts instead of 0 for initial heap values provides better load balancing from the start.

@tensorrt-cicd
Collaborator

PR_Github #16275 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12234 completed with status: 'FAILURE'

@Shunkangz
Collaborator Author

/bot run --disable-fail-fast

Shunkang added 2 commits August 26, 2025 02:30
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
@Shunkangz
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #16501 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16502 [ run ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (2)

392-393: Sorting contradicts intent; non-relax requests should be prioritized.

Comment says “Prioritize the requests that are not in relax mode,” but reverse=True places relaxed (True) before non-relaxed (False). Drop reverse=True (or invert the key).

-        new_requests = sorted(new_requests, key=get_relax_value, reverse=True)
+        new_requests = sorted(new_requests, key=get_relax_value)
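A quick illustration of why dropping reverse=True gives the intended order: Python sorts False before True, so non-relaxed items come first.

# Sorting by a boolean "is relaxed" key places False (non-relaxed) ahead of True.
items = [("req_a", True), ("req_b", False), ("req_c", True)]
print(sorted(items, key=lambda x: x[1]))
# [('req_b', False), ('req_a', True), ('req_c', True)]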

401-407: Token-aware balancing should account for targeted assignments too.

When a request is scheduled to its target_dp_rank, you increment all_ranks_num_active_requests but not all_ranks_num_active_tokens. This skews the subsequent heap-based token balancing.

                 if target_dp_rank is not None and all_ranks_num_active_requests[
                         target_dp_rank] < self.max_num_active_requests:
                     all_ranks_num_active_requests[target_dp_rank] += 1
                     scheduled = True
                     all_ranks_new_requests[target_dp_rank].append(req_item)
+                    # Keep token load in sync for fair balancing later
+                    token_count = 0
+                    if req_item.request is not None:
+                        token_count = len(getattr(req_item.request, "input_token_ids", []))
+                    all_ranks_num_active_tokens[target_dp_rank] += token_count
♻️ Duplicate comments (3)
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (1)

305-312: Public API change: document and align naming.

The fetch_new_requests signature now accepts a List[LlmRequest]. Please:

  • Document this breaking change in release notes/migration guide, and update the method docstring to reflect the new parameter.
  • Prefer the parameter name active_requests (not activate_requests) for clarity/consistency across the codebase.

Suggested rename (non-functional, improves readability):

-    def fetch_new_requests(
-            self, activate_requests: List[LlmRequest]) -> List[LlmRequest]:
+    def fetch_new_requests(
+            self, active_requests: List[LlmRequest]) -> List[LlmRequest]:
@@
-        if self.enable_attention_dp:
-            return self._fetch_new_requests_attention_dp(activate_requests)
+        if self.enable_attention_dp:
+            return self._fetch_new_requests_attention_dp(active_requests)
-        else:
-            return self._fetch_new_requests_attention_tp(len(activate_requests))
+        else:
+            return self._fetch_new_requests_attention_tp(len(active_requests))
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1187-1189: Call site updated to pass active requests list — good.

Internal usage now matches the new API and unblocks ADP token balancing. Ensure any other call sites are updated.

Run to verify no remaining int-based calls:

#!/bin/bash
# Find fetch_new_requests call sites that are not passing a list named active_requests
rg -nP '\.fetch_new_requests\((?!\s*self\.active_requests\s*\))' --type py -g '!**/test/**'
tests/unittest/_torch/executor/test_executor_request_queue.py (1)

642-644: Fix the outdated arithmetic comment to match the implementation.

Use the exact formula used in code: ceil-div via (x + tp_size - 1) // tp_size. Prior review flagged this.

-    # 2 + 1 + 3 + 0 = 6, 6 + 2 = 8, (8 + 3) // 4 = 2, max(2, 2, 1, 3, 0) = 3
-    # expected_num_active_requests = max((6 + 2 + 3) // 4, 3) = max(2, 3) = 3
+    # 2 + 1 + 3 + 0 = 6, 6 + 2 = 8; tp_size = 4
+    # expected_num_active_requests = max(((6 + 2) + 4 - 1) // 4, max(2, 1, 3, 0))
+    # = max(11 // 4 = 2, 3) = 3
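For reference, the arithmetic the corrected comment describes, written out as a standalone (hypothetical) helper:

def expected_num_active_requests(all_ranks_num_active_requests, num_new_requests, tp_size):
    total = sum(all_ranks_num_active_requests) + num_new_requests
    ceil_avg = (total + tp_size - 1) // tp_size  # ceil-division
    return max(ceil_avg, max(all_ranks_num_active_requests))

# expected_num_active_requests([2, 1, 3, 0], 2, 4) == max(2, 3) == 3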
🧹 Nitpick comments (6)
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (2)

1-2: Add NVIDIA copyright header.

Per repository guidelines, prepend the NVIDIA copyright header to all .py sources.

Apply at file top:

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

476-534: Heap-based balancing: edge-case safety and maintainability nits.

  • If every rank is already at expected_num_active_requests, the filtered heap is empty; current logic only executes when new_requests is truthy, but an explicit early return after filtering would be clearer.
  • Micro-nit: use clearer tuple fields or a dataclass to self-document the heap item.

Example early guard:

             all_ranks_new_requests_heap = [
                 val for val in all_ranks_new_requests_heap
                 if val.num_requests < self.expected_num_active_requests
             ]
+            if not all_ranks_new_requests_heap:
+                return all_ranks_new_requests
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1-2: Add NVIDIA copyright header.

Please prepend the NVIDIA header to comply with coding guidelines.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
tests/unittest/_torch/executor/test_executor_request_queue.py (3)

1-2: Add NVIDIA copyright header in tests.

Tests are also Python sources; add the required header.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

661-665: Remove no-op statement in the test.

The standalone all_ranks_new_requests[0] has no effect; drop it to keep tests clean.

-        all_ranks_new_requests[0]

1003-1004: Avoid printing from tests.

Printing can add noise to CI logs. Prefer assertions only.

-    print("all_ranks_new_requests:", all_ranks_new_requests)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 1ce4a9a and 653162c.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (8 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (1 hunks)
  • tests/unittest/_torch/executor/test_executor_request_queue.py (20 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tests/unittest/_torch/executor/test_executor_request_queue.py
  • tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tests/unittest/_torch/executor/test_executor_request_queue.py
  • tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
🧬 Code graph analysis (2)
tests/unittest/_torch/executor/test_executor_request_queue.py (1)
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (3)
  • _schedule_attention_dp_requests (372-424)
  • _balance_requests_across_ranks (477-535)
  • RequestQueueItem (24-38)
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (5)
tensorrt_llm/_torch/pyexecutor/llm_request.py (3)
  • LlmRequest (282-424)
  • append (79-98)
  • append (125-142)
tests/unittest/_torch/executor/test_executor_request_queue.py (2)
  • all_ranks_num_active_requests (452-453)
  • all_ranks_num_active_tokens (457-458)
tensorrt_llm/serve/router.py (2)
  • num_active_tokens (139-140)
  • num_active_requests (142-143)
tensorrt_llm/_torch/distributed/communicator.py (1)
  • tp_allgather (138-139)
tensorrt_llm/mapping.py (1)
  • tp_rank (338-339)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tests/unittest/_torch/executor/test_executor_request_queue.py (2)

456-459: Fixture for per-rank token counts looks good.

This aligns tests with the new token-aware balancing path.


1011-1218: Strong coverage for heap/token-based balancing.

The new tests exercise empty, single, multiple, capacity-limit, heap-ordering, and token-sort scenarios. Nice.

@tensorrt-cicd
Collaborator

PR_Github #16502 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12395 completed with status: 'FAILURE'

Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
@Shunkangz
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #16543 [ run ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/unittest/_torch/executor/test_executor_request_queue.py (1)

657-663: Patch stub should preserve return type: use dict[int, list[RequestQueueItem]]

Mocking _balance_requests_across_ranks to return {0: req1} violates the function’s contract and may hide type errors. Return per-rank lists instead.

Apply:

-        mock_balance.return_value = {0: req1}
+        mock_balance.return_value = {0: [req1], 1: [], 2: [], 3: []}

Also applies to: 671-673

♻️ Duplicate comments (1)
tests/unittest/_torch/executor/test_executor_request_queue.py (1)

642-644: Fix the explanatory comment to match the formula (duplicate of prior feedback)

The second line is correct now, but the first line still shows an extra “2” inside the max(...) tuple. Suggest tightening the two-line explanation to mirror the code’s ceil-div and max operations precisely.

Apply:

-    # 2 + 1 + 3 + 0 = 6, 6 + 2 = 8, (8 + 3) // 4 = 2, max(2, 2, 1, 3, 0) = 3
-    # expected_num_active_requests = max((6 + 2 + 3) // 4, 3) = max(2, 3) = 3
+    # total_active = 2 + 1 + 3 + 0 = 6
+    # num_new = 2, tp_size = 4 -> ceil((6 + 2) / 4) = (8 + 4 - 1) // 4 = 11 // 4 = 2
+    # max_active_per_rank = max(2, 1, 3, 0) = 3
+    # expected_num_active_requests = max(2, 3) = 3
🧹 Nitpick comments (2)
tests/unittest/_torch/executor/test_executor_request_queue.py (2)

1004-1004: Remove debug print in tests

Avoid stdout noise in test runs; assertions already validate behavior.

Apply:

-    print("all_ranks_new_requests:", all_ranks_new_requests)

1100-1130: Capacity limit test is valuable, but it validates ‘expected’ capacity, not max per-rank capacity

This test ensures ranks don’t exceed expected_num_active_requests. Add a complementary test to ensure scheduling never exceeds max_num_active_requests when expected_num_active_requests > max_num_active_requests.

Proposed additional test:

def test_balance_requests_respects_max_capacity_even_if_expected_higher(attention_dp_queue):
    # Simulate expected > max per-rank capacity
    attention_dp_queue.max_num_active_requests = 2
    attention_dp_queue.expected_num_active_requests = 5  # deliberately higher
    all_ranks_num_active_requests = [2, 0, 0, 0]  # rank 0 already at max
    all_ranks_num_active_tokens = [0, 0, 0, 0]
    # 3 new relaxed requests (no target)
    reqs = []
    for i in range(3):
        r = RequestQueueItem(i, create_mock_request_with_py_schedule_params(attention_dp_rank=None))
        r.request.input_token_ids = [1, 2, 3]
        reqs.append(r)

    # Balance
    all_ranks_new_requests = {0: [], 1: [], 2: [], 3: []}
    result = attention_dp_queue._balance_requests_across_ranks(
        reqs, all_ranks_new_requests, all_ranks_num_active_requests, all_ranks_num_active_tokens
    )

    # Rank 0 must stay at capacity; others can receive at most 2 total per rank
    assert len(result[0]) == 0
    assert len(result[1]) <= 2
    assert len(result[2]) <= 2
    assert len(result[3]) <= 2

Note: This may currently fail if _balance_requests_across_ranks does not enforce max_num_active_requests. If so, production code should cap per-rank assignments using max_num_active_requests while balancing.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 653162c and 693022b.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (8 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (1 hunks)
  • tests/unittest/_torch/executor/test_executor_request_queue.py (20 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/unittest/_torch/executor/test_executor_request_queue.py (1)
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (3)
  • _schedule_attention_dp_requests (372-424)
  • _balance_requests_across_ranks (477-535)
  • RequestQueueItem (24-38)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (10)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1188-1189: fetch_new_requests signature and usage verified

  • The method fetch_new_requests(self, activate_requests: List[LlmRequest]) -> List[LlmRequest] is defined in
    tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (around line 305).
  • The only call site in
    tensorrt_llm/_torch/pyexecutor/py_executor.py (lines 1187–1188) passes self.active_requests directly.
  • No instances of fetch_new_requests(len(...)) remain in the codebase.

All call sites and the definition align with the updated queue API and correctly pass the full request list.

tests/unittest/_torch/executor/test_executor_request_queue.py (9)

456-459: LGTM: token-count fixture introduced

The all_ranks_num_active_tokens fixture establishes the per-rank token surface needed for token-aware balancing. Naming and values are clear.


483-501: Good propagation: tests now pass tokens into _schedule_attention_dp_requests

Adding the fourth argument (all_ranks_num_active_tokens) validates the new scheduling contract. This ensures tests cover token-driven behavior rather than only request counts.


510-527: Broad coverage for token-aware paths

Multiple tests were updated to pass all_ranks_num_active_tokens into scheduling. This is the right breadth to catch regressions. Nice work.

Also applies to: 536-552, 559-576, 579-606, 614-620, 624-640, 676-694, 696-724, 726-751


995-1002: Seeding deterministic token loads in integration helper is solid

Creating a simple increasing token baseline per rank makes the distribution predictable while remaining realistic. Consider parameterizing this in future if more scenarios are needed.


1011-1028: _balance_requests_across_ranks: empty input test is accurate and necessary

Covers the no-op path and verifies structure preservation across ranks.


1030-1053: Single-request placement test looks good

Validates that the lowest-load rank is selected under the heap policy.


1055-1098: Multiple-requests distribution: expectations match heap + token-desc ordering

The step-by-step expectation aligns with “largest request to lowest token rank first.” Good guidance in comments and correct assertions.
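As an illustrative (not test-exact) walk-through: with per-rank token loads [0, 50, 100] and new requests of 60, 30, and 10 tokens, the 60-token request goes to the rank at 0 tokens (now 60), the 30-token request to the rank at 50 (now 80), and the 10-token request back to the first rank (now 70), which is the "largest request to lowest token rank first" behaviour the test asserts.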


1132-1175: Heap ordering test is precise

Confirms tiebreakers favor ranks with fewer active requests (then rank id), which matches the namedtuple/heap semantics.
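A tiny demonstration of that tuple-ordering behaviour (the field names are assumed; the PR's heap entry may differ):

import heapq
from collections import namedtuple

RankLoad = namedtuple("RankLoad", ["num_tokens", "num_requests", "rank"])
heap = [RankLoad(100, 2, 0), RankLoad(100, 1, 1), RankLoad(50, 3, 2)]
heapq.heapify(heap)
print(heapq.heappop(heap))  # RankLoad(50, 3, 2): fewest tokens wins outright
print(heapq.heappop(heap))  # RankLoad(100, 1, 1): equal tokens, fewer requests breaks the tie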


1177-1218: Token-desc sorting test validates the key heuristic

Ensures requests are sorted by token count before distribution, then assigned to the least-loaded ranks; assertions are spot on.

@tensorrt-cicd
Collaborator

PR_Github #16543 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12422 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

Collaborator

@byshiue byshiue left a comment

LGTM

@byshiue byshiue merged commit ff40474 into NVIDIA:main Aug 27, 2025
5 checks passed
 @nvtx_range("_fetch_new_requests")
-def fetch_new_requests(self, num_active_requests: int) -> List[LlmRequest]:
+def fetch_new_requests(
+        self, activate_requests: List[LlmRequest]) -> List[LlmRequest]:
Collaborator


What does activate_requests mean? Should that be active_requests?

