[None] [feat] Update disagg gen-only benchmark. #7917
Conversation
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
/bot run
📝 Walkthrough

Adds a benchmark-aware forward-gating mechanism in `_executor_loop_overlap`: removes an initial blocking wait, introduces a `can_forward` flag, and delays scheduling until request queues reach benchmark thresholds, with attention data-parallel (DP)-specific allgather coordination. Warmup handling remains unchanged; the new gating runs before normal scheduling proceeds.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor U as User/Benchmark Driver
    participant P as PyExecutor
    participant Q as RequestQueue
    participant W as World (Ranks)
    note over P: _executor_loop_overlap start (benchmark mode)
    U->>Q: Enqueue benchmark requests
    P->>P: Initialize can_forward (False if benchmark+KV transceiver)
    alt is_warmup
        P->>P: Warmup handling (unchanged)
    else not warmup and can_forward == False
        alt attention_dp enabled
            P->>Q: Inspect fetched/scheduled requests
            P->>W: allgather(local_can_forward)
            alt all ranks can_forward
                P->>P: sleep(30s)
                P->>P: can_forward = True
            else waiting
                Note right of P: rank0 logs progress
                P->>P: sleep(10s)
                P-->>P: continue loop
            end
        else attention_dp disabled
            P->>Q: Check generation requests count
            alt count < benchmark_req_queues_size
                Note right of P: rank0 logs waiting
                P->>P: sleep(10s)
                P-->>P: continue loop
            else ready
                P->>P: can_forward = True
            end
        end
    end
    opt can_forward == True
        P->>Q: Normal scheduling/processing
    end
```
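A minimal Python sketch of the gating flow above, reconstructed from the walkthrough and diagram rather than copied from py_executor.py; names like `num_fetch_requests` and the threshold come from the review below, while the function shape and the `allgather` callable are assumptions:

```python
import time

def wait_until_ready(num_fetch_requests, scheduled_gen_count, threshold,
                     enable_attention_dp, allgather, rank):
    """Sketch of the readiness gate described in the walkthrough.

    `allgather` stands in for the distributed collective used across
    attention-DP ranks; all names here are reconstructed assumptions,
    not the actual executor code.
    """
    if enable_attention_dp:
        # Each rank reports whether it has enough work queued locally.
        local_can_forward = (
            num_fetch_requests + scheduled_gen_count >= threshold)
        all_can_forward = allgather(local_can_forward)
        if all(all_can_forward):
            time.sleep(30)  # let in-flight KV-cache transfers settle
            return True
        if rank == 0:
            print(f"waiting: fetched={num_fetch_requests}, "
                  f"scheduled={scheduled_gen_count}")
        time.sleep(10)
        return False  # caller continues the loop and retries
    # Non-attention-DP path: only the scheduled generation count is checked.
    if scheduled_gen_count < threshold:
        time.sleep(10)
        return False
    return True
```

The executor would call this each iteration until it returns `True`, after which normal scheduling proceeds.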
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 1
🧹 Nitpick comments (3)
tensorrt_llm/_torch/pyexecutor/py_executor.py (3)
1164-1172: Make sleeps interruptible and configurable; avoid a 30-second uninterruptible stall.

The hard-coded `time.sleep(30)` blocks shutdown and delays responsiveness. Use `shutdown_event.wait()` and env-configurable durations. Apply:
```diff
-        if all(all_can_forward):
-            time.sleep(30)
-            can_forward = True
+        if all(all_can_forward):
+            ready_sleep_s = int(
+                os.getenv("TLLM_BENCHMARK_READY_SLEEP_SEC", "30"))
+            if self.shutdown_event.wait(ready_sleep_s):
+                break
+            can_forward = True
         else:
-            if self.dist.rank == 0:
-                logger.info(
-                    f"sleep 10 seconds, num_fetched_requests: {self.executor_request_queue.num_fetch_requests}, scheduled_gen_batch: {len(scheduled_batch.generation_requests)}"
-                )
-            time.sleep(10)
+            wait_sleep_s = int(
+                os.getenv("TLLM_BENCHMARK_WAIT_SLEEP_SEC", "10"))
+            if self.dist.rank == 0:
+                logger.info(
+                    f"sleep {wait_sleep_s} seconds, "
+                    f"num_fetched_requests: {self.executor_request_queue.num_fetch_requests}, "
+                    f"scheduled_gen_batch: {len(scheduled_batch.generation_requests)}, "
+                    f"threshold: {self.benchmark_req_queues_size}"
+                )
+            if self.shutdown_event.wait(wait_sleep_s):
+                break
             continue
```
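The core of this suggestion is that `threading.Event.wait(timeout)` returns early (with `True`) the moment the event is set, so a shutdown can interrupt the pause, while `time.sleep` cannot. A minimal standalone illustration, independent of the PR code:

```python
import threading
import time

shutdown_event = threading.Event()

def pause_or_shutdown():
    # Blocks for up to 30s, but wakes immediately if shutdown is requested.
    if shutdown_event.wait(timeout=30):
        print("woken early by shutdown request")
    else:
        print("full 30s elapsed with no shutdown")

t = threading.Thread(target=pause_or_shutdown)
t.start()
time.sleep(1)
shutdown_event.set()  # worker wakes after ~1s instead of 30s
t.join()
```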
1174-1183: Fix the incorrect log metric and make the wait interruptible in the non-attention-DP path.

The logged `num_fetched_requests` uses `len(scheduled_batch.generation_requests)` instead of the fetched metric; also use `shutdown_event.wait()`.
num_fetched_requestsuseslen(scheduled_batch.generation_requests)instead of the fetched metric; also useshutdown_event.wait().- if len(scheduled_batch.generation_requests - ) < self.benchmark_req_queues_size: - if self.dist.rank == 0: - logger.info( - f"sleep 10 seconds, num_fetched_requests: {len(scheduled_batch.generation_requests)}, scheduled_gen_batch: {len(scheduled_batch.generation_requests)}" - ) - time.sleep(10) + if len(scheduled_batch.generation_requests) < self.benchmark_req_queues_size: + wait_sleep_s = int(os.getenv("TLLM_BENCHMARK_WAIT_SLEEP_SEC", "10")) + if self.dist.rank == 0: + logger.info( + f"sleep {wait_sleep_s} seconds, " + f"num_fetched_requests: {self.executor_request_queue.num_fetch_requests}, " + f"scheduled_gen_batch: {len(scheduled_batch.generation_requests)}, " + f"threshold: {self.benchmark_req_queues_size}" + ) + if self.shutdown_event.wait(wait_sleep_s): + break continue
1159-1172: Prefer progress-based readiness over a fixed 30s delay (avoid over/under-waiting).

Sleeping a fixed 30s after consensus may be too long or too short depending on KV transfer size and link bandwidth. Consider polling transfer status (e.g., `any(req.is_disagg_generation_transmission_in_progress)`) with a bounded wait so the loop proceeds as soon as transfers complete.
I can draft a bounded poll loop using `shutdown_event.wait(wait_sleep_s)` and a max-polls guard if you want.
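Such a bounded poll loop might look like the following sketch. The `is_disagg_generation_transmission_in_progress` attribute is the one named in the comment above; the function and parameter names are hypothetical:

```python
def wait_for_kv_transfers(generation_requests, shutdown_event,
                          wait_sleep_s=1.0, max_polls=30):
    """Hypothetical bounded poll: return as soon as KV transfers finish.

    Sketch only; the request attribute follows the review comment, and
    everything else is illustrative rather than taken from the source.
    """
    for _ in range(max_polls):
        if not any(req.is_disagg_generation_transmission_in_progress
                   for req in generation_requests):
            return True  # all transfers complete; proceed immediately
        # Interruptible wait: wait() returns True if shutdown was requested.
        if shutdown_event.wait(wait_sleep_s):
            return False
    return True  # bound reached: proceed anyway rather than stall forever
```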
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic (see the sketch after this file list).
Files:
tensorrt_llm/_torch/pyexecutor/py_executor.py
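A small illustration of the try/except guidelines above (minimal try body, `else` for the main logic, most-specific exception); the function and names are purely illustrative:

```python
def first_line(path):
    # Keep the try body minimal: only the call that can fail goes inside.
    try:
        f = open(path, encoding="utf-8")
    except FileNotFoundError:  # catch the most specific exception possible
        return None
    else:
        # Main logic lives in else, so its own errors are not swallowed.
        with f:
            return f.readline().rstrip("\n")
```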
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
tensorrt_llm/_torch/pyexecutor/py_executor.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (3)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
- is_warmup (511-512)
- is_warmup (515-518)

tensorrt_llm/_torch/distributed/communicator.py (3)
- allgather (101-102)
- allgather (336-337)
- rank (29-30)

tensorrt_llm/mapping.py (2)
- rank (328-329)
- rank (332-339)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
1155-1161: Double-counting risk in readiness predicate; please confirm intent.
`local_can_forward = executor_request_queue.num_fetch_requests + len(scheduled_batch.generation_requests) >= threshold`. If `num_fetch_requests` is already a global total, adding the local scheduled count may overcount. The non-attention-DP path only uses the scheduled count, which is inconsistent. Would you prefer to:
- Compare only `num_fetch_requests` (global) to the threshold, or
- Allgather the local scheduled count and sum across attention-DP ranks?
I can provide a patch once you confirm the intended semantics.
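For the second option, the predicate might look like the following sketch. The `allgather` call follows the communicator method referenced in the code graph above; the exact call site and attribute names are assumptions:

```python
# Sketch of option 2: allgather each rank's local scheduled count and sum,
# so every attention-DP rank gates on the same global quantity.
local_scheduled = len(scheduled_batch.generation_requests)
all_scheduled = self.dist.allgather(local_scheduled)  # one count per rank
local_can_forward = sum(all_scheduled) >= self.benchmark_req_queues_size
```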
PR_Github #19636 [ run ] triggered by Bot

/bot run

PR_Github #19650 [ run ] triggered by Bot

PR_Github #19636 [ run ] completed with state

PR_Github #19650 [ run ] completed with state

/bot reuse-pipeline

PR_Github #19732 [ reuse-pipeline ] triggered by Bot

PR_Github #19732 [ reuse-pipeline ] completed with state
Force-pushed from 55b9b26 to abb209c
/bot run

PR_Github #19934 [ run ] triggered by Bot

PR_Github #19934 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #20009 [ run ] triggered by Bot

/bot run --disable-fail-fast

PR_Github #20040 [ run ] triggered by Bot

PR_Github #20009 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #20074 [ run ] triggered by Bot

PR_Github #20040 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #20096 [ run ] triggered by Bot

PR_Github #20074 [ run ] completed with state

PR_Github #20096 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #20132 [ run ] triggered by Bot

PR_Github #20132 [ run ] completed with state
Sleep after kv-cache transfer to avoid using multiple ctx instances.