KEMBAR78
doc: Refactor documents and examples of disaggregated serving and wide ep by kaiyux · Pull Request #6054 · NVIDIA/TensorRT-LLM · GitHub
Skip to content

Conversation

@kaiyux
Copy link
Member

@kaiyux kaiyux commented Jul 15, 2025

This PR:

  1. Unifies all disagg related slurm scripts to examples/disaggregated/slurm, including the scripts that were added in examples/wide_ep/slurm_scripts.
  2. Refines documentation.

Summary by CodeRabbit

  • New Features

    • Added a new Slurm job submission script for disaggregated Torch setups, supporting parameterized job configuration.
  • Bug Fixes

    • Corrected and standardized anchor links in blog documentation for improved navigation.
    • Fixed script path references in documentation for accurate script locations.
  • Documentation

    • Updated and streamlined multiple README files for clarity, consistency, and easier navigation.
    • Added references to new and relocated scripts, and improved example commands and section formatting.
    • Added guidance to verify SLURM parameters before running jobs in disaggregated serving workflows.
  • Chores

    • Removed several outdated or redundant benchmark, configuration, and utility scripts related to SLURM and disaggregated serving.
    • Updated SLURM batch script parameters to increase job time limits, enable streaming, and support multiple rounds.
    • Adjusted configuration defaults for cache transceiver settings and resource calculations in scripts.
    • Refactored SLURM job submission scripts to improve parameterization and task/node calculations.

@kaiyux kaiyux marked this pull request as draft July 15, 2025 10:14
@kaiyux kaiyux changed the title doc: Refactor disaggregated serving documents and examples doc: Refactor documents and examples of disaggregated serving and wide ep Jul 15, 2025
@kaiyux kaiyux marked this pull request as ready for review July 15, 2025 15:23
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors and consolidates SLURM scripts and accompanying documentation for both wide EP and disaggregated serving examples, updating parameter calculations, streamlining READMEs, and removing outdated script references.

  • Use a shared ntasks_per_node variable in wide_ep/submit.sh to calculate total tasks
  • Simplify the wide EP README to point to relocated disaggregated scripts
  • Introduce new examples/disaggregated/slurm scripts and adjust defaults (time, multi_round, streaming)
  • Refresh the disaggregated serving README and purge obsolete docs under docs/source/scripts/disaggregated

Reviewed Changes

Copilot reviewed 11 out of 16 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
examples/wide_ep/slurm_scripts/submit.sh Compute total tasks via ntasks_per_node instead of hard-coded
examples/wide_ep/slurm_scripts/README.md Trim core scripts list and update usage to refer to disaggregated
examples/disaggregated/slurm/submit.sh Add standalone SLURM submit template for disaggregated benchmarks
examples/disaggregated/slurm/disaggr_torch.slurm Increase time, rounds, enable streaming, and expose nsys_on
examples/disaggregated/README.md Rename and reformat disaggregated serving instructions
Comments suppressed due to low confidence (4)

examples/wide_ep/slurm_scripts/README.md:21

  • The core scripts list has been reduced to only two items but still uses numbered bullets, which could confuse readers. Consider restoring the full list or renumbering to reflect only the remaining scripts.
2. **`process_gen_iterlog.py`** - Processes benchmark results and generates reports

examples/disaggregated/slurm/disaggr_torch.slurm:19

  • [nitpick] Since the default for streaming was changed to true, it would be helpful to add a brief comment explaining the impact of this flag on job behavior.
streaming=true

examples/disaggregated/README.md:1

  • [nitpick] The title was generalized to “Disaggregated Serving,” but elsewhere the project is called TensorRT-LLM. Consider renaming the heading to “TensorRT-LLM Disaggregated Serving” for consistency.
# Disaggregated Serving

Copy link
Collaborator

@pcastonguay pcastonguay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kaiyux kaiyux force-pushed the user/kaiyu/polish_disagg_doc branch from 5168e43 to 6ded898 Compare July 16, 2025 01:44
@kaiyux
Copy link
Member Author

kaiyux commented Jul 16, 2025

@kaiyux
Copy link
Member Author

kaiyux commented Jul 16, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #12013 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #12013 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #8920 completed with status: 'FAILURE'

@kaiyux kaiyux force-pushed the user/kaiyu/polish_disagg_doc branch from 619b993 to 82cb10d Compare July 16, 2025 09:57
@kaiyux
Copy link
Member Author

kaiyux commented Jul 16, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #12083 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #12083 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8974 completed with status: 'FAILURE'

kaiyux added 7 commits July 21, 2025 01:12
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
@kaiyux kaiyux force-pushed the user/kaiyu/polish_disagg_doc branch from 82cb10d to c486ce3 Compare July 21, 2025 08:13
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jul 21, 2025

Walkthrough

This update removes legacy SLURM scripts and related configuration generators for disaggregated serving, migrating all cluster launch and benchmarking scripts to the examples/disaggregated/slurm directory. Documentation is updated for clarity, with README files refocused to point to the new script locations and improved for consistency and readability.

Changes

File(s) Change Summary
docs/source/scripts/disaggregated/*.slurm, *.py, *.sh Deleted all legacy SLURM batch scripts, YAML generators, benchmarking, and worker launch scripts for disaggregated serving.
examples/disaggregated/slurm/disaggr_torch.slurm Increased job time limit, set multi_round to 10, enabled streaming, added nsys_on variable, consolidated mount options.
examples/disaggregated/slurm/gen_yaml.py Updated cache_transceiver_config to specify backend and increased token buffer size.
examples/disaggregated/slurm/slurm_populate_urls.py Deleted SLURM node/YAML population script.
examples/disaggregated/slurm/submit.sh Added new SLURM job submission script, parameterized for 8 nodes and 4 GPUs per node.
examples/disaggregated/README.md Improved clarity, formatting, and terminology; added SLURM launch section.
examples/wide_ep/slurm_scripts/README.md Simplified and refocused documentation, pointing users to main SLURM scripts elsewhere.
examples/wide_ep/slurm_scripts/submit.sh Refactored to use variables for tasks per node and total tasks; updated sbatch arguments.
docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md Standardized TOC anchor links and corrected script paths.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant submit.sh
    participant Slurm
    participant disaggr_torch.slurm
    participant gen_yaml.py

    User->>submit.sh: Run SLURM job submission script
    submit.sh->>Slurm: sbatch disaggr_torch.slurm with params
    Slurm->>disaggr_torch.slurm: Launch batch job
    disaggr_torch.slurm->>gen_yaml.py: Generate YAML config
    disaggr_torch.slurm->>Slurm: Start worker/server processes
    disaggr_torch.slurm->>User: Job logs/results
Loading

Estimated code review effort

4 (60–120 minutes)
This review involves verifying the removal of multiple legacy scripts, ensuring new scripts and documentation are correct and consistent, and validating parameterization and configuration logic in SLURM scripts.

Poem

A rabbit hops through scripts anew,
Old clusters gone, the docs askew—
With SLURM now neat and README bright,
The servers launch in streams of light.
Eight nodes strong, four GPUs each,
The future's near, within our reach!
🐇✨


📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9e82852 and d6b1d64.

📒 Files selected for processing (1)
  • examples/disaggregated/README.md (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/disaggregated/README.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🔭 Outside diff range comments (1)
examples/disaggregated/README.md (1)

108-109: Typo: “refersh_interval” → “refresh_interval”.

-refersh_interval: 10.0
+refresh_interval: 10.0
♻️ Duplicate comments (2)
examples/disaggregated/slurm/submit.sh (1)

1-2: Add error handling to improve script robustness.

The script would benefit from adding set -euo pipefail after the shebang to ensure it exits on errors, undefined variables, or pipeline failures.

 #!/bin/bash
+set -euo pipefail
 
examples/wide_ep/slurm_scripts/submit.sh (1)

9-9: Add documentation for the magic number.

The number 5500 should be documented as it represents the maximum number of requests each DEP4 context server can handle before sending them to generation servers for 1k-1k benchmarking.

-    ctx_num=$(((concurrency + 5499)/5500))
+    # 5500 is the max requests per DEP4 context server for 1k-1k benchmarking
+    ctx_num=$(((concurrency + 5499)/5500))
🧹 Nitpick comments (8)
examples/disaggregated/slurm/submit.sh (1)

27-32: Consider improving readability of the long command.

The sbatch command with 14 positional arguments is quite long and could benefit from line breaks for better readability.

-sbatch --nodes=${total_node_num} \
-    --ntasks=${ntasks} \
-    --ntasks-per-node=${ntasks_per_node} \
-    --segment=${total_node_num} \
-    disaggr_torch.slurm \
-        ${ctx_num} 4 4 4480 true 1 8 1024 1024 true "0.8" 0 0 "$concurrency"
+sbatch --nodes=${total_node_num} \
+    --ntasks=${ntasks} \
+    --ntasks-per-node=${ntasks_per_node} \
+    --segment=${total_node_num} \
+    disaggr_torch.slurm \
+        ${ctx_num} \
+        4 \
+        4 \
+        4480 \
+        true \
+        1 \
+        8 \
+        1024 \
+        1024 \
+        true \
+        "0.8" \
+        0 \
+        0 \
+        "$concurrency"
examples/wide_ep/slurm_scripts/submit.sh (1)

27-33: Consider refactoring single-iteration loop.

The static analysis tool correctly identifies that this loop only runs once. Consider whether this should be a loop or a simple command.

If this is intended to be a single execution, you could simplify it:

-# dep32 eplb288
-for b in 512; do
-    concurrency=$((b * 32))
-    ctx_num=$(((concurrency + 5499)/5500))
-    total_node_num=$((ctx_num + 8))
-    ntasks=$((total_node_num * ntasks_per_node))
-    sbatch --nodes=${total_node_num} --ntasks=${ntasks} --ntasks-per-node=${ntasks_per_node} --segment=${total_node_num} disaggr_torch.slurm ${ctx_num} 4 4 4480 true 1 32 1024 1024 true "0.7" 288 "$mtp_size" "$concurrency"
-done
+# dep32 eplb288
+b=512
+concurrency=$((b * 32))
+ctx_num=$(((concurrency + 5499)/5500))
+total_node_num=$((ctx_num + 8))
+ntasks=$((total_node_num * ntasks_per_node))
+sbatch --nodes=${total_node_num} --ntasks=${ntasks} --ntasks-per-node=${ntasks_per_node} --segment=${total_node_num} disaggr_torch.slurm ${ctx_num} 4 4 4480 true 1 32 1024 1024 true "0.7" 288 "$mtp_size" "$concurrency"
examples/disaggregated/README.md (5)

9-13: Add explicit language identifier to fenced block.

markdownlint flags MD040 – fenced code blocks should declare a language.
Mark this snippet as YAML for proper highlighting/lint‐compliance.

-```
+```yaml
 cache_transceiver_config:
   backend: <str>
   max_tokens_in_buffer: <int>

22-32: Label shell snippet and silence MD040.

All three server-launch commands sit in a single fenced block without a language tag.
Add bash so docs render correctly and pass lint.

-```
+```bash
 echo -e "disable_overlap_scheduler: True\ncache_transceiver_config:\n  backend: UCX\n  max_tokens_in_buffer: 2048" > context_extra-llm-api-config.yml
 ...
 CUDA_VISIBLE_DEVICES=2 trtllm-serve ...

38-56: Missing YAML language tag in disagg_config.yaml example.

Same MD040 issue; add yaml for consistency/readability.

-```
+```yaml
 hostname: localhost
 ...

68-76: Add bash tag to cURL example.

-``` 
+```bash
 curl http://localhost:8000/v1/completions \
 ...

170-172: Heading typo – “Know Issues” should be “Known Issues”.

Minor doc polish.

-## Know Issues
+## Known Issues
examples/wide_ep/slurm_scripts/README.md (1)

38-40: Inline comment risks breaking copy-paste.

The two comment lines inside the fenced bash block will be executed if copied verbatim.
Consider moving them outside the code block or prefixing with # inside a dedicated explanatory snippet.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3cbc23f and c486ce3.

📒 Files selected for processing (13)
  • docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md (2 hunks)
  • docs/source/scripts/disaggregated/disaggr_torch.slurm (0 hunks)
  • docs/source/scripts/disaggregated/gen_yaml.py (0 hunks)
  • docs/source/scripts/disaggregated/run_benchmark.sh (0 hunks)
  • docs/source/scripts/disaggregated/start_worker.sh (0 hunks)
  • docs/source/scripts/disaggregated/submit.sh (0 hunks)
  • examples/disaggregated/README.md (3 hunks)
  • examples/disaggregated/slurm/disaggr_torch.slurm (1 hunks)
  • examples/disaggregated/slurm/gen_yaml.py (2 hunks)
  • examples/disaggregated/slurm/slurm_populate_urls.py (0 hunks)
  • examples/disaggregated/slurm/submit.sh (1 hunks)
  • examples/wide_ep/slurm_scripts/README.md (2 hunks)
  • examples/wide_ep/slurm_scripts/submit.sh (1 hunks)
💤 Files with no reviewable changes (6)
  • docs/source/scripts/disaggregated/start_worker.sh
  • docs/source/scripts/disaggregated/submit.sh
  • examples/disaggregated/slurm/slurm_populate_urls.py
  • docs/source/scripts/disaggregated/run_benchmark.sh
  • docs/source/scripts/disaggregated/disaggr_torch.slurm
  • docs/source/scripts/disaggregated/gen_yaml.py
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
examples/disaggregated/README.md

67-67: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Shellcheck (0.10.0)
examples/wide_ep/slurm_scripts/submit.sh

[warning] 27-27: This loop will only ever run once. Bad quoting or missing glob/expansion?

(SC2043)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (11)
examples/disaggregated/slurm/gen_yaml.py (2)

184-187: Configuration update looks good.

The cache transceiver configuration has been updated to use the new API with explicit backend specification and increased buffer size. The change from max_num_tokens to max_tokens_in_buffer and the increase from 4608 to 8320 tokens should improve performance.


206-209: Consistent configuration applied to generation servers.

The same cache transceiver configuration updates are correctly applied to both context and generation servers, maintaining consistency.

docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md (2)

280-280: Documentation link updated correctly.

The link has been properly updated to reflect the new location of the disaggregated serving scripts, which aligns with the consolidation efforts described in the PR objectives.


5-25: Table of contents standardized.

The anchor links have been properly standardized to lowercase and hyphenated format, improving markdown consistency and compatibility.

examples/disaggregated/slurm/disaggr_torch.slurm (4)

7-7: Time limit increase is appropriate.

Increasing the job time limit from 1 to 2 hours makes sense given the change to multiple rounds of testing.


12-12: Multiple rounds will provide better performance data.

Changing from single round to 10 rounds will give more comprehensive benchmarking results.


19-19: Streaming mode enabled for realistic testing.

Enabling streaming mode better reflects real-world usage scenarios.


102-113: Script dependencies verified

All referenced helper scripts (start_worker.sh, start_server.sh, run_benchmark.sh) are present in examples/disaggregated/slurm/. No changes are needed.

examples/wide_ep/slurm_scripts/submit.sh (1)

4-4: Good parameterization improvements.

The introduction of ntasks_per_node and calculated values for total_node_num and ntasks makes the script more maintainable and clearer about resource allocation.

Also applies to: 10-12

examples/disaggregated/README.md (1)

15-17: Verify spelling of backend option “NIXL”.

Common back-ends are UCX, NCCL, MPI, etc. “NIXL” looks suspicious – please confirm it is intentional and supported throughout the codebase / CLI help.

examples/wide_ep/slurm_scripts/README.md (1)

20-23: Clarify relocation note and check link accuracy.

Good to point users to the consolidated scripts, but please ensure
examples/disaggregated/slurm/submit.sh (and disaggr_torch.slurm) actually exist after the move; stale paths will confuse users.

kaiyux added 5 commits July 22, 2025 00:23
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
@kaiyux
Copy link
Member Author

kaiyux commented Jul 22, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #12545 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #12545 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9327 completed with status: 'FAILURE'

@kaiyux
Copy link
Member Author

kaiyux commented Jul 22, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #12564 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #12564 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9344 completed with status: 'SUCCESS'

@kaiyux kaiyux merged commit f08286c into NVIDIA:main Jul 23, 2025
3 checks passed
@kaiyux kaiyux deleted the user/kaiyu/polish_disagg_doc branch July 23, 2025 01:21
NVShreyas pushed a commit to NVShreyas/TensorRT-LLM that referenced this pull request Jul 28, 2025
…e ep (NVIDIA#6054)

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>
Ransiki pushed a commit to Ransiki/TensorRT-LLM that referenced this pull request Jul 29, 2025
…e ep (NVIDIA#6054)

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Ransiki Zhang <ransikiz@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants