[None] [doc] Add more documents for large scale EP by kaiyux · Pull Request #7029 · NVIDIA/TensorRT-LLM · GitHub

@kaiyux kaiyux commented Aug 19, 2025

Summary by CodeRabbit

  • Documentation
    • Added Wide Expert Parallelism guidance and YAML examples showing WIDEEP usage, increased token capacity, and online load balancer tuning.
    • Added Prerequisites section listing supported GPUs, OS/CUDA, Docker/NVIDIA toolkit and evaluation setup notes.
    • Reorganized load balancer docs into Online vs Offline sections; added offline reference and note recommending online for production.
    • Expanded Troubleshooting with NUMA binding and shared-memory cleanup guidance; improved References layout and added pointer to slurm_scripts.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option is always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
@kaiyux kaiyux requested a review from a team as a code owner August 19, 2025 08:11
@kaiyux kaiyux requested review from chzblych and nv-guomingz August 19, 2025 08:11

coderabbitai bot commented Aug 19, 2025

Walkthrough

Adds Wide Expert Parallelism (WIDEEP) configuration examples and YAML snippet to docs, restructures the wide-EP Load Balancer doc into Online and Offline sections, expands troubleshooting (GB200 NUMA binding and EPLB shared-memory cleanup), updates references and usage links, and adds slurm_scripts to examples list.

Changes

| Cohort / File(s) | Summary |
|------------------|---------|
| Docs — Wide-EP README (`examples/wide_ep/README.md`) | Reorganized Load Balancer docs into "Online Load Balancer Configuration" and "Offline Load Balancer Configuration"; moved load-balancer YAML under `moe_config` with `backend: WIDEEP`, `max_num_tokens: 9216`, and nested `load_balancer` (`num_slots: 288`, `layer_updates_per_iter: 1`); added GB200 NUMA binding guidance (`numactl`) and EPLB shared-memory cleanup (`/dev/shm/moe_shared_*`); updated references and added `slurm_scripts/`. |
| Docs — Deployment quick-start (`docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md`) | Inserted Wide Expert Parallelism blocks (appearing in two locations) with `moe_config.backend: WIDEEP`, `max_num_tokens: 9216`, and `load_balancer` snippet (`num_slots: 288`, `layer_updates_per_iter: 1`) and linked to wide EP examples. |
| Examples — README small edits (`examples/wide_ep/*`) | Minor clarifications and heading/name adjustments to distinguish online vs offline flow and production suitability. |

Sequence Diagram(s)

(omitted — documentation-only changes; no control-flow diagrams applicable)

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

Suggested labels

Documentation, 1.0_doc

Suggested reviewers

  • qiaoxj07
  • chzblych
  • nv-guomingz
  • Shixiaowei02
  • WeiHaocheng

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (3)
examples/wide_ep/README.md (3)

85-87: Specify a language for the fenced code block (markdownlint MD040)

Add a language to satisfy markdownlint and improve rendering.

-```
+```text
 FileExistsError: [Errno 17] File exists: '/moe_shared_l0_lr0_all'

88-89: Make cleanup instruction safer and more actionable

Avoid suggesting a blanket wildcard delete without visibility; provide a minimal, interactive workflow and container note.

```diff
-you need to manually check `/dev/shm` directory and delete `/dev/shm/moe_shared_*` if any.
+Manually inspect the `/dev/shm` directory and remove only the stale MoE shared-memory files if present.
+
+```bash
+# List potential MoE shared memory files
+ls -l /dev/shm | grep 'moe_shared_' || echo "No MoE shared memory files found"
+
+# Remove interactively to avoid accidental deletes
+rm -i -- /dev/shm/moe_shared_*
+```
+
+Note: If running inside a container, perform the cleanup in the container’s `/dev/shm` (not the host).
```

96-98: Normalize References formatting: use a subheading instead of a parent list item

Removes an unnecessary parent list item and aligns structure with common Markdown style.

-- Technical Blog: Scaling Expert Parallelism in TensorRT-LLM
-  - [Part 1: Design and Implementation of Large-scale EP](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md)
-  - [Part 2: Performance Status and Optimization](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog8_Scaling_Expert_Parallelism_in_TensorRT-LLM_part2.md)
+#### Technical Blog: Scaling Expert Parallelism in TensorRT-LLM
+- [Part 1: Design and Implementation of Large-scale EP](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md)
+- [Part 2: Performance Status and Optimization](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog8_Scaling_Expert_Parallelism_in_TensorRT-LLM_part2.md)
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 2c86cee and 782aaf2.

📒 Files selected for processing (1)
  • examples/wide_ep/README.md (1 hunks)
🧰 Additional context used
🪛 LanguageTool
examples/wide_ep/README.md

[grammar] ~82-~82: There might be a mistake here.
Context: ...e stored in shared host memory. 4 ranks on same GB200 node share the same expert w...

(QB_NEW_EN)


[grammar] ~82-~82: Ensure spelling is correct
Context: ...xpert weights to save memory. Normally, these shared host memory will be cleaned up a...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~82-~82: There might be a mistake here.
Context: ...ed up at process exit, but they may not get chance to be cleaned if an abnormal exi...

(QB_NEW_EN)


[grammar] ~96-~96: There might be a mistake here.
Context: ...aling Expert Parallelism in TensorRT-LLM - [Part 1: Design and Implementation of Lar...

(QB_NEW_EN)


[grammar] ~97-~97: There might be a mistake here.
Context: ...ign and Implementation of Large-scale EP](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md) - [Part 2: Performance Status and Optimizat...

(QB_NEW_EN)

🪛 markdownlint-cli2 (0.17.2)
examples/wide_ep/README.md

85-85: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
examples/wide_ep/README.md (2)

80-86: Clarify NUMA guidance; bind both CPU and memory nodes

Reword for grammar/clarity and bind both CPU and memory nodes to avoid allocations from GPU NUMA nodes. Add a tip to verify node IDs.

-### GB200 NUMA binding
-
-GPU memory are also on NUMA nodes on GB200 and system can also use that. Bind memory to CPU nodes to avoid GPU memory being used as host memory.
-```bash
-numactl -m 0,1 <command>
-```
+### GB200 NUMA binding
+
+GPU memory is also exposed as NUMA nodes on GB200, and the OS may allocate from it. Bind CPU and memory to CPU NUMA nodes to prevent GPU memory from being used as host memory.
+```bash
+numactl --cpunodebind=0,1 --membind=0,1 <command>
+```
+Tip: Use `numactl -H` to list NUMA nodes and verify that `0,1` are CPU nodes on your system.
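
The verification tip above can be sketched in code. This is a minimal illustration only: it assumes that `numactl -H` prints one `node N cpus: ...` line per node and that GPU-memory nodes show an empty CPU list. The helper names and the sample text are hypothetical (the sample was not captured from a real GB200 system), so check the real output on your machine before relying on the result.

```python
import re


def cpu_numa_nodes(numactl_hardware_output: str) -> list[int]:
    """Return the IDs of NUMA nodes that list at least one CPU."""
    nodes = []
    for line in numactl_hardware_output.splitlines():
        match = re.match(r"node (\d+) cpus:(.*)", line.strip())
        # Assumption: memory-only (for example GPU-memory) nodes have no CPUs listed.
        if match and match.group(2).split():
            nodes.append(int(match.group(1)))
    return nodes


def numactl_prefix(nodes: list[int]) -> list[str]:
    """Build the numactl arguments that bind both CPU and memory to the given nodes."""
    ids = ",".join(str(n) for n in nodes)
    return ["numactl", f"--cpunodebind={ids}", f"--membind={ids}"]


# Hypothetical, abbreviated `numactl -H` output used only for illustration.
SAMPLE = """\
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3
node 0 size: 1000 MB
node 1 cpus: 4 5 6 7
node 1 size: 1000 MB
node 2 cpus:
node 2 size: 500 MB
node 3 cpus:
node 3 size: 500 MB
"""

print(numactl_prefix(cpu_numa_nodes(SAMPLE)))
# prints ['numactl', '--cpunodebind=0,1', '--membind=0,1']
```

The resulting list can be prepended to the launch command in place of a hard-coded `0,1`.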

87-96: Tighten grammar, heading capitalization, and add code-fence language

Improve readability, fix capitalization, and annotate the fenced block to satisfy markdownlint (MD040). Also clarify the error path context.

-### Shared Memory Clean Up on EPLB
-
-To achieve online load balance, all expert weights are stored in shared host memory. 4 ranks on same GB200 node share the same expert weights to save memory. Normally, these shared host memory will be cleaned up at process exit, but they may not get chance to be cleaned if an abnormal exit happens.
-
-In that case, when seeing the following (or similar) error message:
-```
-FileExistsError: [Errno 17] File exists: '/moe_shared_l0_lr0_all'
-```
-you need to manually check `/dev/shm` directory and delete `/dev/shm/moe_shared_*` if any.
+### Shared Memory Cleanup on EPLB
+
+To enable online load balancing, expert weights are stored in shared host memory. Four ranks on the same GB200 node share the same expert weights to save memory. Normally, this shared memory is cleaned up on process exit, but it may not be removed after an abnormal exit.
+
+If that happens and you see an error like:
+```text
+FileExistsError: [Errno 17] File exists: '/moe_shared_l0_lr0_all'
+```
+manually check the `/dev/shm` directory and delete `/dev/shm/moe_shared_*` if present. (The example path may appear without the `/dev/shm` prefix in logs.)
🧹 Nitpick comments (7)
examples/wide_ep/README.md (4)

64-66: Fix heading: use “Troubleshooting”

Use the standard single word for consistency with the rest of the docs.

-## Trouble shooting
+## Troubleshooting

24-37: Clarify ‘load_balancer’ forms (inline vs external YAML) and keep examples consistent

The first example shows load_balancer as a file path, while later sections show an inline mapping. Please clarify both supported forms and when to use each, or standardize on one form for this doc.

 An example yaml file to enable wide EP:
 ```yaml
 moe_config:
     backend: WIDEEP
     max_num_tokens: 9216
-    load_balancer: moe_load_balancer.yaml # (optional) enable load balancer
+    # Load balancer can be specified in two ways:
+    # 1) As a path to an external YAML file:
+    # load_balancer: moe_load_balancer.yaml
+    # 2) Inline as a mapping (see the Online Load Balancer Configuration below).
+    load_balancer:  # (optional) enable load balancer
 | Parameter | Description | Default | Notes |
 |-----------|-------------|---------|-------|
 | backend | MoE backend type | CUTLASS | Set to WIDEEP to enable wide EP |
 | max_num_tokens | If set, at most max_num_tokens tokens will be sent to torch.ops.trtllm.fused_moe at the same time. | None | If the number of tokens exceeds max_num_tokens, the input tensors will be split into chunks and a for loop will be used. |
-| load_balancer | Configuration for MoE load balancing | None | |
+| load_balancer | Configuration for MoE load balancing | None | |

49-53: Clarify ‘num_slots’ scope

“Must be ≥ total experts” could be interpreted per-layer or global. Specify whether it’s the total across the entire model or per layer to prevent misconfiguration.

58-59: Polish the offline/online note

Tighten wording for clarity.

```diff
-*Online EP Load Balancer is more suitable for production deployment needs to react timely to the online traffic changes.*
+*The Online EP Load Balancer is generally more suitable for production deployments, as it reacts quickly to traffic changes.*
```
docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (3)

215-226: Add hardware/applicability notes and avoid conflicts with earlier FP8 example

This section enables WIDEEP (Wide-EP), which per the support matrix applies to GB200 NVL72 with EP>8 and NVFP4. Earlier, the doc set backend: DEEPGEMM for FP8; clarify that WIDEEP is not for FP8 and should be used only in the supported scenario to avoid confusion.

 ### Wide Expert Parallelism
 
 Add the following fields to the YAML configuration file `/tmp/config.yml` to enable wide EP:
 ```yaml
 moe_config:
     backend: WIDEEP
     max_num_tokens: 9216
     load_balancer:  # configure online EP balancer
       num_slots: 288
       layer_updates_per_iter: 1

+Note:
+- WIDEEP is currently supported for GB200 NVL72 with EP > 8 and NVFP4 (see the MoE Backend Support Matrix above). It is not available for FP8.
+- If you followed the earlier FP8 example that sets backend: DEEPGEMM, switch to backend: WIDEEP only when targeting the supported GB200/NVFP4, EP>8 configuration.


215-228: Explain max_num_tokens discrepancy vs earlier examples

Earlier, you used `max_num_tokens: 3200`. Here, it’s `9216`. Add a brief rationale or guidance so users know which value to pick.

```diff
-Add the following fields to the YAML configuration file `/tmp/config.yml` to enable wide EP:
+Add the following fields to the YAML configuration file `/tmp/config.yml` to enable wide EP:
+<!-- For Wide-EP on GB200/NVL72, a larger `max_num_tokens` (e.g., 9216) is typically viable due to higher capacity; for other setups, use the earlier recommended 3200 or profile accordingly. -->
```

227-228: Cross-link offline EP configuration doc

Provide a quick pointer here as well for completeness.

-Refer to the wide EP [examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/wide_ep) for more details.
+Refer to the wide EP [examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/wide_ep) for more details, and see the [Offline EP Load Balancer](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/wide_ep/ep_load_balancer#offline-ep-load-balancer) documentation if you plan to precompute/avoid online updates.
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 782aaf2 and b4d1ab1.

📒 Files selected for processing (2)
  • docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (1 hunks)
  • examples/wide_ep/README.md (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md
🪛 LanguageTool
examples/wide_ep/README.md

[grammar] ~49-~49: There might be a mistake here.
Context: ...ameter | Description | Default | Notes | |-----------|-------------|---------|---...

(QB_NEW_EN)


[grammar] ~50-~50: There might be a mistake here.
Context: ...-------|-------------|---------|-------| | num_slots | Total number of expert s...

(QB_NEW_EN)


[grammar] ~51-~51: There might be a mistake here.
Context: ...ots | None | Must be ≥ total experts | | layer_updates_per_iter | Number of l...

(QB_NEW_EN)


[grammar] ~89-~89: There might be a mistake here.
Context: ...e stored in shared host memory. 4 ranks on same GB200 node share the same expert w...

(QB_NEW_EN)


[grammar] ~89-~89: Ensure spelling is correct
Context: ...xpert weights to save memory. Normally, these shared host memory will be cleaned up a...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~89-~89: There might be a mistake here.
Context: ...ed up at process exit, but they may not get chance to be cleaned if an abnormal exi...

(QB_NEW_EN)


[grammar] ~103-~103: There might be a mistake here.
Context: ...aling Expert Parallelism in TensorRT-LLM - [Part 1: Design and Implementation of Lar...

(QB_NEW_EN)


[grammar] ~104-~104: There might be a mistake here.
Context: ...ign and Implementation of Large-scale EP](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md) - [Part 2: Performance Status and Optimizat...

(QB_NEW_EN)

🪛 markdownlint-cli2 (0.17.2)
examples/wide_ep/README.md

92-92: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
examples/wide_ep/README.md (1)

62-63: All referenced paths in examples/wide_ep/README.md are valid

The following directories exist and match the links in the README:

  • examples/wide_ep/slurm_scripts/
  • examples/disaggregated/slurm/
  • examples/wide_ep/ep_load_balancer/

No changes needed.

@dongxuy04 dongxuy04 left a comment

LGTM, thanks.

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
examples/wide_ep/README.md (2)

99-108: Grammar fixes and clearer heading in EPLB cleanup; add fenced code language (markdownlint MD040).

Improve readability and fix capitalization; specify the language for the error snippet.

-### Shared Memory Clean Up on EPLB
-
-To achieve online load balance, all expert weights are stored in shared host memory. 4 ranks on same GB200 node share the same expert weights to save memory. Normally, these shared host memory will be cleaned up at process exit, but they may not get chance to be cleaned if an abnormal exit happens.
-
-In that case, when seeing the following (or similar) error message:
-```
-FileExistsError: [Errno 17] File exists: '/moe_shared_l0_lr0_all'
-```
-you need to manually check `/dev/shm` directory and delete `/dev/shm/moe_shared_*` if any.
+### Shared Memory Cleanup on EPLB
+
+To enable online load balancing, all expert weights are stored in shared host memory. Four ranks on the same GB200 node share the same expert weights to save memory. Normally, this shared host memory is cleaned up on process exit, but it may not be removed after an abnormal exit.
+
+If that happens and you see an error like:
+```text
+FileExistsError: [Errno 17] File exists: '/moe_shared_l0_lr0_all'
+```
+manually check the `/dev/shm` directory and delete `/dev/shm/moe_shared_*` if present.

92-98: Bind both CPU and memory NUMA nodes; reword for clarity (and avoid accidental allocation from GPU NUMA nodes).

Binding only memory (-m) can still allow CPU allocations on GPU NUMA nodes on GB200. Bind CPU and memory nodes together and add a short verification tip.

-### GB200 NUMA binding
-
-GPU memory are also on NUMA nodes on GB200 and system can also use that. Bind memory to CPU nodes to avoid GPU memory being used as host memory.
-```bash
-numactl -m 0,1 <command>
-```
+### GB200 NUMA binding
+
+GPU memory is also exposed as NUMA nodes on GB200, and the OS may allocate from it. Bind allocations to CPU NUMA nodes to prevent GPU memory from being used as host memory.
+```bash
+numactl --cpunodebind=0,1 --membind=0,1 <command>
+```
+Tip: Use `numactl -H` to list NUMA nodes and verify that `0,1` are CPU nodes on your system.
🧹 Nitpick comments (4)
examples/wide_ep/README.md (4)

22-29: Fix list style and minor grammar in Prerequisites (markdownlint MD004, phrasing).

Use dashes for unordered lists and tighten wording/capitalization.

-### Prerequisites
-
-* GPU: GB200 NVL72, H20, or RTX PRO 6000 Blackwell Workstation Edition.
-* OS: Linux
-* Drivers: CUDA Driver 575 or Later
-* Docker with NVIDIA Container Toolkit installed
-* Python3 and python3-pip (Optional, for accuracy evaluation only)
+### Prerequisites
+
+- GPU: GB200 NVL72, H20, or RTX PRO 6000 Blackwell Workstation Edition.
+- OS: Linux
+- Drivers: CUDA driver 575 or later
+- Docker with NVIDIA Container Toolkit installed
+- Python 3 and pip3 (optional; for accuracy evaluation only)

30-33: Use “set up” (verb), remove bare URL (markdownlint MD034), and clarify wording.

This avoids lint warnings and reads more clearly.

-For GB200 NVL72, to make sure that Multi-Node NVLink (MNNVL) is correctly setup, check if the path `/dev/nvidia-caps-imex-channels` exists in the container. If the path doesn't exist, mount it when launching the Docker container.
-
-For more information on NVIDIA IMEX service for NVLink networks, refer to https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/overview.html.
+For GB200 NVL72, to ensure that Multi-Node NVLink (MNNVL) is correctly set up, check that the path `/dev/nvidia-caps-imex-channels` exists in the container. If it is missing, mount it when launching the Docker container.
+
+For more information on the NVIDIA IMEX service for NVLink networks, see the NVIDIA Multi-Node NVLink Systems IMEX Guide: <https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/overview.html>.

50-59: Clarify whether load_balancer accepts a file path, inline mapping, or both.

The earlier example uses a file path (moe_load_balancer.yaml) while this example shows an inline mapping. If both are supported, state it explicitly to avoid confusion; if only one is supported, make the examples consistent.

Would you like me to draft a short “Note:” under the examples that says: “load_balancer can be a path to a YAML file or an inline mapping” with a minimal example of each?


66-71: Tighten grammar in the production-suitability note.

The current sentence is awkward.

-*Online EP Load Balancer is more suitable for production deployment needs to react timely to the online traffic changes.*
+*The Online EP Load Balancer is better suited for production deployments because it can react quickly to traffic changes.*
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between b4d1ab1 and f2125e2.

📒 Files selected for processing (1)
  • examples/wide_ep/README.md (3 hunks)
🧰 Additional context used
🪛 LanguageTool
examples/wide_ep/README.md

[grammar] ~25-~25: There might be a mistake here.
Context: ...ackwell Workstation Edition. * OS: Linux * Drivers: CUDA Driver 575 or Later * Dock...

(QB_NEW_EN)


[grammar] ~26-~26: There might be a mistake here.
Context: ...inux * Drivers: CUDA Driver 575 or Later * Docker with NVIDIA Container Toolkit ins...

(QB_NEW_EN)


[grammar] ~27-~27: There might be a mistake here.
Context: ... with NVIDIA Container Toolkit installed * Python3 and python3-pip (Optional, for a...

(QB_NEW_EN)


[grammar] ~61-~61: There might be a mistake here.
Context: ...ameter | Description | Default | Notes | |-----------|-------------|---------|---...

(QB_NEW_EN)


[grammar] ~62-~62: There might be a mistake here.
Context: ...-------|-------------|---------|-------| | num_slots | Total number of expert s...

(QB_NEW_EN)


[grammar] ~63-~63: There might be a mistake here.
Context: ...ots | None | Must be ≥ total experts | | layer_updates_per_iter | Number of l...

(QB_NEW_EN)


[grammar] ~101-~101: There might be a mistake here.
Context: ...e stored in shared host memory. 4 ranks on same GB200 node share the same expert w...

(QB_NEW_EN)


[grammar] ~101-~101: Ensure spelling is correct
Context: ...xpert weights to save memory. Normally, these shared host memory will be cleaned up a...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~101-~101: There might be a mistake here.
Context: ...ed up at process exit, but they may not get chance to be cleaned if an abnormal exi...

(QB_NEW_EN)


[grammar] ~115-~115: There might be a mistake here.
Context: ...aling Expert Parallelism in TensorRT-LLM - [Part 1: Design and Implementation of Lar...

(QB_NEW_EN)


[grammar] ~116-~116: There might be a mistake here.
Context: ...ign and Implementation of Large-scale EP](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md) - [Part 2: Performance Status and Optimizat...

(QB_NEW_EN)

🪛 markdownlint-cli2 (0.17.2)
examples/wide_ep/README.md

24-24: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


25-25: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


26-26: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


27-27: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


28-28: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


32-32: Bare URL used

(MD034, no-bare-urls)


104-104: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
@kaiyux
Member Author

kaiyux commented Aug 19, 2025

/bot skip --comment "doc changes"

@kaiyux kaiyux enabled auto-merge (squash) August 19, 2025 10:07
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
examples/wide_ep/README.md (2)

92-98: Clarify NUMA guidance, bind both CPU and memory nodes, and fix grammar.

Bind CPU and memory nodes together to avoid accidental allocation from GPU NUMA nodes. Add a tip to verify node IDs.

-### GB200 NUMA binding
-
-GPU memory are also on NUMA nodes on GB200 and system can also use that. Bind memory to CPU nodes to avoid GPU memory being used as host memory.
-```bash
-numactl -m 0,1 <command>
-```
+### GB200 NUMA binding
+
+GPU memory is also exposed as NUMA nodes on GB200, and the OS may allocate from it. Bind to CPU NUMA nodes for both CPU and memory to prevent GPU memory from being used as host memory.
+```bash
+numactl --cpunodebind=0,1 --membind=0,1 <command>
+```
+Tip: Run `numactl -H` to list NUMA nodes and confirm that `0,1` are CPU nodes on your system.
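
To make the "confirm that `0,1` are CPU nodes" step concrete, here is a minimal Python sketch. It assumes the standard Linux sysfs layout (`/sys/devices/system/node/nodeN/cpulist`) and assumes that the GPU-memory NUMA nodes on GB200 have no CPUs attached, so their `cpulist` is empty. The helper names `cpu_numa_nodes` and `numactl_prefix` are hypothetical and not part of TensorRT-LLM.

```python
import os
import re


def cpu_numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Return the sorted IDs of NUMA nodes that have at least one CPU attached."""
    nodes = []
    for entry in os.listdir(sysfs_root):
        match = re.fullmatch(r"node(\d+)", entry)
        if not match:
            continue  # skip non-node entries such as "online" or "possible"
        cpulist_path = os.path.join(sysfs_root, entry, "cpulist")
        try:
            with open(cpulist_path) as f:
                cpulist = f.read().strip()
        except OSError:
            continue  # skip nodes whose cpulist cannot be read
        if cpulist:  # assumption: memory-only nodes report an empty CPU list
            nodes.append(int(match.group(1)))
    return sorted(nodes)


def numactl_prefix(nodes):
    """Build numactl arguments that bind both CPU and memory to the given nodes."""
    ids = ",".join(str(n) for n in nodes)
    return ["numactl", f"--cpunodebind={ids}", f"--membind={ids}"]


if __name__ == "__main__":
    # Print the prefix to place in front of the real launch command.
    print(" ".join(numactl_prefix(cpu_numa_nodes())))
```

If the printed node list differs from `0,1` on a given system, use the printed list in place of the hard-coded IDs in the README command.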

99-108: Grammar fixes and clearer heading in EPLB cleanup section; add code fence language (MD040).

Tighten language, correct capitalization, and specify code block language.

-### Shared Memory Clean Up on EPLB
-
-To achieve online load balance, all expert weights are stored in shared host memory. 4 ranks on same GB200 node share the same expert weights to save memory. Normally, these shared host memory will be cleaned up at process exit, but they may not get chance to be cleaned if an abnormal exit happens.
-
-In that case, when seeing the following (or similar) error message:
-```
-FileExistsError: [Errno 17] File exists: '/moe_shared_l0_lr0_all'
-```
-you need to manually check `/dev/shm` directory and delete `/dev/shm/moe_shared_*` if any.
+### Shared Memory Cleanup on EPLB
+
+To enable online load balancing, expert weights are stored in shared host memory. Four ranks on the same GB200 node share the same expert weights to save memory. Normally, this shared memory is cleaned up on process exit, but it may not be removed after an abnormal exit.
+
+If that happens and you see an error like:
+```text
+FileExistsError: [Errno 17] File exists: '/moe_shared_l0_lr0_all'
+```
+manually check the `/dev/shm` directory and delete any `/dev/shm/moe_shared_*` entries.
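
The manual cleanup can also be scripted. The following Python sketch lists leftover entries that match the `/dev/shm/moe_shared_*` pattern quoted above and removes them only when asked to. The function names and the `dry_run` flag are hypothetical, and the sketch assumes that no running process still uses these segments.

```python
import glob
import os


def find_stale_moe_segments(shm_dir="/dev/shm", pattern="moe_shared_*"):
    """Return the sorted paths in shm_dir that match the shared-memory name pattern."""
    return sorted(glob.glob(os.path.join(shm_dir, pattern)))


def remove_stale_moe_segments(shm_dir="/dev/shm", pattern="moe_shared_*", dry_run=True):
    """Delete matching entries; with dry_run=True only report what would be deleted."""
    affected = []
    for path in find_stale_moe_segments(shm_dir, pattern):
        if not dry_run:
            os.unlink(path)  # removes the leftover shared-memory file
        affected.append(path)
    return affected


if __name__ == "__main__":
    # Default is a dry run: inspect the list before deleting anything.
    for path in remove_stale_moe_segments():
        print("would remove:", path)
```

Defaulting to a dry run is a deliberate choice here: deleting a segment that a live process still maps would break that process, so the list should be reviewed first.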
🧹 Nitpick comments (7)
examples/wide_ep/README.md (7)

24-29: Tighten prerequisites wording and align list style with markdownlint (MD004).

Use dashes, fix capitalization, and avoid trailing periods for consistency.

-* GPU: GB200 NVL72, H20, or RTX 6000D.
-* OS: Linux
-* Drivers: CUDA Driver 575 or Later
-* Docker with NVIDIA Container Toolkit installed
-* Python3 and python3-pip (Optional, for accuracy evaluation only)
+- GPU: GB200 NVL72, H20, or RTX 6000D
+- OS: Linux
+- Driver: NVIDIA CUDA driver 575 or later
+- Docker with NVIDIA Container Toolkit installed
+- Python 3 and pip (optional; for accuracy evaluation only)

30-33: Fix “set up”, add a concrete mount example, and replace bare URL with a titled link (MD034).

Small grammar improvement and actionable example reduce ambiguity; link text improves readability.

-For GB200 NVL72, to make sure that Multi-Node NVLink (MNNVL) is correctly setup, check if the path `/dev/nvidia-caps-imex-channels` exists in the container. If the path doesn't exist, mount it when launching the Docker container.
-
-For more information on NVIDIA IMEX service for NVLink networks, refer to https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/overview.html.
+For GB200 NVL72, to ensure Multi-Node NVLink (MNNVL) is correctly set up, check whether `/dev/nvidia-caps-imex-channels` exists inside the container. If it does not exist, bind-mount it when launching the Docker container. For example:
+```bash
+docker run --gpus all --rm -it \
+  --net=host --ipc=host \
+  -v /dev/nvidia-caps-imex-channels:/dev/nvidia-caps-imex-channels:ro \
+  <image> ...
+```
+
+For more information on the NVIDIA IMEX service for NVLink networks, see the [IMEX Guide](https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/overview.html).

36-42: Clarify whether load_balancer accepts a file path and/or inline object; avoid mixed examples without context.

The section below shows inline fields, while this snippet uses a file path. Confirm both forms are supported, and make the comment explicit.

-    load_balancer: moe_load_balancer.yaml # (optional) enable load balancer
+    load_balancer: moe_load_balancer.yaml # optional: path to a YAML config (see below for inline option)

If only one form is supported, align both examples accordingly. If both are supported, consider adding brief “Option A (file)” / “Option B (inline)” subheadings.


66-71: Reword the offline/online guidance for clarity.

Streamline the sentence and fix grammar.

-*Online EP Load Balancer is more suitable for production deployment needs to react timely to the online traffic changes.*
+*The Online EP Load Balancer is more suitable for production deployments because it reacts promptly to traffic changes.*

76-76: Fix heading typo: “Trouble shooting” → “Troubleshooting”.

Improves professionalism in user-facing docs.

-## Trouble shooting
+## Troubleshooting

80-90: Transparent Huge Pages: tighten wording and add a sudo-safe command.

  • Use “Transparent Huge Pages (THP)” and correct subject-verb agreement.
  • Remove shell prompt markers in code blocks.
  • Use tee with sudo to avoid redirection permission issues.
-When getting exception `madvise(MADV_HUGEPAGE) failed.`, check if Transparent Hugepages has been enabled.
+If you see `madvise(MADV_HUGEPAGE) failed.`, verify that Transparent Huge Pages (THP) are enabled.
 ```bash
->$ cat /sys/kernel/mm/transparent_hugepage/enabled
+cat /sys/kernel/mm/transparent_hugepage/enabled
 always [madvise] never
->$ cat /sys/kernel/mm/transparent_hugepage/defrag
+cat /sys/kernel/mm/transparent_hugepage/defrag
 always defer defer+madvise [madvise] never
 ```
-If `never` is highlighted, enable Transparent HugePages by the following command.
+If `never` is highlighted, enable Transparent Huge Pages with:
-echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
+echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled >/dev/null

61-65: Minor table clarity: specify “number of experts”.

Small wording tweak improves precision.

-| `num_slots` | Total number of expert slots | `None` | Must be ≥ total experts |
+| `num_slots` | Total number of expert slots | `None` | Must be ≥ total number of experts |
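
Returning to the Transparent Huge Pages check in the 80-90 comment: the "highlighted" value is the one the kernel prints in square brackets. A small Python sketch can extract it rather than relying on reading the output by eye. The helper names are hypothetical, and the sysfs path is the one quoted in the README excerpt.

```python
def active_thp_setting(text):
    """Return the bracketed (active) value from a THP sysfs line, or None if absent."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_thp_setting(path="/sys/kernel/mm/transparent_hugepage/enabled"):
    """Read a THP sysfs file and return its active value."""
    with open(path) as f:
        return active_thp_setting(f.read())


if __name__ == "__main__":
    setting = read_thp_setting()
    if setting == "never":
        print("THP is disabled; madvise(MADV_HUGEPAGE) is expected to fail.")
    else:
        print("THP setting:", setting)
```

For the sample output shown in the README excerpt, `active_thp_setting("always [madvise] never")` returns `"madvise"`.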
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f2125e2 and 23f0ab2.

📒 Files selected for processing (1)
  • examples/wide_ep/README.md (3 hunks)
🔇 Additional comments (3)
examples/wide_ep/README.md (3)

50-59: Online LB YAML example reads well.

Inline config is clear and consistent with the table below.


72-75: SLURM guidance looks good.

Linking to slurm_scripts and disaggregated scripts is helpful for users.


115-117: References section looks good.

Links and labels are clear and helpful.

@tensorrt-cicd
Collaborator

PR_Github #15768 [ skip ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15768 [ skip ] completed with state SUCCESS
Skipping testing for commit 23f0ab2

@kaiyux kaiyux merged commit 9a74ee9 into NVIDIA:main Aug 19, 2025
5 checks passed
@kaiyux kaiyux deleted the user/kaiyu/large_ep_doc branch August 19, 2025 11:04