[None][doc] Update gpt oss doc by bobboli · Pull Request #6954 · NVIDIA/TensorRT-LLM · GitHub

Conversation

@bobboli
Collaborator

@bobboli bobboli commented Aug 15, 2025

Summary by CodeRabbit

  • Documentation
    • Reworked deployment guide into an end‑to‑end flow (install → benchmark → serve), added Day‑0 support for GPT‑OSS and clarified tps/user vs tps/gpu with sample targets.
    • Added install paths (NGC dev image, build‑from‑source, regular releases); shifted to GPU‑centric prerequisites, weight caching, local model path, dataset prep, and env‑driven config variables.
    • Unified benchmarking with trtllm‑bench, gave explicit benchmark examples, updated launch/run guidance, sample requests/health checks, and expanded MoE/Triton H200 guidance.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
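
For reference, a couple of hypothetical invocations assembled from the flags documented above (the pipeline ID and stage names are placeholders taken from the examples):

```
/bot run --stage-list "A10-PyTorch-1" --disable-fail-fast
/bot run --reuse-test 12345 --gpu-type "A30, H100_PCIe" --post-merge
```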

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@bobboli bobboli requested review from hlu1 and jgangani August 15, 2025 16:11
@bobboli bobboli requested a review from a team as a code owner August 15, 2025 16:11
@bobboli bobboli requested review from QiJune and nv-guomingz August 15, 2025 16:11
@coderabbitai
Contributor

coderabbitai bot commented Aug 15, 2025

📝 Walkthrough

Rewrites the GPT‑OSS deployment blog into an install → benchmark → serve workflow: adds Day‑0 GPT‑OSS support, GPU‑centric prerequisites, multiple install paths (NGC dev image, build from source, releases), replaces mpirun-based serve/bench steps with trtllm-bench and env-driven configs, expands weight/cache handling, adds H200/Triton MoE guidance, and updates testing and troubleshooting.

Changes

Cohort / File(s) Summary of Changes
Tech blog: deploy & benchmark overhaul
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md
Complete rewrite of the deployment guide: introduces Day‑0 support for gpt‑oss-120b; switches prerequisites to a GPU-centric baseline (B200/GB200/H200); adds install options (NGC dev container, build from source, releases); reorganizes launch flow (NGC dev container, explicit docker run, removes TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL); explains weight caching (~/.cache) and local model path (${local_model_path}); replaces per‑GPU/mpirun instructions with unified Performance Benchmarking and Model Serving workflow using trtllm-bench and env vars (${max_batch_size}, ${num_gpus}); consolidates latency and throughput benchmarking commands and targets; adds H200/Triton MoE backend guidance and OpenAI‑compatible server testing (curl example); expands troubleshooting and notes on init times.
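
For illustration, the env-driven variables referenced above might be set like this before running the bench/serve commands (placeholder values, not recommendations from the guide):

```bash
# Placeholder values for the variables used throughout the rewritten guide;
# adjust num_gpus and max_batch_size to your hardware and latency/throughput goal.
export num_gpus=8
export max_batch_size=1                         # e.g. 1 for low latency, 640 for max throughput
export local_model_path=/path/to/gpt-oss-120b   # optional: local weights instead of the HF cache
```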

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant Installer
  participant Cache as ModelCache
  participant Bench as trtllm-bench
  participant Server as TRT-LLM Server

  User->>Installer: Install TensorRT-LLM (NGC/dev image, pip, or build from source)
  User->>User: Configure env vars (max_batch_size, num_gpus, local_model_path)
  User->>Cache: Populate ~/.cache or ${local_model_path} with weights
  User->>Bench: Run trtllm-bench (dataset, concurrency, batch size)
  Bench->>Cache: Load model/weights
  Bench-->>User: Report tps/user and tps/gpu
  User->>Server: Launch TRT-LLM Server (docker run / trtllm-serve)
  User->>Server: Send sample request (curl)
  Server-->>User: Return JSON response
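
The "sample request (curl)" step in the diagram corresponds to an OpenAI-compatible chat completion call. A minimal sketch, assuming the server is listening on port 8000 and exposes the standard OpenAI-style routes:

```bash
# Health check (route name assumed; verify against the trtllm-serve docs).
curl -s http://localhost:8000/health

# Minimal chat completion request mirroring the example payload in the guide.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "openai/gpt-oss-120b",
        "messages": [
          {"role": "user", "content": "What is NVIDIA'\''s advantage for inference?"}
        ],
        "max_tokens": 128
      }'
```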

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

1.0_doc

Suggested reviewers

  • schetlur-nv
  • nv-guomingz
  • juney-nvidia
  • nekorobov


@bobboli
Collaborator Author

bobboli commented Aug 15, 2025

/bot skip --comment "doc update"

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🔭 Outside diff range comments (1)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (1)

277-346: Replace unrealistic example response with a concise, valid OpenAI-style JSON

The current example includes non-API tokens (“<|channel|>analysis…”) and an excessively long, markdown-heavy response that may confuse users. Provide a minimal, realistic response that mirrors the server’s schema.

Proposed replacement:

-```bash
-{
-  "id": "chatcmpl-c440e2a3e7e14cd699295afc3739bf42",
-  "object": "chat.completion",
-  "created": 1754358426,
-  "model": "openai/gpt-oss-120b",
-  "choices": [
-    {
-      "index": 0,
-      "message": {
-        "role": "assistant",
-        "content": "<|channel|>analysis<|message|>The user asks: \"What is NVIDIA's advantage for inference?\" ...
-        ... (omitted for brevity) ...
-      },
-      "logprobs": null,
-      "finish_reason": "length",
-      "stop_reason": null,
-      "disaggregated_params": null
-    }
-  ],
-  "usage": {
-    "prompt_tokens": 17,
-    "total_tokens": 1041,
-    "completion_tokens": 1024
-  },
-  "prompt_token_ids": null
-}
-```
+```bash
+{
+  "id": "chatcmpl-1234567890",
+  "object": "chat.completion",
+  "created": 1721000000,
+  "model": "openai/gpt-oss-120b",
+  "choices": [
+    {
+      "index": 0,
+      "message": {
+        "role": "assistant",
+        "content": "NVIDIA’s inference advantage comes from specialized Tensor Cores, the TensorRT compiler/runtime, and an optimized software stack (CUDA, cuDNN, Triton) that together deliver high throughput at low latency."
+      },
+      "finish_reason": "stop"
+    }
+  ],
+  "usage": {
+    "prompt_tokens": 17,
+    "completion_tokens": 42,
+    "total_tokens": 59
+  }
+}
+```
🧹 Nitpick comments (8)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (8)

54-55: Grammar: “is has been” → “has been”

Minor but visible grammar issue.

-The support for gpt-oss is has been [merged](https://github.com/NVIDIA/TensorRT-LLM/pull/6645) into the **main branch** of TensorRT-LLM.
+Support for gpt-oss has been [merged](https://github.com/NVIDIA/TensorRT-LLM/pull/6645) into the **main branch** of TensorRT-LLM.

27-28: Grammar: “follow docker command” → “following docker command”

-Run the follow docker command to start the TensorRT-LLM container in interactive mode:
+Run the following docker command to start the TensorRT-LLM container in interactive mode:

23-23: Markdown style: remove extra space after heading marker

-###  NGC Docker Image of dev branch
+### NGC Docker Image of dev branch

140-141: Typo: “requeests” → “requests”; tighten the wording

Also reads better as “a sufficient number of requests.”

-`--max_batch_size` controls the maximum batch size that the inference engine could serve, while `--concurrency` is the number of concurrent requests that the benchmarking client is sending. `--num_requests` is set to 10 times of `--concurrency` to run enough number of requeests.
+`--max_batch_size` controls the maximum batch size the inference engine can serve, while `--concurrency` is the number of concurrent requests the benchmarking client sends. `--num_requests` is set to 10× `--concurrency` to run a sufficient number of requests.

174-178: Grammar and tone: bullet intros should be parallel and capitalized

“Compare to” → “Compared to”; start each bullet with “Set …” for consistency.

-Compare to the low-latency configuration, we:
-- set `enable_attention_dp` to `true` to use attention DP which is better for high throughput.
-- set `stream_interval` to 10 to stream the results to the client every 10 tokens. At high concurrency the detokenization overhead of the streaming mode cannot be hidden under GPU execution time, `stream_interval` is a workaround to reduce the overhead.
-- set `moe_config.backend` to `CUTLASS` to use the `CUTLASS` MoE kernels which are optimized for high throughput.
+Compared to the low-latency configuration, we:
+- Set `enable_attention_dp` to `true` to use attention DP, which is better for high throughput.
+- Set `stream_interval` to 10 to stream the results to the client every 10 tokens. At high concurrency, the detokenization overhead of streaming cannot be hidden under GPU execution time; `stream_interval` mitigates the overhead.
+- Set `moe_config.backend` to `CUTLASS` to use MoE kernels optimized for high throughput.

269-269: Typo in sample request payload: missing apostrophe

JSON example should read “NVIDIA's advantage”.

-            "content": "What is NVIDIAs advantage for inference?"
+            "content": "What is NVIDIA's advantage for inference?"

50-51: Minor readability: add commas and possessive

Nit, but this sentence reads more naturally with punctuation and possessive.

-Lastly the container mounts your user `.cache` directory to save the downloaded model checkpoints which are saved to `~/.cache/huggingface/hub/` by default.
+Lastly, the container mounts your user's `.cache` directory to save downloaded model checkpoints, which are stored in `~/.cache/huggingface/hub/` by default.

367-374: Minor grammar in Troubleshooting bullets

A few small tweaks improve clarity.

-- Add `print_iter_log: true` to extra LLM API options YAML file to inspect the per-iteration log.
+- Add `print_iter_log: true` to the extra LLM API options YAML file to inspect the per-iteration log.
-- For performance issues, check GPU utilization with `nvidia-smi` while the server is running
+- For performance issues, check GPU utilization with `nvidia-smi` while the server is running.
-- If the container fails to start, verify that the NVIDIA Container Toolkit is properly installed
+- If the container fails to start, verify that the NVIDIA Container Toolkit is properly installed.
-- For connection issues, make sure port 8000 is not being used by another application
+- For connection issues, make sure port 8000 is not being used by another application.
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 6037fe3 and 68c275f.

📒 Files selected for processing (1)
  • docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (3 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md

[grammar] ~13-~13: There might be a mistake here.
Context: ...for lower latency and higher throughput) - Fast SSD storage for model weights - Acc...

(QB_NEW_EN)


[grammar] ~14-~14: There might be a mistake here.
Context: ...ut) - Fast SSD storage for model weights - Access to the gpt-oss-120b model checkpo...

(QB_NEW_EN)


[grammar] ~42-~42: There might be a mistake here.
Context: ...n/bash ``` Explanation of the command: - Automatically removes the container when...

(QB_NEW_EN)


[style] ~112-~112: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...A graph padding. - moe_config.backend is set to TRTLLM to use the trtllm-gen...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[grammar] ~144-~144: There might be a mistake here.
Context: ...ieve 420 tps/user with 8x B200 GPUs and max batch size 1. ### Max-Throughput ...

(QB_NEW_EN)


[grammar] ~149-~149: There might be a mistake here.
Context: ...1k and osl=2k dataset, currently we can achieve batch size 640 with 8x B200 GPUs. ```b...

(QB_NEW_EN)


[grammar] ~175-~175: There might be a mistake here.
Context: ... DP which is better for high throughput. - set stream_interval to 10 to stream th...

(QB_NEW_EN)


[grammar] ~176-~176: There might be a mistake here.
Context: ... is a workaround to reduce the overhead. - set moe_config.backend to CUTLASS to...

(QB_NEW_EN)


[grammar] ~203-~203: There might be a mistake here.
Context: ...for MoE, so we set --ep to num_gpus. - When using enable_attention_dp, `max_b...

(QB_NEW_EN)


[grammar] ~370-~370: There might be a mistake here.
Context: ...arameters. - Add print_iter_log: true to extra LLM API options YAML file to insp...

(QB_NEW_EN)


[grammar] ~370-~370: There might be a mistake here.
Context: ...L file to inspect the per-iteration log. - For performance issues, check GPU utiliz...

(QB_NEW_EN)

🪛 markdownlint-cli2 (0.17.2)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md

23-23: Multiple spaces after hash on atx style heading

(MD019, no-multiple-space-atx)

🔇 Additional comments (4)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (4)

62-66: Solid end-to-end restructuring and clarity improvements

The doc now reads like a reproducible workflow (install → benchmark → serve), with clear separation of low-latency vs. throughput paths and environment-driven configs. Nice work consolidating benchmarking around trtllm-bench and calling out MoE/H200 guidance explicitly.


203-208: Verify “DP4EP4 with 4x B200 GPUs” claim and terminology

“DP4EP4 using 4x B200 GPUs” reads as 16-way parallelism (4×4). If “DP” here refers to attention data-parallel shards (not classic replica DP), the phrasing can mislead. Please confirm the intended configuration and consider clarifying the meaning of DP in this context (e.g., “attention DP degree 4 with EP 4 on 4 GPUs”).

Would you like me to propose a short clarification sentence once you confirm the exact parallelism topology?


118-138: Confirm subcommand alignment: low-latency section uses throughput mode

For low-latency tps/user evaluation, confirm trtllm-bench throughput is the intended subcommand (vs. a dedicated latency mode, if available). If “throughput” is correct, consider a brief note that tps/user at concurrency=1 is a good proxy for min-latency.

If there is a latency-specific mode, I can help rework the snippet accordingly.
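
If the `throughput` subcommand is indeed the intended path, a concurrency=1 "min-latency proxy" run might look roughly like this (a sketch assembled from flags quoted elsewhere in this review; the dataset path is a placeholder, and exact flag names should be checked against `trtllm-bench throughput --help`):

```bash
# Sketch of a concurrency=1 low-latency benchmark run; not copied from the doc.
trtllm-bench \
  --model openai/gpt-oss-120b \
  throughput \
  --dataset ./dataset_isl1k_osl2k.jsonl \
  --backend pytorch \
  --tp ${num_gpus} \
  --ep 1 \
  --max_batch_size ${max_batch_size} \
  --kv_cache_free_gpu_mem_fraction 0.9 \
  --concurrency 1 \
  --num_requests 10 \
  --extra_llm_api_options low_latency.yaml
```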


69-79: Confirmed: Hugging Face handle openai/gpt-oss-120b is valid and publicly accessible

Verified: the handle exists on Hugging Face (model page and models API returned HTTP 200). The tokenizer/model references in the doc are correct and metadata is publicly reachable (no special auth required to view the model page/API).

Locations referencing the handle (for your awareness):

  • docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md — lines ~72 (tokenizer), 120–130, 180–190 (trtllm-bench --model), ~217, ~235, ~265, ~284 (JSON/examples)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py — around lines 2490, 2521, 2559, 2589 (local gpt-oss references)
  • examples/models/core/gpt_oss/README.md
  • README.md (blog link)

@tensorrt-cicd
Collaborator

PR_Github #15457 [ skip ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (2)

217-227: Fix broken multi-line command (line-continuation) and incorrect KV-cache flag in low-latency serve example

  • Inline comments after a trailing backslash break shell line-continuation (Lines 217 and 225).
  • Flag name should be --kv_cache_free_gpu_mem_fraction to match CLI and earlier examples (Line 224).

Apply:

-  gpt-oss-120b \  # Or ${local_model_path}
+  gpt-oss-120b \
+  # Or, alternatively, pass a local path instead of the model name:
+  # ${local_model_path}
@@
-  --kv_cache_free_gpu_memory_fraction 0.9 \
+  --kv_cache_free_gpu_mem_fraction 0.9 \
@@
-  --max_batch_size ${max_batch_size} \  # E.g., 1
+  --max_batch_size ${max_batch_size} \
+  # E.g., 1

234-245: Repeat: fix line-continuation and KV-cache flag in max-throughput serve example

The same two issues appear here (Lines 235, 242, and 243).

-  gpt-oss-120b \  # Or ${local_model_path}
+  gpt-oss-120b \
+  # Or, alternatively, pass a local path instead of the model name:
+  # ${local_model_path}
@@
-  --kv_cache_free_gpu_memory_fraction 0.9 \
+  --kv_cache_free_gpu_mem_fraction 0.9 \
@@
-  --max_batch_size ${max_batch_size} \  # E.g., 640 
+  --max_batch_size ${max_batch_size} \
+  # E.g., 640
🧹 Nitpick comments (5)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (5)

23-23: Fix markdown heading spacing (lint: MD019)

Remove the extra space after the ATX heading marker.

-###  NGC Docker Image of dev branch
+### NGC Docker Image of dev branch

54-54: Grammar fix and clarity improvements

-The support for gpt-oss is has been [merged](https://github.com/NVIDIA/TensorRT-LLM/pull/6645) into the **main branch** of TensorRT-LLM. We are continuing to optimize the performance of gpt-oss, you can build the TensorRT-LLM from source to get the latest features and support. Please refer to the [doc](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html) if you want to build from source by yourself.
+The support for gpt-oss has been [merged](https://github.com/NVIDIA/TensorRT-LLM/pull/6645) into the **main branch** of TensorRT-LLM. We are continuing to optimize the performance of gpt-oss; you can build TensorRT-LLM from source to get the latest features and support. Please refer to the [doc](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html) if you want to build from source yourself.

140-140: Typos and concise phrasing in benchmark description

-`--max_batch_size` controls the maximum batch size that the inference engine could serve, while `--concurrency` is the number of concurrent requests that the benchmarking client is sending. `--num_requests` is set to 10 times of `--concurrency` to run enough number of requeests.
+`--max_batch_size` controls the maximum batch size that the inference engine can serve, while `--concurrency` is the number of concurrent requests that the benchmarking client is sending. `--num_requests` is set to 10× `--concurrency` to run enough requests.

174-176: Grammar: “Compared to”

-Compare to the low-latency configuration, we:
+Compared to the low-latency configuration, we:

355-360: Clarify Triton install vs preinstalled kernels in NGC image

The text first states the NGC container already includes Triton MoE kernels (no install needed), then immediately instructs building/installing Triton. Clarify this to avoid confusion.

-OpenAI ships a set of Triton kernels optimized for its MoE models. TensorRT-LLM can leverage these kernels for Hopper based GPUs like NVIDIA's H200 for best performance. The NGC TensorRT-LLM container image mentioned above already includes the required kernels so you do not need to build or install them. It is highly recommended to enable them with the steps below:
+OpenAI ships a set of Triton kernels optimized for its MoE models. TensorRT-LLM can leverage these kernels for Hopper-based GPUs like NVIDIA's H200 for best performance. The NGC TensorRT-LLM container image mentioned above already includes the required kernels, so you typically do not need to build or install them. If you are building from source or using a base image without these kernels, follow the steps below:
@@
-### Installing OpenAI Triton
+### Triton installation (only if building from source)
@@
-You will need to build and install a specific version of Triton. Please follow the instructions in this [link](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gpt_oss#using-openai-triton-kernels-for-moe).
+If you are building TensorRT-LLM from source or using a base image without the prebuilt kernels, build and install the required Triton version by following this [guide](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gpt_oss#using-openai-triton-kernels-for-moe). Otherwise (when using the NGC container above), you can skip this step.

Please confirm the NGC dev image tag indeed includes the Triton MoE kernels for the gpt-oss path so the “skip” guidance is accurate for all users of that image.
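
One quick way to confirm this inside the image (assuming, as noted later in the thread, that the container exports `TRITON_ROOT`):

```bash
# Sanity check inside the NGC dev container: is Triton preinstalled?
echo "TRITON_ROOT=${TRITON_ROOT:-<not set>}"
python3 -c "import triton; print('triton', triton.__version__)"
```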

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 68c275f and 2927e2a.

📒 Files selected for processing (1)
  • docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (3 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md

[grammar] ~13-~13: There might be a mistake here.
Context: ...for lower latency and higher throughput) - Fast SSD storage for model weights - Acc...

(QB_NEW_EN)


[grammar] ~14-~14: There might be a mistake here.
Context: ...ut) - Fast SSD storage for model weights - Access to the gpt-oss-120b model checkpo...

(QB_NEW_EN)


[grammar] ~42-~42: There might be a mistake here.
Context: ...n/bash ``` Explanation of the command: - Automatically removes the container when...

(QB_NEW_EN)


[grammar] ~43-~43: There might be a mistake here.
Context: ...oves the container when stopped (--rm) - Allows container to interact with the ho...

(QB_NEW_EN)


[grammar] ~44-~44: There might be a mistake here.
Context: ...y for optimal performance (--ipc=host) - Runs the container in interactive mode (...

(QB_NEW_EN)


[grammar] ~45-~45: There might be a mistake here.
Context: ...he container in interactive mode (-it) - Sets up shared memory and stack limits f...

(QB_NEW_EN)


[grammar] ~46-~46: There might be a mistake here.
Context: ...and stack limits for optimal performance - Maps port 8000 from the container to the...

(QB_NEW_EN)


[grammar] ~47-~47: There might be a mistake here.
Context: ...port 8000 from the container to the host - Enables PDL for low-latency perf optimiz...

(QB_NEW_EN)


[style] ~112-~112: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...A graph padding. - moe_config.backend is set to TRTLLM to use the trtllm-gen...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[grammar] ~144-~144: There might be a mistake here.
Context: ...ieve 420 tps/user with 8x B200 GPUs and max batch size 1. ### Max-Throughput ...

(QB_NEW_EN)


[grammar] ~149-~149: There might be a mistake here.
Context: ...1k and osl=2k dataset, currently we can achieve batch size 640 with 8x B200 GPUs. ```b...

(QB_NEW_EN)


[grammar] ~175-~175: There might be a mistake here.
Context: ... DP which is better for high throughput. - set stream_interval to 10 to stream th...

(QB_NEW_EN)


[grammar] ~176-~176: There might be a mistake here.
Context: ... is a workaround to reduce the overhead. - set moe_config.backend to CUTLASS to...

(QB_NEW_EN)


[grammar] ~203-~203: There might be a mistake here.
Context: ...for MoE, so we set --ep to num_gpus. - When using enable_attention_dp, `max_b...

(QB_NEW_EN)


[grammar] ~375-~375: There might be a mistake here.
Context: ...arameters. - Add print_iter_log: true to extra LLM API options YAML file to insp...

(QB_NEW_EN)


[grammar] ~375-~375: There might be a mistake here.
Context: ...L file to inspect the per-iteration log. - For performance issues, check GPU utiliz...

(QB_NEW_EN)

🪛 markdownlint-cli2 (0.17.2)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md

23-23: Multiple spaces after hash on atx style heading

(MD019, no-multiple-space-atx)


115-115: Link fragments should be valid

(MD051, link-fragments)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@tensorrt-cicd
Collaborator

PR_Github #15457 [ skip ] completed with state SUCCESS
Skipping testing for commit 68c275f

@bobboli bobboli enabled auto-merge (squash) August 15, 2025 17:01
@bobboli bobboli disabled auto-merge August 15, 2025 17:01
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (3)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (3)

215-226: Great serve section; a couple of blocking shell issues to fix (line-continuation + wrong flag).

  • The inline comments after backslashes on Line 216 and Line 224 break the multi-line command. In bash, the backslash must be the last character on the line.
  • The flag on Line 223 should be --kv_cache_free_gpu_mem_fraction (to match bench commands and the CLI).

Apply this diff:

 trtllm-serve \
-  gpt-oss-120b \  # Or ${local_model_path}
+  gpt-oss-120b \
   --host 0.0.0.0 \
   --port 8000 \
   --backend pytorch \
   --tp_size ${num_gpus} \
   --ep_size 1  \
   --extra_llm_api_options low_latency.yaml \
-  --kv_cache_free_gpu_memory_fraction 0.9 \
-  --max_batch_size ${max_batch_size} \  # E.g., 1
+  --kv_cache_free_gpu_mem_fraction 0.9 \
+  --max_batch_size ${max_batch_size} \
   --trust_remote_code

233-244: Repeat the same fixes in the max-throughput serve command.

  • Remove the inline comment after the backslash (Line 234).
  • Use the canonical kv-cache flag name (Line 241).
  • Remove the inline comment after the backslash (Line 242).
 trtllm-serve \
-  gpt-oss-120b \  # Or ${local_model_path}
+  gpt-oss-120b \
   --host 0.0.0.0 \
   --port 8000 \
   --backend pytorch \
   --tp_size ${num_gpus} \
   --ep_size ${num_gpus} \
   --extra_llm_api_options max_throughput.yaml \
-  --kv_cache_free_gpu_memory_fraction 0.9 \
-  --max_batch_size ${max_batch_size} \  # E.g., 640 
+  --kv_cache_free_gpu_mem_fraction 0.9 \
+  --max_batch_size ${max_batch_size} \
   --trust_remote_code

114-114: Fix the broken intra-doc anchor to match the section title.

The linked section is “(H200/H100 Only) Using OpenAI Triton Kernels for MoE”, so the anchor should be #h200h100-only-using-openai-triton-kernels-for-moe.

-> Note: If you are using NVIDIA H200 GPUs it is highly recommended to set the `moe_config.backend` to TRITON to use the OpenAI Triton MoE kernel. See the section [(H200 Only) Using OpenAI Triton Kernels for MoE](#h200-only-using-openai-triton-kernels-for-moe) for more details.
+> Note: If you are using NVIDIA H200 GPUs it is highly recommended to set the `moe_config.backend` to TRITON to use the OpenAI Triton MoE kernel. See the section [(H200/H100 Only) Using OpenAI Triton Kernels for MoE](#h200h100-only-using-openai-triton-kernels-for-moe) for more details.
🧹 Nitpick comments (5)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (5)

22-22: Markdown lint: remove extra space in the heading.

-###  NGC Docker Image of dev branch
+### NGC Docker Image of dev branch

206-206: Minor readability: add spaces/clarify “>4GPUs”.

-... implementation for >4GPUs is suboptimal ...
+... implementation for > 4 GPUs is suboptimal ...

28-38: Optional: run container as the host user to avoid root-owned files in ~/.cache.

Mounting ~/.cache into /root/.cache as root can leave root-owned artifacts on the host. Consider running the container with the host UID/GID.

-docker run --rm --ipc=host -it \
+docker run --rm --ipc=host -it \
+  --user $(id -u):$(id -g) \
   --ulimit stack=67108864 \
   --ulimit memlock=-1 \
   --gpus all \
   -p 8000:8000 \
   -e TRTLLM_ENABLE_PDL=1 \
   -v ~/.cache:/root/.cache:rw \
   nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev \
   /bin/bash

215-226: Security note: only set --trust_remote_code if needed.

Recommend adding a short caution that --trust_remote_code executes arbitrary model code; users should enable it only when necessary and with trusted sources.


353-359: Nice clarification on Triton install; small wording tweak for clarity.

Consider rephrasing slightly for flow.

-The `nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev` has Triton installed already (`echo $TRITON_ROOT` could reveal the path). In other situations, you will need to build and install a specific version of Triton. Please follow the instructions in this [link](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gpt_oss#using-openai-triton-kernels-for-moe).
+The `nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev` image includes Triton already (you can verify with `echo $TRITON_ROOT`). If you’re not using the dev container, you’ll need to build/install a specific Triton version; follow the instructions here: [Using OpenAI Triton Kernels for MoE](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gpt_oss#using-openai-triton-kernels-for-moe).
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 2927e2a and c9a3623.

📒 Files selected for processing (1)
  • docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (4 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md

[grammar] ~3-~3: There might be a mistake here.
Context: ... walk you through how to launch your own high-performance TensorRT-LLM server for...

(QB_NEW_EN)


[grammar] ~12-~12: There might be a mistake here.
Context: ...for lower latency and higher throughput) - Fast SSD storage for model weights - Acc...

(QB_NEW_EN)


[grammar] ~13-~13: There might be a mistake here.
Context: ...ut) - Fast SSD storage for model weights - Access to the gpt-oss-120b model checkpo...

(QB_NEW_EN)


[grammar] ~41-~41: There might be a mistake here.
Context: ...n/bash ``` Explanation of the command: - Automatically removes the container when...

(QB_NEW_EN)


[grammar] ~42-~42: There might be a mistake here.
Context: ...oves the container when stopped (--rm) - Allows container to interact with the ho...

(QB_NEW_EN)


[grammar] ~43-~43: There might be a mistake here.
Context: ...y for optimal performance (--ipc=host) - Runs the container in interactive mode (...

(QB_NEW_EN)


[grammar] ~44-~44: There might be a mistake here.
Context: ...he container in interactive mode (-it) - Sets up shared memory and stack limits f...

(QB_NEW_EN)


[grammar] ~45-~45: There might be a mistake here.
Context: ...and stack limits for optimal performance - Maps port 8000 from the container to the...

(QB_NEW_EN)


[grammar] ~46-~46: There might be a mistake here.
Context: ...port 8000 from the container to the host - Enables PDL for performance optimization...

(QB_NEW_EN)


[style] ~111-~111: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...A graph padding. - moe_config.backend is set to TRTLLM to use the trtllm-gen...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[grammar] ~148-~148: There might be a mistake here.
Context: ...ghput that saturates the system's GPUs. Using input sequence length (isl) of 1k and o...

(QB_NEW_EN)


[grammar] ~148-~148: There might be a mistake here.
Context: ...Using input sequence length (isl) of 1k and output sequence length (osl) of 2k, we ...

(QB_NEW_EN)


[grammar] ~174-~174: There might be a mistake here.
Context: ... DP which is better for high throughput. - set stream_interval to 10 to stream re...

(QB_NEW_EN)


[grammar] ~202-~202: There might be a mistake here.
Context: ...for MoE, so we set --ep to num_gpus. - When using enable_attention_dp, `max_b...

(QB_NEW_EN)


[grammar] ~206-~206: Ensure spelling is correct
Context: ...ut the communication implementation for >4GPUs is suboptimal and we are actively worki...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~374-~374: There might be a mistake here.
Context: ...arameters. - Add print_iter_log: true to extra LLM API options YAML file to insp...

(QB_NEW_EN)


[grammar] ~374-~374: There might be a mistake here.
Context: ...L file to inspect the per-iteration log. - Check GPU utilization with nvidia-smi ...

(QB_NEW_EN)

🪛 markdownlint-cli2 (0.17.2)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md

22-22: Multiple spaces after hash on atx style heading

(MD019, no-multiple-space-atx)


114-114: Link fragments should be valid

(MD051, link-fragments)

🔇 Additional comments (1)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (1)

121-137: Double-specifying both --model and --model_path may be confusing.

It’s typically either a model ID or a local path. If both are accepted, fine; otherwise, we should pick one and document the alternative separately to avoid user error.

Would you like me to adjust the examples to provide two separate blocks (one with --model, one with --model_path) to remove ambiguity?

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (3)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (3)

216-216: Fix bash line-continuation: inline comments after backslashes break the command

The inline comment after the trailing backslash ends the continuation and will cause the next flags to be parsed as a separate command. Move comments to their own lines.

 trtllm-serve \
-  gpt-oss-120b \  # Or ${local_model_path}
+  gpt-oss-120b \
+  # Or, alternatively, pass a local path instead of the model name:
+  # ${local_model_path}
   --host 0.0.0.0 \
   --port 8000 \
   --backend pytorch \
   --tp_size ${num_gpus} \
   --ep_size 1  \
   --extra_llm_api_options low_latency.yaml \
-  --kv_cache_free_gpu_memory_fraction 0.9 \
-  --max_batch_size ${max_batch_size} \  # E.g., 1
+  --kv_cache_free_gpu_mem_fraction 0.9 \
+  --max_batch_size ${max_batch_size} \
+  # E.g., 1
   --trust_remote_code

Also applies to: 224-224


223-223: Use the correct kv-cache flag: --kv_cache_free_gpu_mem_fraction

Serve command uses an invalid flag spelling; bench uses the correct one. Align for consistency and to avoid runtime errors.

-  --kv_cache_free_gpu_memory_fraction 0.9 \
+  --kv_cache_free_gpu_mem_fraction 0.9 \

234-234: Fix same line-continuation issue in max-throughput serve command

Inline comments after backslashes break the multi-line command. Move them to separate lines.

 trtllm-serve \
-  gpt-oss-120b \  # Or ${local_model_path}
+  gpt-oss-120b \
+  # Or, alternatively, pass a local path instead of the model name:
+  # ${local_model_path}
   --host 0.0.0.0 \
   --port 8000 \
   --backend pytorch \
   --tp_size ${num_gpus} \
   --ep_size ${num_gpus} \
   --extra_llm_api_options max_throughput.yaml \
-  --kv_cache_free_gpu_memory_fraction 0.9 \
-  --max_batch_size ${max_batch_size} \  # E.g., 640 
+  --kv_cache_free_gpu_mem_fraction 0.9 \
+  --max_batch_size ${max_batch_size} \
+  # E.g., 640
   --trust_remote_code

Also applies to: 242-242

🧹 Nitpick comments (4)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (4)

210-214: Make env var dependencies explicit before serve commands

The serve snippets rely on num_gpus and max_batch_size defined earlier. Add a short preamble or reminder to set them to avoid copy/paste failures.

 ## Launch the TensorRT-LLM Server

 We can use `trtllm-serve` to serve the model by translating the benchmark commands above. For low-latency configuration, run:
+
+```bash
+# Ensure these are set (reuse values from the benchmarking section or adjust as needed)
+num_gpus=8
+max_batch_size=1
+```

22-22: Fix markdownlint MD019: remove double space after heading hashes

Single space after the hashes avoids markdownlint warnings and renders consistently.

-###  NGC Docker Image of dev branch
+### NGC Docker Image of dev branch

49-50: Clarify local_model_path usage with a concrete example

A brief example reduces confusion when users choose a local path over downloading weights.

 Additionally, the container mounts your user `.cache` directory to save the downloaded model checkpoints, which are stored in `~/.cache/huggingface/hub/` by default. This prevents having to redownload the weights each time you rerun the container. You can also download the weights to a custom location (we assume `${local_model_path}` is the path to the local model weights).
+
+For example:
+```bash
+local_model_path=/data/models/openai/gpt-oss-120b
+```

206-206: Minor wording nit: spacing around numeric comparator

“>4GPUs” reads better as “> 4 GPUs”.

-... implementation for >4GPUs is suboptimal ...
+... implementation for > 4 GPUs is suboptimal ...
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between c9a3623 and f0aa30f.

📒 Files selected for processing (1)
  • docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (4 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md

[grammar] ~3-~3: There might be a mistake here.
Context: ... walk you through how to launch your own high-performance TensorRT-LLM server for...

(QB_NEW_EN)


[grammar] ~12-~12: There might be a mistake here.
Context: ...for lower latency and higher throughput) - Fast SSD storage for model weights - Acc...

(QB_NEW_EN)


[grammar] ~13-~13: There might be a mistake here.
Context: ...ut) - Fast SSD storage for model weights - Access to the gpt-oss-120b model checkpo...

(QB_NEW_EN)


[grammar] ~41-~41: There might be a mistake here.
Context: ...n/bash ``` Explanation of the command: - Automatically removes the container when...

(QB_NEW_EN)


[grammar] ~42-~42: There might be a mistake here.
Context: ...oves the container when stopped (--rm) - Allows container to interact with the ho...

(QB_NEW_EN)


[grammar] ~43-~43: There might be a mistake here.
Context: ...y for optimal performance (--ipc=host) - Runs the container in interactive mode (...

(QB_NEW_EN)


[grammar] ~44-~44: There might be a mistake here.
Context: ...he container in interactive mode (-it) - Sets up shared memory and stack limits f...

(QB_NEW_EN)


[grammar] ~45-~45: There might be a mistake here.
Context: ...and stack limits for optimal performance - Maps port 8000 from the container to the...

(QB_NEW_EN)


[grammar] ~46-~46: There might be a mistake here.
Context: ...port 8000 from the container to the host - Enables PDL for performance optimization...

(QB_NEW_EN)


[style] ~111-~111: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...A graph padding. - moe_config.backend is set to TRTLLM to use the trtllm-gen...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[grammar] ~148-~148: There might be a mistake here.
Context: ...ghput that saturates the system's GPUs. Using input sequence length (isl) of 1k and o...

(QB_NEW_EN)


[grammar] ~148-~148: There might be a mistake here.
Context: ...Using input sequence length (isl) of 1k and output sequence length (osl) of 2k, we ...

(QB_NEW_EN)


[grammar] ~174-~174: There might be a mistake here.
Context: ... DP which is better for high throughput. - set stream_interval to 10 to stream re...

(QB_NEW_EN)


[grammar] ~202-~202: There might be a mistake here.
Context: ...for MoE, so we set --ep to num_gpus. - When using enable_attention_dp, `max_b...

(QB_NEW_EN)


[grammar] ~206-~206: Ensure spelling is correct
Context: ...ut the communication implementation for >4GPUs is suboptimal and we are actively worki...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~374-~374: There might be a mistake here.
Context: ...arameters. - Add print_iter_log: true to extra LLM API options YAML file to insp...

(QB_NEW_EN)


[grammar] ~374-~374: There might be a mistake here.
Context: ...L file to inspect the per-iteration log. - Check GPU utilization with nvidia-smi ...

(QB_NEW_EN)

🪛 markdownlint-cli2 (0.17.2)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md

22-22: Multiple spaces after hash on atx style heading

(MD019, no-multiple-space-atx)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (2)

216-216: Ensure model identifier is consistent with bench examples

Bench commands use “openai/gpt-oss-120b” while serve uses “gpt-oss-120b”. For clarity and fewer surprises, consider using the same HF repo id in serve, unless a local path is intended.

-  gpt-oss-120b \
+  openai/gpt-oss-120b \

If users choose a local path, keep the adjacent comment guidance (moved to its own line per the other fix).

Also applies to: 234-234


114-114: Anchor fix looks good

The intra-doc link now correctly targets “#h200h100-only-using-openai-triton-kernels-for-moe”.

@bobboli bobboli enabled auto-merge (squash) August 16, 2025 05:01
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (2)

215-226: Two serve-command bugs: broken line continuation and wrong kv-cache flag; also unify model identifier.

  • Trailing backslash followed by an inline comment breaks the shell command (Line 216). The same issue occurs on the --max_batch_size line (Line 224).
  • Flag name should be --kv_cache_free_gpu_mem_fraction (not ...memory...) for consistency with bench and CLI.
  • Use the same model identifier as benchmarks (openai/gpt-oss-120b) for consistency, or clearly show a local path alternative on its own lines.
 trtllm-serve \
-  gpt-oss-120b \  # Or ${local_model_path}
+  openai/gpt-oss-120b \
+  # Or, alternatively, pass a local path instead of the model name:
+  # ${local_model_path}
   --host 0.0.0.0 \
   --port 8000 \
   --backend pytorch \
   --tp_size ${num_gpus} \
   --ep_size 1  \
   --extra_llm_api_options low_latency.yaml \
-  --kv_cache_free_gpu_memory_fraction 0.9 \
-  --max_batch_size ${max_batch_size} \  # E.g., 1
+  --kv_cache_free_gpu_mem_fraction 0.9 \
+  # E.g., 1
+  --max_batch_size ${max_batch_size} \
   --trust_remote_code

233-244: Repeat fixes in max-throughput serve command: line continuation + kv-cache flag + model id.

Same issues as the low‑latency block: trailing backslashes with inline comments (Lines 234, 242) and the kv-cache flag name (Line 241). Also align the model identifier.

 trtllm-serve \
-  gpt-oss-120b \  # Or ${local_model_path}
+  openai/gpt-oss-120b \
+  # Or, alternatively, pass a local path instead of the model name:
+  # ${local_model_path}
   --host 0.0.0.0 \
   --port 8000 \
   --backend pytorch \
   --tp_size ${num_gpus} \
   --ep_size ${num_gpus} \
   --extra_llm_api_options max_throughput.yaml \
-  --kv_cache_free_gpu_memory_fraction 0.9 \
-  --max_batch_size ${max_batch_size} \  # E.g., 640 
+  --kv_cache_free_gpu_mem_fraction 0.9 \
+  # E.g., 640
+  --max_batch_size ${max_batch_size} \
   --trust_remote_code
🧹 Nitpick comments (5)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (5)

22-22: Fix markdownlint: remove extra space after heading marker.

Two spaces after the hashes violate MD019 and render oddly in some viewers.

-###  NGC Docker Image of dev branch
+### NGC Docker Image of dev branch

352-359: Good: Triton section title and install guidance are now clear and linked. Minor grammar polish.

  • Title and anchor alignment look correct, and the installation link is helpful.
  • Minor wording improvement to tighten phrasing.
-The `nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev` has prepared Triton already (`echo $TRITON_ROOT` could reveal the path). In other situations, you will need to build and install a specific version of Triton.
+The `nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev` container already includes Triton (`echo $TRITON_ROOT` can reveal the path). In other environments, you will need to build and install a specific version of Triton.

12-17: Clarify H100 positioning vs section below.

Prerequisites exclude H100 but later sections include “(H200/H100 Only)” and recommend Triton on Hopper (which includes H100). To avoid confusion, add a brief note here that while detailed H100 performance guidance is forthcoming, the MoE backend selection guidance applies to H100 too.

 - 1x NVIDIA B200/GB200/H200 GPU (more GPUs could be used for lower latency and higher throughput)
+ - 1x NVIDIA B200/GB200/H200 GPU (more GPUs could be used for lower latency and higher throughput)
+   Note: H100 guidance for peak performance is forthcoming; however, the MoE backend selection in this guide also applies to H100.
@@
-We have a forthcoming guide for achieving great performance on H100; however, this guide focuses on the GPUs listed above.
+We have a forthcoming guide for achieving great performance on H100; however, this guide otherwise focuses on the GPUs listed above.

206-206: Typo/spacing: “>4GPUs” → “> 4 GPUs”.

Improves readability.

-... implementation for >4GPUs is suboptimal ...
+... implementation for > 4 GPUs is suboptimal ...

95-112: Style: reduce repetitive phrasing in “Key takeaways.”

Three bullets start with “is set to …”. Consider tightening to avoid repetition.

-- `enable_attention_dp` is set to `false` to use TP instead of DP for attention.
-- `use_torch_sampler` is set to `true` to use the PyTorch sampler. While the `TRTLLM` sampler is the default, it currently has performance issues, so we use the PyTorch sampler instead.
-- `cuda_graph_config.max_batch_size` is the maximum batch size for CUDA graph.
-- `cuda_graph_config.enable_padding` is set to `true` to enable CUDA graph padding.
-- `moe_config.backend` is set to `TRTLLM` to use the `trtllm-gen` MoE kernels which are optimized for low concurrency.
+- `enable_attention_dp: false` uses TP instead of DP for attention.
+- `use_torch_sampler: true` selects the PyTorch sampler. While `TRTLLM` is the default, it currently has performance issues.
+- `cuda_graph_config.max_batch_size` defines the maximum batch size for CUDA graph; `cuda_graph_config.enable_padding: true` turns on CUDA graph padding.
+- `moe_config.backend: TRTLLM` uses the `trtllm-gen` MoE kernels, optimized for low concurrency.
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between f0aa30f and b6d43fc.

📒 Files selected for processing (1)
  • docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (4 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md

[grammar] ~3-~3: There might be a mistake here.
Context: ... walk you through how to launch your own high-performance TensorRT-LLM server for...

(QB_NEW_EN)


[grammar] ~12-~12: There might be a mistake here.
Context: ...for lower latency and higher throughput) - Fast SSD storage for model weights - Acc...

(QB_NEW_EN)


[grammar] ~13-~13: There might be a mistake here.
Context: ...ut) - Fast SSD storage for model weights - Access to the gpt-oss-120b model checkpo...

(QB_NEW_EN)


[grammar] ~41-~41: There might be a mistake here.
Context: ...n/bash ``` Explanation of the command: - Automatically removes the container when...

(QB_NEW_EN)


[grammar] ~42-~42: There might be a mistake here.
Context: ...oves the container when stopped (--rm) - Allows container to interact with the ho...

(QB_NEW_EN)


[grammar] ~43-~43: There might be a mistake here.
Context: ...y for optimal performance (--ipc=host) - Runs the container in interactive mode (...

(QB_NEW_EN)


[grammar] ~44-~44: There might be a mistake here.
Context: ...he container in interactive mode (-it) - Sets up shared memory and stack limits f...

(QB_NEW_EN)


[grammar] ~45-~45: There might be a mistake here.
Context: ...and stack limits for optimal performance - Maps port 8000 from the container to the...

(QB_NEW_EN)


[grammar] ~46-~46: There might be a mistake here.
Context: ...port 8000 from the container to the host - Enables PDL for performance optimization...

(QB_NEW_EN)


[style] ~111-~111: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...A graph padding. - moe_config.backend is set to TRTLLM to use the trtllm-gen...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[grammar] ~148-~148: There might be a mistake here.
Context: ...ghput that saturates the system's GPUs. Using input sequence length (isl) of 1k and o...

(QB_NEW_EN)


[grammar] ~148-~148: There might be a mistake here.
Context: ...Using input sequence length (isl) of 1k and output sequence length (osl) of 2k, we ...

(QB_NEW_EN)


[grammar] ~174-~174: There might be a mistake here.
Context: ... DP which is better for high throughput. - set stream_interval to 10 to stream re...

(QB_NEW_EN)


[grammar] ~202-~202: There might be a mistake here.
Context: ...for MoE, so we set --ep to num_gpus. - When using enable_attention_dp, `max_b...

(QB_NEW_EN)


[grammar] ~206-~206: Ensure spelling is correct
Context: ...ut the communication implementation for >4GPUs is suboptimal and we are actively worki...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~374-~374: There might be a mistake here.
Context: ...arameters. - Add print_iter_log: true to extra LLM API options YAML file to insp...

(QB_NEW_EN)


[grammar] ~374-~374: There might be a mistake here.
Context: ...L file to inspect the per-iteration log. - Check GPU utilization with nvidia-smi ...

(QB_NEW_EN)

🪛 markdownlint-cli2 (0.17.2)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md

22-22: Multiple spaces after hash on atx style heading

(MD019, no-multiple-space-atx)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (2)

114-114: Good: intra-doc link anchor matches section title.

The fragment #h200h100-only-using-openai-triton-kernels-for-moe correctly targets the “(H200/H100 Only) …” section.


121-137: Sanity check passed: CLI flag and dataset script path are correct

  • Verified that trtllm-bench uses --kv_cache_free_gpu_mem_fraction (not “memory”).
  • Confirmed the dataset preparation script lives at benchmarks/cpp/prepare_dataset.py.

No changes required.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (6)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (6)

213-213: Bash line-continuation bug: trailing backslash followed by an inline comment breaks the command

Drop the inline comment (or hoist it above the command) so the backslash is the last character on the line; note that a bare comment line between the continued lines would still break the command. Mention the `${local_model_path}` alternative in the surrounding prose instead. A minimal reproduction of the pitfall follows after this list.

-  gpt-oss-120b \  # Or ${local_model_path}
+  gpt-oss-120b \

221-221: Same line-continuation issue on --max_batch_size line

Inline comment after a trailing backslash breaks the command; drop it and keep the example value (e.g., 1) in the surrounding prose.

-  --max_batch_size ${max_batch_size} \  # E.g., 1
+  --max_batch_size ${max_batch_size} \

220-220: CLI flag mismatch: use --kv_cache_free_gpu_mem_fraction (not ...memory...)

Keep this consistent with bench commands and the CLI.

-  --kv_cache_free_gpu_memory_fraction 0.9 \
+  --kv_cache_free_gpu_mem_fraction 0.9 \

231-231: Repeat line-continuation fix in max-throughput serve command

The inline comment after the backslash breaks the multi-line command; apply the same fix here.

-  gpt-oss-120b \  # Or ${local_model_path}
+  gpt-oss-120b \

239-239: Same line-continuation issue on --max_batch_size (throughput variant)

Drop the inline example so the backslash stays the last character on the line; keep the example value (e.g., 640) in the surrounding prose.

-  --max_batch_size ${max_batch_size} \  # E.g., 640
+  --max_batch_size ${max_batch_size} \

238-238: Fix kv-cache flag name in max-throughput serve command

Align with bench usage and correct flag spelling.

-  --kv_cache_free_gpu_memory_fraction 0.9 \
+  --kv_cache_free_gpu_mem_fraction 0.9 \
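To make the failure mode behind these suggestions concrete, here is a minimal, self-contained reproduction; it uses `echo` as a stand-in for the real serve command, purely for illustration:

```bash
# Broken: the backslash followed by a space escapes the space (not the newline),
# the `#` then starts a comment, so the continuation ends here and the next line
# is executed as a separate command ("two: command not found"):
#
#   echo one \  # inline comment
#     two
#
# Working: the backslash is the last character on the line, and commentary goes
# above the command rather than inline.
echo one \
  two
```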
🧹 Nitpick comments (4)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (4)

22-22: Fix markdownlint MD019: remove extra space in heading

Atx headings should have a single space after the hashes.

-###  NGC Docker Image of dev branch
+### NGC Docker Image of dev branch

49-50: Define local_model_path before use to avoid confusion

You reference ${local_model_path} in later commands but never show how to set it. Add a short example to make copy/paste flows work; a hedged sketch of pre-downloading the weights also follows after this list.

 Additionally, the container mounts your user `.cache` directory to save the downloaded model checkpoints, which are stored in `~/.cache/huggingface/hub/` by default. This prevents having to redownload the weights each time you rerun the container. You can also download the weights to a custom location (we assume `${local_model_path}` is the path to the local model weights).
+
+For example:
+
+```bash
+# If you've pre-downloaded weights to a local directory:
+export local_model_path=/data/models/gpt-oss-120b
+```

107-107: Typo in bullet: stray "s-" prefix

Remove the stray character to keep list formatting consistent.

-s- `cuda_graph_config.max_batch_size` is the maximum batch size for CUDA graph.
+- `cuda_graph_config.max_batch_size` is the maximum batch size for CUDA graph.

119-123: Clarify model vs model_path precedence in trtllm-bench

You pass both --model and --model_path. If both are supported, a brief note on precedence helps avoid user confusion; otherwise, show two variants (remote vs local).

Example clarity:

  • Remote weights:
    trtllm-bench ... --model openai/gpt-oss-120b ...
  • Local weights:
    trtllm-bench ... --model_path ${local_model_path} ...
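Related to the first nitpick above, here is a hedged sketch of how `${local_model_path}` could be pre-populated; the target directory and the use of the Hugging Face CLI are assumptions, not something the PR prescribes:

```bash
# Download the checkpoint once so later runs can point at a local path.
# The target directory is illustrative; requires the huggingface_hub CLI.
export local_model_path=/data/models/gpt-oss-120b
huggingface-cli download openai/gpt-oss-120b --local-dir "${local_model_path}"
```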
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between b6d43fc and 232c738.

📒 Files selected for processing (1)
  • docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (4 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md

[grammar] ~3-~3: There might be a mistake here.
Context: ... walk you through how to launch your own high-performance TensorRT-LLM server for...

(QB_NEW_EN)


[grammar] ~12-~12: There might be a mistake here.
Context: ...for lower latency and higher throughput) - Fast SSD storage for model weights - Acc...

(QB_NEW_EN)


[grammar] ~13-~13: There might be a mistake here.
Context: ...ut) - Fast SSD storage for model weights - Access to the gpt-oss-120b model checkpo...

(QB_NEW_EN)


[grammar] ~41-~41: There might be a mistake here.
Context: ...n/bash ``` Explanation of the command: - Automatically removes the container when...

(QB_NEW_EN)


[grammar] ~42-~42: There might be a mistake here.
Context: ...oves the container when stopped (--rm) - Allows container to interact with the ho...

(QB_NEW_EN)


[grammar] ~43-~43: There might be a mistake here.
Context: ...y for optimal performance (--ipc=host) - Runs the container in interactive mode (...

(QB_NEW_EN)


[grammar] ~44-~44: There might be a mistake here.
Context: ...he container in interactive mode (-it) - Sets up shared memory and stack limits f...

(QB_NEW_EN)


[grammar] ~45-~45: There might be a mistake here.
Context: ...and stack limits for optimal performance - Maps port 8000 from the container to the...

(QB_NEW_EN)


[grammar] ~46-~46: There might be a mistake here.
Context: ...port 8000 from the container to the host - Enables PDL for performance optimization...

(QB_NEW_EN)


[grammar] ~106-~106: There might be a mistake here.
Context: ...to use TP instead of DP for attention. s-cuda_graph_config.max_batch_size` is...

(QB_NEW_EN)


[grammar] ~146-~146: There might be a mistake here.
Context: ...ghput that saturates the system's GPUs. Using input sequence length (isl) of 1k and o...

(QB_NEW_EN)


[grammar] ~146-~146: There might be a mistake here.
Context: ...Using input sequence length (isl) of 1k and output sequence length (osl) of 2k, we ...

(QB_NEW_EN)


[grammar] ~171-~171: There might be a mistake here.
Context: ... DP which is better for high throughput. - set stream_interval to 10 to stream re...

(QB_NEW_EN)


[grammar] ~199-~199: There might be a mistake here.
Context: ...for MoE, so we set --ep to num_gpus. - When using enable_attention_dp, `max_b...

(QB_NEW_EN)


[grammar] ~203-~203: Ensure spelling is correct
Context: ...ut the communication implementation for >4GPUs is suboptimal and we are actively worki...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~371-~371: There might be a mistake here.
Context: ...arameters. - Add print_iter_log: true to extra LLM API options YAML file to insp...

(QB_NEW_EN)


[grammar] ~371-~371: There might be a mistake here.
Context: ...L file to inspect the per-iteration log. - Check GPU utilization with nvidia-smi ...

(QB_NEW_EN)

🪛 markdownlint-cli2 (0.17.2)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md

22-22: Multiple spaces after hash on atx style heading

(MD019, no-multiple-space-atx)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (2)

112-112: Good fix: intra-doc link anchor now matches the section title

Anchor fragment #h200h100-only-using-openai-triton-kernels-for-moe correctly aligns with the section header.


353-356: Nice clarification on Triton availability and installation link

This addresses the earlier feedback: dev container includes Triton, and external users get a clear installation path.

@bobboli
Collaborator Author

bobboli commented Aug 18, 2025

/bot skip --comment "doc update"

1 similar comment
@bobboli
Collaborator Author

bobboli commented Aug 18, 2025

/bot skip --comment "doc update"

@tensorrt-cicd
Collaborator

PR_Github #15570 [ skip ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15570 [ skip ] completed with state SUCCESS
Skipping testing for commit 232c738

@bobboli
Copy link
Collaborator Author

bobboli commented Aug 18, 2025

/bot skip --comment "doc update"

@tensorrt-cicd
Collaborator

PR_Github #15582 [ skip ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15582 [ skip ] completed with state SUCCESS
Skipping testing for commit 16876d5

@bobboli bobboli merged commit 8b05b5d into NVIDIA:main Aug 18, 2025
4 checks passed