[None][doc] Update gpt oss doc #6954
Conversation
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
📝 Walkthrough

Rewrites the GPT‑OSS deployment blog into an install → benchmark → serve workflow: adds Day‑0 GPT‑OSS support, GPU‑centric prerequisites, multiple install paths (NGC dev image, build from source, releases), replaces mpirun-based serve/bench steps with trtllm-bench and env-driven configs, expands weight/cache handling, adds H200/Triton MoE guidance, and updates testing and troubleshooting.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Installer
    participant Cache as ModelCache
    participant Bench as trtllm-bench
    participant Server as TRT-LLM Server
    User->>Installer: Install TensorRT-LLM (NGC/dev image, pip, or build from source)
    User->>User: Configure env vars (max_batch_size, num_gpus, local_model_path)
    User->>Cache: Populate ~/.cache or ${local_model_path} with weights
    User->>Bench: Run trtllm-bench (dataset, concurrency, batch size)
    Bench->>Cache: Load model/weights
    Bench-->>User: Report tps/user and tps/gpu
    User->>Server: Launch TRT-LLM Server (docker run / trtllm-serve)
    User->>Server: Send sample request (curl)
    Server-->>User: Return JSON response
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
/bot skip --comment "doc update"
Actionable comments posted: 4
🔭 Outside diff range comments (1)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (1)
277-346: Replace unrealistic example response with a concise, valid OpenAI-style JSON

The current example includes non-API tokens (“<|channel|>analysis…”) and an excessively long, markdown-heavy response that may confuse users. Provide a minimal, realistic response that mirrors the server’s schema.
Proposed replacement:
````diff
-```bash
-{
-  "id": "chatcmpl-c440e2a3e7e14cd699295afc3739bf42",
-  "object": "chat.completion",
-  "created": 1754358426,
-  "model": "openai/gpt-oss-120b",
-  "choices": [
-    {
-      "index": 0,
-      "message": {
-        "role": "assistant",
-        "content": "<|channel|>analysis<|message|>The user asks: \"What is NVIDIA's advantage for inference?\" ...
-        ... (omitted for brevity) ...
-      },
-      "logprobs": null,
-      "finish_reason": "length",
-      "stop_reason": null,
-      "disaggregated_params": null
-    }
-  ],
-  "usage": {
-    "prompt_tokens": 17,
-    "total_tokens": 1041,
-    "completion_tokens": 1024
-  },
-  "prompt_token_ids": null
-}
-```
+```bash
+{
+  "id": "chatcmpl-1234567890",
+  "object": "chat.completion",
+  "created": 1721000000,
+  "model": "openai/gpt-oss-120b",
+  "choices": [
+    {
+      "index": 0,
+      "message": {
+        "role": "assistant",
+        "content": "NVIDIA’s inference advantage comes from specialized Tensor Cores, the TensorRT compiler/runtime, and an optimized software stack (CUDA, cuDNN, Triton) that together deliver high throughput at low latency."
+      },
+      "finish_reason": "stop"
+    }
+  ],
+  "usage": {
+    "prompt_tokens": 17,
+    "completion_tokens": 42,
+    "total_tokens": 59
+  }
+}
+```
````
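For context, a request along these lines would produce a response of this shape (a sketch only — it assumes the server is reachable on `localhost:8000` and exposes the standard OpenAI-compatible `/v1/chat/completions` route, as in the blog's serve examples):

```bash
# Illustrative request; adjust host/port and max_tokens as needed.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "openai/gpt-oss-120b",
        "messages": [
          {"role": "user", "content": "What is NVIDIA'\''s advantage for inference?"}
        ],
        "max_tokens": 128
      }'
```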
🧹 Nitpick comments (8)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (8)
54-55: Grammar: “is has been” → “has been”

Minor but visible grammar issue.

```diff
-The support for gpt-oss is has been [merged](https://github.com/NVIDIA/TensorRT-LLM/pull/6645) into the **main branch** of TensorRT-LLM.
+Support for gpt-oss has been [merged](https://github.com/NVIDIA/TensorRT-LLM/pull/6645) into the **main branch** of TensorRT-LLM.
```
27-28: Grammar: “follow docker command” → “following docker command”

```diff
-Run the follow docker command to start the TensorRT-LLM container in interactive mode:
+Run the following docker command to start the TensorRT-LLM container in interactive mode:
```
23-23: Markdown style: remove extra space after heading marker

```diff
-###  NGC Docker Image of dev branch
+### NGC Docker Image of dev branch
```
140-141: Typo: “requeests” → “requests”; tighten the wording

Also reads better as “a sufficient number of requests.”

```diff
-`--max_batch_size` controls the maximum batch size that the inference engine could serve, while `--concurrency` is the number of concurrent requests that the benchmarking client is sending. `--num_requests` is set to 10 times of `--concurrency` to run enough number of requeests.
+`--max_batch_size` controls the maximum batch size the inference engine can serve, while `--concurrency` is the number of concurrent requests the benchmarking client sends. `--num_requests` is set to 10× `--concurrency` to run a sufficient number of requests.
```
174-178: Grammar and tone: bullet intros should be parallel and capitalized

“Compare to” → “Compared to”; start each bullet with “Set …” for consistency.

```diff
-Compare to the low-latency configuration, we:
-- set `enable_attention_dp` to `true` to use attention DP which is better for high throughput.
-- set `stream_interval` to 10 to stream the results to the client every 10 tokens. At high concurrency the detokenization overhead of the streaming mode cannot be hidden under GPU execution time, `stream_interval` is a workaround to reduce the overhead.
-- set `moe_config.backend` to `CUTLASS` to use the `CUTLASS` MoE kernels which are optimized for high throughput.
+Compared to the low-latency configuration, we:
+- Set `enable_attention_dp` to `true` to use attention DP, which is better for high throughput.
+- Set `stream_interval` to 10 to stream the results to the client every 10 tokens. At high concurrency, the detokenization overhead of streaming cannot be hidden under GPU execution time; `stream_interval` mitigates the overhead.
+- Set `moe_config.backend` to `CUTLASS` to use MoE kernels optimized for high throughput.
```
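Taken together, these settings would land in the extra LLM API options file roughly as follows (a sketch only — it assumes the dotted `moe_config.backend` key maps to a nested YAML section, consistent with the `max_throughput.yaml` referenced in the serve commands):

```yaml
# max_throughput.yaml (sketch)
enable_attention_dp: true   # attention DP is better for high throughput
stream_interval: 10         # stream results to the client every 10 tokens
moe_config:
  backend: CUTLASS          # MoE kernels optimized for high throughput
```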
269-269: Typo in sample request payload: missing apostrophe

JSON example should read “NVIDIA's advantage”.

```diff
-        "content": "What is NVIDIAs advantage for inference?"
+        "content": "What is NVIDIA's advantage for inference?"
```
50-51: Minor readability: add commas and possessive

Nit, but this sentence reads more naturally with punctuation and possessive.

```diff
-Lastly the container mounts your user `.cache` directory to save the downloaded model checkpoints which are saved to `~/.cache/huggingface/hub/` by default.
+Lastly, the container mounts your user's `.cache` directory to save downloaded model checkpoints, which are stored in `~/.cache/huggingface/hub/` by default.
```
367-374: Minor grammar in Troubleshooting bullets

A few small tweaks improve clarity.

```diff
-- Add `print_iter_log: true` to extra LLM API options YAML file to inspect the per-iteration log.
+- Add `print_iter_log: true` to the extra LLM API options YAML file to inspect the per-iteration log.
-- For performance issues, check GPU utilization with `nvidia-smi` while the server is running
+- For performance issues, check GPU utilization with `nvidia-smi` while the server is running.
-- If the container fails to start, verify that the NVIDIA Container Toolkit is properly installed
+- If the container fails to start, verify that the NVIDIA Container Toolkit is properly installed.
-- For connection issues, make sure port 8000 is not being used by another application
+- For connection issues, make sure port 8000 is not being used by another application.
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (3 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md
[grammar] ~13-~13: There might be a mistake here.
Context: ...for lower latency and higher throughput) - Fast SSD storage for model weights - Acc...
(QB_NEW_EN)
[grammar] ~14-~14: There might be a mistake here.
Context: ...ut) - Fast SSD storage for model weights - Access to the gpt-oss-120b model checkpo...
(QB_NEW_EN)
[grammar] ~42-~42: There might be a mistake here.
Context: ...n/bash ``` Explanation of the command: - Automatically removes the container when...
(QB_NEW_EN)
[style] ~112-~112: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...A graph padding. - moe_config.backend is set to TRTLLM to use the trtllm-gen...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[grammar] ~144-~144: There might be a mistake here.
Context: ...ieve 420 tps/user with 8x B200 GPUs and max batch size 1. ### Max-Throughput ...
(QB_NEW_EN)
[grammar] ~149-~149: There might be a mistake here.
Context: ...1k and osl=2k dataset, currently we can achieve batch size 640 with 8x B200 GPUs. ```b...
(QB_NEW_EN)
[grammar] ~175-~175: There might be a mistake here.
Context: ... DP which is better for high throughput. - set stream_interval to 10 to stream th...
(QB_NEW_EN)
[grammar] ~176-~176: There might be a mistake here.
Context: ... is a workaround to reduce the overhead. - set moe_config.backend to CUTLASS to...
(QB_NEW_EN)
[grammar] ~203-~203: There might be a mistake here.
Context: ...for MoE, so we set --ep to num_gpus. - When using enable_attention_dp, `max_b...
(QB_NEW_EN)
[grammar] ~370-~370: There might be a mistake here.
Context: ...arameters. - Add print_iter_log: true to extra LLM API options YAML file to insp...
(QB_NEW_EN)
[grammar] ~370-~370: There might be a mistake here.
Context: ...L file to inspect the per-iteration log. - For performance issues, check GPU utiliz...
(QB_NEW_EN)
🪛 markdownlint-cli2 (0.17.2)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md
23-23: Multiple spaces after hash on atx style heading
(MD019, no-multiple-space-atx)
🔇 Additional comments (4)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (4)
62-66: Solid end-to-end restructuring and clarity improvements

The doc now reads like a reproducible workflow (install → benchmark → serve), with clear separation of low-latency vs. throughput paths and environment-driven configs. Nice work consolidating benchmarking around trtllm-bench and calling out MoE/H200 guidance explicitly.
203-208: Verify “DP4EP4 with 4x B200 GPUs” claim and terminology

“DP4EP4 using 4x B200 GPUs” reads as 16-way parallelism (4×4). If “DP” here refers to attention data-parallel shards (not classic replica DP), the phrasing can mislead. Please confirm the intended configuration and consider clarifying the meaning of DP in this context (e.g., “attention DP degree 4 with EP 4 on 4 GPUs”).
Would you like me to propose a short clarification sentence once you confirm the exact parallelism topology?
118-138: Confirm subcommand alignment: low-latency section uses `throughput` mode

For low-latency tps/user evaluation, confirm `trtllm-bench throughput` is the intended subcommand (vs. a dedicated latency mode, if available). If “throughput” is correct, consider a brief note that tps/user at concurrency=1 is a good proxy for min-latency. If there is a latency-specific mode, I can help rework the snippet accordingly.
69-79: Confirmed: Hugging Face handle openai/gpt-oss-120b is valid and publicly accessible

Verified: the handle exists on Hugging Face (model page and models API returned HTTP 200). The tokenizer/model references in the doc are correct and metadata is publicly reachable (no special auth required to view the model page/API).
Locations referencing the handle (for your awareness):
- docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md — lines ~72 (tokenizer), 120–130, 180–190 (trtllm-bench --model), ~217, ~235, ~265, ~284 (JSON/examples)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py — around lines 2490, 2521, 2559, 2589 (local gpt-oss references)
- examples/models/core/gpt_oss/README.md
- README.md (blog link)
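If you want to reproduce this availability check yourself, one quick way (not part of the doc) is to query the public Hugging Face models API and confirm an HTTP 200:

```bash
# Prints "200" when the model metadata is publicly reachable.
curl -s -o /dev/null -w "%{http_code}\n" \
  https://huggingface.co/api/models/openai/gpt-oss-120b
```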
PR_Github #15457 [ skip ] triggered by Bot
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Actionable comments posted: 1
♻️ Duplicate comments (2)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (2)
217-227: Fix broken multi-line command (line-continuation) and incorrect KV-cache flag in low-latency serve example
- Inline comments after a trailing backslash break shell line-continuation (Lines 217 and 225).
- Flag name should be `--kv_cache_free_gpu_mem_fraction` to match CLI and earlier examples (Line 224).

Apply:

```diff
-  gpt-oss-120b \ # Or ${local_model_path}
+  gpt-oss-120b \
+  # Or, alternatively, pass a local path instead of the model name:
+  # ${local_model_path}
@@
-  --kv_cache_free_gpu_memory_fraction 0.9 \
+  --kv_cache_free_gpu_mem_fraction 0.9 \
@@
-  --max_batch_size ${max_batch_size} \ # E.g., 1
+  --max_batch_size ${max_batch_size} \
+  # E.g., 1
```
234-245: Repeat: fix line-continuation and KV-cache flag in max-throughput serve example

The same two issues appear here (Lines 235, 242, and 243).

```diff
-  gpt-oss-120b \ # Or ${local_model_path}
+  gpt-oss-120b \
+  # Or, alternatively, pass a local path instead of the model name:
+  # ${local_model_path}
@@
-  --kv_cache_free_gpu_memory_fraction 0.9 \
+  --kv_cache_free_gpu_mem_fraction 0.9 \
@@
-  --max_batch_size ${max_batch_size} \ # E.g., 640
+  --max_batch_size ${max_batch_size} \
+  # E.g., 640
```
🧹 Nitpick comments (5)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (5)
23-23: Fix markdown heading spacing (lint: MD019)

Remove the extra space after the ATX heading marker.

```diff
-###  NGC Docker Image of dev branch
+### NGC Docker Image of dev branch
```
54-54: Grammar fix and clarity improvements

```diff
-The support for gpt-oss is has been [merged](https://github.com/NVIDIA/TensorRT-LLM/pull/6645) into the **main branch** of TensorRT-LLM. We are continuing to optimize the performance of gpt-oss, you can build the TensorRT-LLM from source to get the latest features and support. Please refer to the [doc](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html) if you want to build from source by yourself.
+The support for gpt-oss has been [merged](https://github.com/NVIDIA/TensorRT-LLM/pull/6645) into the **main branch** of TensorRT-LLM. We are continuing to optimize the performance of gpt-oss; you can build TensorRT-LLM from source to get the latest features and support. Please refer to the [doc](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html) if you want to build from source yourself.
```
140-140: Typos and concise phrasing in benchmark description

```diff
-`--max_batch_size` controls the maximum batch size that the inference engine could serve, while `--concurrency` is the number of concurrent requests that the benchmarking client is sending. `--num_requests` is set to 10 times of `--concurrency` to run enough number of requeests.
+`--max_batch_size` controls the maximum batch size that the inference engine can serve, while `--concurrency` is the number of concurrent requests that the benchmarking client is sending. `--num_requests` is set to 10× `--concurrency` to run enough requests.
```
174-176: Grammar: “Compared to”

```diff
-Compare to the low-latency configuration, we:
+Compared to the low-latency configuration, we:
```
355-360: Clarify Triton install vs preinstalled kernels in NGC image

The text first states the NGC container already includes Triton MoE kernels (no install needed), then immediately instructs building/installing Triton. Clarify this to avoid confusion.

```diff
-OpenAI ships a set of Triton kernels optimized for its MoE models. TensorRT-LLM can leverage these kernels for Hopper based GPUs like NVIDIA's H200 for best performance. The NGC TensorRT-LLM container image mentioned above already includes the required kernels so you do not need to build or install them. It is highly recommended to enable them with the steps below:
+OpenAI ships a set of Triton kernels optimized for its MoE models. TensorRT-LLM can leverage these kernels for Hopper-based GPUs like NVIDIA's H200 for best performance. The NGC TensorRT-LLM container image mentioned above already includes the required kernels, so you typically do not need to build or install them. If you are building from source or using a base image without these kernels, follow the steps below:
@@
-### Installing OpenAI Triton
+### Triton installation (only if building from source)
@@
-You will need to build and install a specific version of Triton. Please follow the instructions in this [link](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gpt_oss#using-openai-triton-kernels-for-moe).
+If you are building TensorRT-LLM from source or using a base image without the prebuilt kernels, build and install the required Triton version by following this [guide](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gpt_oss#using-openai-triton-kernels-for-moe). Otherwise (when using the NGC container above), you can skip this step.
```

Please confirm the NGC dev image tag indeed includes the Triton MoE kernels for the gpt-oss path so the “skip” guidance is accurate for all users of that image.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (3 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md
[grammar] ~13-~13: There might be a mistake here.
Context: ...for lower latency and higher throughput) - Fast SSD storage for model weights - Acc...
(QB_NEW_EN)
[grammar] ~14-~14: There might be a mistake here.
Context: ...ut) - Fast SSD storage for model weights - Access to the gpt-oss-120b model checkpo...
(QB_NEW_EN)
[grammar] ~42-~42: There might be a mistake here.
Context: ...n/bash ``` Explanation of the command: - Automatically removes the container when...
(QB_NEW_EN)
[grammar] ~43-~43: There might be a mistake here.
Context: ...oves the container when stopped (--rm) - Allows container to interact with the ho...
(QB_NEW_EN)
[grammar] ~44-~44: There might be a mistake here.
Context: ...y for optimal performance (--ipc=host) - Runs the container in interactive mode (...
(QB_NEW_EN)
[grammar] ~45-~45: There might be a mistake here.
Context: ...he container in interactive mode (-it) - Sets up shared memory and stack limits f...
(QB_NEW_EN)
[grammar] ~46-~46: There might be a mistake here.
Context: ...and stack limits for optimal performance - Maps port 8000 from the container to the...
(QB_NEW_EN)
[grammar] ~47-~47: There might be a mistake here.
Context: ...port 8000 from the container to the host - Enables PDL for low-latency perf optimiz...
(QB_NEW_EN)
[style] ~112-~112: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...A graph padding. - moe_config.backend is set to TRTLLM to use the trtllm-gen...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[grammar] ~144-~144: There might be a mistake here.
Context: ...ieve 420 tps/user with 8x B200 GPUs and max batch size 1. ### Max-Throughput ...
(QB_NEW_EN)
[grammar] ~149-~149: There might be a mistake here.
Context: ...1k and osl=2k dataset, currently we can achieve batch size 640 with 8x B200 GPUs. ```b...
(QB_NEW_EN)
[grammar] ~175-~175: There might be a mistake here.
Context: ... DP which is better for high throughput. - set stream_interval to 10 to stream th...
(QB_NEW_EN)
[grammar] ~176-~176: There might be a mistake here.
Context: ... is a workaround to reduce the overhead. - set moe_config.backend to CUTLASS to...
(QB_NEW_EN)
[grammar] ~203-~203: There might be a mistake here.
Context: ...for MoE, so we set --ep to num_gpus. - When using enable_attention_dp, `max_b...
(QB_NEW_EN)
[grammar] ~375-~375: There might be a mistake here.
Context: ...arameters. - Add print_iter_log: true to extra LLM API options YAML file to insp...
(QB_NEW_EN)
[grammar] ~375-~375: There might be a mistake here.
Context: ...L file to inspect the per-iteration log. - For performance issues, check GPU utiliz...
(QB_NEW_EN)
🪛 markdownlint-cli2 (0.17.2)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md
23-23: Multiple spaces after hash on atx style heading
(MD019, no-multiple-space-atx)
115-115: Link fragments should be valid
(MD051, link-fragments)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
PR_Github #15457 [ skip ] completed with state
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Actionable comments posted: 0
♻️ Duplicate comments (3)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (3)
215-226: Great serve section; a couple of blocking shell issues to fix (line-continuation + wrong flag).
- The inline comments after backslashes on Line 216 and Line 224 break the multi-line command. In bash, the backslash must be the last character on the line.
- The flag on Line 223 should be --kv_cache_free_gpu_mem_fraction (to match bench commands and the CLI).
Apply this diff:
```diff
 trtllm-serve \
-  gpt-oss-120b \ # Or ${local_model_path}
+  gpt-oss-120b \
   --host 0.0.0.0 \
   --port 8000 \
   --backend pytorch \
   --tp_size ${num_gpus} \
   --ep_size 1 \
   --extra_llm_api_options low_latency.yaml \
-  --kv_cache_free_gpu_memory_fraction 0.9 \
-  --max_batch_size ${max_batch_size} \ # E.g., 1
+  --kv_cache_free_gpu_mem_fraction 0.9 \
+  --max_batch_size ${max_batch_size} \
   --trust_remote_code
```
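For readers unfamiliar with the failure mode, here is a minimal standalone illustration (plain `echo`, nothing TRT-LLM-specific) of why an inline comment after a trailing backslash ends the continuation:

```bash
# Broken: the backslash now escapes the following space instead of the newline,
# so the continuation ends and the next line runs as a separate command.
echo one \ # inline comment
two        # -> "two: command not found"

# Fixed: keep the backslash as the last character on the line.
echo one \
  two      # -> prints "one two"
```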
233-244: Repeat the same fixes in the max-throughput serve command.
- Remove the inline comment after the backslash (Line 234).
- Use the canonical kv-cache flag name (Line 241).
- Remove the inline comment after the backslash (Line 242).
```diff
 trtllm-serve \
-  gpt-oss-120b \ # Or ${local_model_path}
+  gpt-oss-120b \
   --host 0.0.0.0 \
   --port 8000 \
   --backend pytorch \
   --tp_size ${num_gpus} \
   --ep_size ${num_gpus} \
   --extra_llm_api_options max_throughput.yaml \
-  --kv_cache_free_gpu_memory_fraction 0.9 \
-  --max_batch_size ${max_batch_size} \ # E.g., 640
+  --kv_cache_free_gpu_mem_fraction 0.9 \
+  --max_batch_size ${max_batch_size} \
   --trust_remote_code
```
114-114: Fix the broken intra-doc anchor to match the section title.

The linked section is “(H200/H100 Only) Using OpenAI Triton Kernels for MoE”, so the anchor should be `#h200h100-only-using-openai-triton-kernels-for-moe`.

```diff
-> Note: If you are using NVIDIA H200 GPUs it is highly recommended to set the `moe_config.backend` to TRITON to use the OpenAI Triton MoE kernel. See the section [(H200 Only) Using OpenAI Triton Kernels for MoE](#h200-only-using-openai-triton-kernels-for-moe) for more details.
+> Note: If you are using NVIDIA H200 GPUs it is highly recommended to set the `moe_config.backend` to TRITON to use the OpenAI Triton MoE kernel. See the section [(H200/H100 Only) Using OpenAI Triton Kernels for MoE](#h200h100-only-using-openai-triton-kernels-for-moe) for more details.
```
🧹 Nitpick comments (5)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (5)
22-22: Markdown lint: remove extra space in the heading.

```diff
-###  NGC Docker Image of dev branch
+### NGC Docker Image of dev branch
```
206-206: Minor readability: add spaces/clarify “>4GPUs”.

```diff
-... implementation for >4GPUs is suboptimal ...
+... implementation for > 4 GPUs is suboptimal ...
```
28-38: Optional: run container as the host user to avoid root-owned files in ~/.cache.

Mounting `~/.cache` into `/root/.cache` as root can leave root-owned artifacts on the host. Consider running the container with the host UID/GID.

```diff
-docker run --rm --ipc=host -it \
+docker run --rm --ipc=host -it \
+  --user $(id -u):$(id -g) \
   --ulimit stack=67108864 \
   --ulimit memlock=-1 \
   --gpus all \
   -p 8000:8000 \
   -e TRTLLM_ENABLE_PDL=1 \
   -v ~/.cache:/root/.cache:rw \
   nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev \
   /bin/bash
```
215-226: Security note: only set --trust_remote_code if needed.

Recommend adding a short caution that --trust_remote_code executes arbitrary model code; users should enable it only when necessary and with trusted sources.
353-359: Nice clarification on Triton install; small wording tweak for clarity.

Consider rephrasing slightly for flow.

```diff
-The `nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev` has Triton installed already (`echo $TRITON_ROOT` could reveal the path). In other situations, you will need to build and install a specific version of Triton. Please follow the instructions in this [link](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gpt_oss#using-openai-triton-kernels-for-moe).
+The `nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev` image includes Triton already (you can verify with `echo $TRITON_ROOT`). If you’re not using the dev container, you’ll need to build/install a specific Triton version; follow the instructions here: [Using OpenAI Triton Kernels for MoE](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gpt_oss#using-openai-triton-kernels-for-moe).
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (4 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md
[grammar] ~3-~3: There might be a mistake here.
Context: ... walk you through how to launch your own high-performance TensorRT-LLM server for...
(QB_NEW_EN)
[grammar] ~12-~12: There might be a mistake here.
Context: ...for lower latency and higher throughput) - Fast SSD storage for model weights - Acc...
(QB_NEW_EN)
[grammar] ~13-~13: There might be a mistake here.
Context: ...ut) - Fast SSD storage for model weights - Access to the gpt-oss-120b model checkpo...
(QB_NEW_EN)
[grammar] ~41-~41: There might be a mistake here.
Context: ...n/bash ``` Explanation of the command: - Automatically removes the container when...
(QB_NEW_EN)
[grammar] ~42-~42: There might be a mistake here.
Context: ...oves the container when stopped (--rm) - Allows container to interact with the ho...
(QB_NEW_EN)
[grammar] ~43-~43: There might be a mistake here.
Context: ...y for optimal performance (--ipc=host) - Runs the container in interactive mode (...
(QB_NEW_EN)
[grammar] ~44-~44: There might be a mistake here.
Context: ...he container in interactive mode (-it) - Sets up shared memory and stack limits f...
(QB_NEW_EN)
[grammar] ~45-~45: There might be a mistake here.
Context: ...and stack limits for optimal performance - Maps port 8000 from the container to the...
(QB_NEW_EN)
[grammar] ~46-~46: There might be a mistake here.
Context: ...port 8000 from the container to the host - Enables PDL for performance optimization...
(QB_NEW_EN)
[style] ~111-~111: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...A graph padding. - moe_config.backend is set to TRTLLM to use the trtllm-gen...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[grammar] ~148-~148: There might be a mistake here.
Context: ...ghput that saturates the system's GPUs. Using input sequence length (isl) of 1k and o...
(QB_NEW_EN)
[grammar] ~148-~148: There might be a mistake here.
Context: ...Using input sequence length (isl) of 1k and output sequence length (osl) of 2k, we ...
(QB_NEW_EN)
[grammar] ~174-~174: There might be a mistake here.
Context: ... DP which is better for high throughput. - set stream_interval to 10 to stream re...
(QB_NEW_EN)
[grammar] ~202-~202: There might be a mistake here.
Context: ...for MoE, so we set --ep to num_gpus. - When using enable_attention_dp, `max_b...
(QB_NEW_EN)
[grammar] ~206-~206: Ensure spelling is correct
Context: ...ut the communication implementation for >4GPUs is suboptimal and we are actively worki...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~374-~374: There might be a mistake here.
Context: ...arameters. - Add print_iter_log: true to extra LLM API options YAML file to insp...
(QB_NEW_EN)
[grammar] ~374-~374: There might be a mistake here.
Context: ...L file to inspect the per-iteration log. - Check GPU utilization with nvidia-smi ...
(QB_NEW_EN)
🪛 markdownlint-cli2 (0.17.2)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md
22-22: Multiple spaces after hash on atx style heading
(MD019, no-multiple-space-atx)
114-114: Link fragments should be valid
(MD051, link-fragments)
🔇 Additional comments (1)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (1)
121-137: Double-specifying both --model and --model_path may be confusing.

It’s typically either a model ID or a local path. If both are accepted, fine; otherwise, we should pick one and document the alternative separately to avoid user error.
Would you like me to adjust the examples to provide two separate blocks (one with --model, one with --model_path) to remove ambiguity?
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Actionable comments posted: 0
♻️ Duplicate comments (3)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (3)
216-216: Fix bash line-continuation: inline comments after backslashes break the command

The inline comment after the trailing backslash ends the continuation and will cause the next flags to be parsed as a separate command. Move comments to their own lines.

```diff
 trtllm-serve \
-  gpt-oss-120b \ # Or ${local_model_path}
+  gpt-oss-120b \
+  # Or, alternatively, pass a local path instead of the model name:
+  # ${local_model_path}
   --host 0.0.0.0 \
   --port 8000 \
   --backend pytorch \
   --tp_size ${num_gpus} \
   --ep_size 1 \
   --extra_llm_api_options low_latency.yaml \
-  --kv_cache_free_gpu_memory_fraction 0.9 \
-  --max_batch_size ${max_batch_size} \ # E.g., 1
+  --kv_cache_free_gpu_mem_fraction 0.9 \
+  --max_batch_size ${max_batch_size} \
+  # E.g., 1
   --trust_remote_code
```

Also applies to: 224-224
223-223: Use the correct kv-cache flag: --kv_cache_free_gpu_mem_fraction

Serve command uses an invalid flag spelling; bench uses the correct one. Align for consistency and to avoid runtime errors.

```diff
-  --kv_cache_free_gpu_memory_fraction 0.9 \
+  --kv_cache_free_gpu_mem_fraction 0.9 \
```
234-234: Fix same line-continuation issue in max-throughput serve command

Inline comments after backslashes break the multi-line command. Move them to separate lines.

```diff
 trtllm-serve \
-  gpt-oss-120b \ # Or ${local_model_path}
+  gpt-oss-120b \
+  # Or, alternatively, pass a local path instead of the model name:
+  # ${local_model_path}
   --host 0.0.0.0 \
   --port 8000 \
   --backend pytorch \
   --tp_size ${num_gpus} \
   --ep_size ${num_gpus} \
   --extra_llm_api_options max_throughput.yaml \
-  --kv_cache_free_gpu_memory_fraction 0.9 \
-  --max_batch_size ${max_batch_size} \ # E.g., 640
+  --kv_cache_free_gpu_mem_fraction 0.9 \
+  --max_batch_size ${max_batch_size} \
+  # E.g., 640
   --trust_remote_code
```

Also applies to: 242-242
🧹 Nitpick comments (4)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (4)
210-214: Make env var dependencies explicit before serve commands

The serve snippets rely on num_gpus and max_batch_size defined earlier. Add a short preamble or reminder to set them to avoid copy/paste failures.

````diff
 ## Launch the TensorRT-LLM Server

 We can use `trtllm-serve` to serve the model by translating the benchmark commands above. For low-latency configuration, run:
+
+```bash
+# Ensure these are set (reuse values from the benchmarking section or adjust as needed)
+num_gpus=8
+max_batch_size=1
+```
````
22-22: Fix markdownlint MD019: remove double space after heading hashes

Single space after the hashes avoids markdownlint warnings and renders consistently.

```diff
-###  NGC Docker Image of dev branch
+### NGC Docker Image of dev branch
```
49-50: Clarify local_model_path usage with a concrete example

A brief example reduces confusion when users choose a local path over downloading weights.

````diff
 Additionally, the container mounts your user `.cache` directory to save the downloaded model checkpoints, which are stored in `~/.cache/huggingface/hub/` by default. This prevents having to redownload the weights each time you rerun the container. You can also download the weights to a custom location (we assume `${local_model_path}` is the path to the local model weights).
+
+For example:
+```bash
+local_model_path=/data/models/openai/gpt-oss-120b
+```
````
206-206: Minor wording nit: spacing around numeric comparator

“>4GPUs” reads better as “> 4 GPUs”.

```diff
-... implementation for >4GPUs is suboptimal ...
+... implementation for > 4 GPUs is suboptimal ...
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (4 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md
[grammar] ~3-~3: There might be a mistake here.
Context: ... walk you through how to launch your own high-performance TensorRT-LLM server for...
(QB_NEW_EN)
[grammar] ~12-~12: There might be a mistake here.
Context: ...for lower latency and higher throughput) - Fast SSD storage for model weights - Acc...
(QB_NEW_EN)
[grammar] ~13-~13: There might be a mistake here.
Context: ...ut) - Fast SSD storage for model weights - Access to the gpt-oss-120b model checkpo...
(QB_NEW_EN)
[grammar] ~41-~41: There might be a mistake here.
Context: ...n/bash ``` Explanation of the command: - Automatically removes the container when...
(QB_NEW_EN)
[grammar] ~42-~42: There might be a mistake here.
Context: ...oves the container when stopped (--rm) - Allows container to interact with the ho...
(QB_NEW_EN)
[grammar] ~43-~43: There might be a mistake here.
Context: ...y for optimal performance (--ipc=host) - Runs the container in interactive mode (...
(QB_NEW_EN)
[grammar] ~44-~44: There might be a mistake here.
Context: ...he container in interactive mode (-it) - Sets up shared memory and stack limits f...
(QB_NEW_EN)
[grammar] ~45-~45: There might be a mistake here.
Context: ...and stack limits for optimal performance - Maps port 8000 from the container to the...
(QB_NEW_EN)
[grammar] ~46-~46: There might be a mistake here.
Context: ...port 8000 from the container to the host - Enables PDL for performance optimization...
(QB_NEW_EN)
[style] ~111-~111: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...A graph padding. - moe_config.backend is set to TRTLLM to use the trtllm-gen...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[grammar] ~148-~148: There might be a mistake here.
Context: ...ghput that saturates the system's GPUs. Using input sequence length (isl) of 1k and o...
(QB_NEW_EN)
[grammar] ~148-~148: There might be a mistake here.
Context: ...Using input sequence length (isl) of 1k and output sequence length (osl) of 2k, we ...
(QB_NEW_EN)
[grammar] ~174-~174: There might be a mistake here.
Context: ... DP which is better for high throughput. - set stream_interval to 10 to stream re...
(QB_NEW_EN)
[grammar] ~202-~202: There might be a mistake here.
Context: ...for MoE, so we set --ep to num_gpus. - When using enable_attention_dp, `max_b...
(QB_NEW_EN)
[grammar] ~206-~206: Ensure spelling is correct
Context: ...ut the communication implementation for >4GPUs is suboptimal and we are actively worki...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~374-~374: There might be a mistake here.
Context: ...arameters. - Add print_iter_log: true to extra LLM API options YAML file to insp...
(QB_NEW_EN)
[grammar] ~374-~374: There might be a mistake here.
Context: ...L file to inspect the per-iteration log. - Check GPU utilization with nvidia-smi ...
(QB_NEW_EN)
🪛 markdownlint-cli2 (0.17.2)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md
22-22: Multiple spaces after hash on atx style heading
(MD019, no-multiple-space-atx)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (2)
216-216: Ensure model identifier is consistent with bench examples

Bench commands use “openai/gpt-oss-120b” while serve uses “gpt-oss-120b”. For clarity and fewer surprises, consider using the same HF repo id in serve, unless a local path is intended.

```diff
-  gpt-oss-120b \
+  openai/gpt-oss-120b \
```

If users choose a local path, keep the adjacent comment guidance (moved to its own line per the other fix).
Also applies to: 234-234
114-114: Anchor fix looks good

The intra-doc link now correctly targets “#h200h100-only-using-openai-triton-kernels-for-moe”.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Actionable comments posted: 0
♻️ Duplicate comments (2)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (2)
215-226: Two serve-command bugs: broken line continuation and wrong kv-cache flag; also unify model identifier.
- Trailing backslash followed by an inline comment breaks the shell command (Line 216). The same issue occurs on the `--max_batch_size` line (Line 224).
- Flag name should be `--kv_cache_free_gpu_mem_fraction` (not `...memory...`) for consistency with bench and CLI.
- Use the same model identifier as benchmarks (`openai/gpt-oss-120b`) for consistency, or clearly show a local path alternative on its own lines.

```diff
 trtllm-serve \
-  gpt-oss-120b \ # Or ${local_model_path}
+  openai/gpt-oss-120b \
+  # Or, alternatively, pass a local path instead of the model name:
+  # ${local_model_path}
   --host 0.0.0.0 \
   --port 8000 \
   --backend pytorch \
   --tp_size ${num_gpus} \
   --ep_size 1 \
   --extra_llm_api_options low_latency.yaml \
-  --kv_cache_free_gpu_memory_fraction 0.9 \
-  --max_batch_size ${max_batch_size} \ # E.g., 1
+  --kv_cache_free_gpu_mem_fraction 0.9 \
+  # E.g., 1
+  --max_batch_size ${max_batch_size} \
   --trust_remote_code
```
233-244: Repeat fixes in max-throughput serve command: line continuation + kv-cache flag + model id.

Same issues as the low‑latency block: trailing backslashes with inline comments (Lines 234, 242) and the kv-cache flag name (Line 241). Also align the model identifier.

```diff
 trtllm-serve \
-  gpt-oss-120b \ # Or ${local_model_path}
+  openai/gpt-oss-120b \
+  # Or, alternatively, pass a local path instead of the model name:
+  # ${local_model_path}
   --host 0.0.0.0 \
   --port 8000 \
   --backend pytorch \
   --tp_size ${num_gpus} \
   --ep_size ${num_gpus} \
   --extra_llm_api_options max_throughput.yaml \
-  --kv_cache_free_gpu_memory_fraction 0.9 \
-  --max_batch_size ${max_batch_size} \ # E.g., 640
+  --kv_cache_free_gpu_mem_fraction 0.9 \
+  # E.g., 640
+  --max_batch_size ${max_batch_size} \
   --trust_remote_code
```
🧹 Nitpick comments (5)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (5)
22-22: Fix markdownlint: remove extra space after heading marker.

Two spaces after the hashes violate MD019 and render oddly in some viewers.

```diff
-###  NGC Docker Image of dev branch
+### NGC Docker Image of dev branch
```
352-359: Good: Triton section title and install guidance are now clear and linked. Minor grammar polish.
- Title and anchor alignment look correct, and the installation link is helpful.
- Minor wording improvement to tighten phrasing.
```diff
-The `nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev` has prepared Triton already (`echo $TRITON_ROOT` could reveal the path). In other situations, you will need to build and install a specific version of Triton.
+The `nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev` container already includes Triton (`echo $TRITON_ROOT` can reveal the path). In other environments, you will need to build and install a specific version of Triton.
```
12-17: Clarify H100 positioning vs section below.

Prerequisites exclude H100 but later sections include “(H200/H100 Only)” and recommend Triton on Hopper (which includes H100). To avoid confusion, add a brief note here that while detailed H100 performance guidance is forthcoming, the MoE backend selection guidance applies to H100 too.

```diff
-- 1x NVIDIA B200/GB200/H200 GPU (more GPUs could be used for lower latency and higher throughput)
+- 1x NVIDIA B200/GB200/H200 GPU (more GPUs could be used for lower latency and higher throughput)
+  Note: H100 guidance for peak performance is forthcoming; however, the MoE backend selection in this guide also applies to H100.
@@
-We have a forthcoming guide for achieving great performance on H100; however, this guide focuses on the GPUs listed above.
+We have a forthcoming guide for achieving great performance on H100; however, this guide otherwise focuses on the GPUs listed above.
```
206-206: Typo/spacing: “>4GPUs” → “> 4 GPUs”.

Improves readability.

```diff
-... implementation for >4GPUs is suboptimal ...
+... implementation for > 4 GPUs is suboptimal ...
```
95-112: Style: reduce repetitive phrasing in “Key takeaways.”

Three bullets start with “is set to …”. Consider tightening to avoid repetition.

```diff
-- `enable_attention_dp` is set to `false` to use TP instead of DP for attention.
-- `use_torch_sampler` is set to `true` to use the PyTorch sampler. While the `TRTLLM` sampler is the default, it currently has performance issues, so we use the PyTorch sampler instead.
-- `cuda_graph_config.max_batch_size` is the maximum batch size for CUDA graph.
-- `cuda_graph_config.enable_padding` is set to `true` to enable CUDA graph padding.
-- `moe_config.backend` is set to `TRTLLM` to use the `trtllm-gen` MoE kernels which are optimized for low concurrency.
+- `enable_attention_dp: false` uses TP instead of DP for attention.
+- `use_torch_sampler: true` selects the PyTorch sampler. While `TRTLLM` is the default, it currently has performance issues.
+- `cuda_graph_config.max_batch_size` defines the maximum batch size for CUDA graph; `cuda_graph_config.enable_padding: true` turns on CUDA graph padding.
+- `moe_config.backend: TRTLLM` uses the `trtllm-gen` MoE kernels, optimized for low concurrency.
```
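For orientation, these keys correspond to an extra-options file along the following lines (a sketch only — it assumes the dotted keys nest as YAML mappings, matching the `low_latency.yaml` referenced in the serve commands; the `max_batch_size` value is just an example):

```yaml
# low_latency.yaml (sketch)
enable_attention_dp: false   # TP instead of DP for attention
use_torch_sampler: true      # avoid current TRTLLM-sampler performance issues
cuda_graph_config:
  max_batch_size: 1          # example value; the blog sets this per scenario
  enable_padding: true       # pad requests up to CUDA-graph batch sizes
moe_config:
  backend: TRTLLM            # trtllm-gen MoE kernels, optimized for low concurrency
```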
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (4 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md
[grammar] ~3-~3: There might be a mistake here.
Context: ... walk you through how to launch your own high-performance TensorRT-LLM server for...
(QB_NEW_EN)
[grammar] ~12-~12: There might be a mistake here.
Context: ...for lower latency and higher throughput) - Fast SSD storage for model weights - Acc...
(QB_NEW_EN)
[grammar] ~13-~13: There might be a mistake here.
Context: ...ut) - Fast SSD storage for model weights - Access to the gpt-oss-120b model checkpo...
(QB_NEW_EN)
[grammar] ~41-~41: There might be a mistake here.
Context: ...n/bash ``` Explanation of the command: - Automatically removes the container when...
(QB_NEW_EN)
[grammar] ~42-~42: There might be a mistake here.
Context: ...oves the container when stopped (--rm) - Allows container to interact with the ho...
(QB_NEW_EN)
[grammar] ~43-~43: There might be a mistake here.
Context: ...y for optimal performance (--ipc=host) - Runs the container in interactive mode (...
(QB_NEW_EN)
[grammar] ~44-~44: There might be a mistake here.
Context: ...he container in interactive mode (-it) - Sets up shared memory and stack limits f...
(QB_NEW_EN)
[grammar] ~45-~45: There might be a mistake here.
Context: ...and stack limits for optimal performance - Maps port 8000 from the container to the...
(QB_NEW_EN)
[grammar] ~46-~46: There might be a mistake here.
Context: ...port 8000 from the container to the host - Enables PDL for performance optimization...
(QB_NEW_EN)
[style] ~111-~111: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...A graph padding. - moe_config.backend is set to TRTLLM to use the trtllm-gen...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[grammar] ~148-~148: There might be a mistake here.
Context: ...ghput that saturates the system's GPUs. Using input sequence length (isl) of 1k and o...
(QB_NEW_EN)
[grammar] ~148-~148: There might be a mistake here.
Context: ...Using input sequence length (isl) of 1k and output sequence length (osl) of 2k, we ...
(QB_NEW_EN)
[grammar] ~174-~174: There might be a mistake here.
Context: ... DP which is better for high throughput. - set stream_interval to 10 to stream re...
(QB_NEW_EN)
[grammar] ~202-~202: There might be a mistake here.
Context: ...for MoE, so we set --ep to num_gpus. - When using enable_attention_dp, `max_b...
(QB_NEW_EN)
[grammar] ~206-~206: Ensure spelling is correct
Context: ...ut the communication implementation for >4GPUs is suboptimal and we are actively worki...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~374-~374: There might be a mistake here.
Context: ...arameters. - Add print_iter_log: true to extra LLM API options YAML file to insp...
(QB_NEW_EN)
[grammar] ~374-~374: There might be a mistake here.
Context: ...L file to inspect the per-iteration log. - Check GPU utilization with nvidia-smi ...
(QB_NEW_EN)
🪛 markdownlint-cli2 (0.17.2)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md
22-22: Multiple spaces after hash on atx style heading
(MD019, no-multiple-space-atx)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (2)
114-114: Good: intra-doc link anchor matches section title.

The fragment `#h200h100-only-using-openai-triton-kernels-for-moe` correctly targets the “(H200/H100 Only) …” section.
121-137: Sanity check passed: CLI flag and dataset script path are correct
- Verified that `trtllm-bench` uses `--kv_cache_free_gpu_mem_fraction` (not “memory”).
- Confirmed the dataset preparation script lives at `benchmarks/cpp/prepare_dataset.py`.

No changes required.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Actionable comments posted: 0
♻️ Duplicate comments (6)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (6)
213-213: Bash line-continuation bug: trailing backslash followed by an inline comment breaks the command

Move the comment to its own line so the backslash is the last character.

```diff
-  gpt-oss-120b \ # Or ${local_model_path}
+  gpt-oss-120b \
+  # Or, alternatively, pass a local path instead of the model name:
+  # ${local_model_path}
```
221-221: Same line-continuation issue on --max_batch_size line

Inline comment after a trailing backslash breaks the command.

```diff
-  --max_batch_size ${max_batch_size} \ # E.g., 1
+  --max_batch_size ${max_batch_size} \
+  # E.g., 1
```
220-220: CLI flag mismatch: use --kv_cache_free_gpu_mem_fraction (not ...memory...)

Keep this consistent with bench commands and the CLI.

```diff
-  --kv_cache_free_gpu_memory_fraction 0.9 \
+  --kv_cache_free_gpu_mem_fraction 0.9 \
```
231-231: Repeat line-continuation fix in max-throughput serve command

The inline comment after the backslash breaks the multi-line command.

```diff
-  gpt-oss-120b \ # Or ${local_model_path}
+  gpt-oss-120b \
+  # Or, alternatively, pass a local path instead of the model name:
+  # ${local_model_path}
```
239-239: Same line-continuation issue on --max_batch_size (throughput variant)

Move the example to a separate line.

```diff
-  --max_batch_size ${max_batch_size} \ # E.g., 640
+  --max_batch_size ${max_batch_size} \
+  # E.g., 640
```
238-238: Fix kv-cache flag name in max-throughput serve command

Align with bench usage and correct flag spelling.

```diff
-  --kv_cache_free_gpu_memory_fraction 0.9 \
+  --kv_cache_free_gpu_mem_fraction 0.9 \
```
🧹 Nitpick comments (4)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (4)
22-22: Fix markdownlint MD019: remove extra space in heading

Atx headings should have a single space after the hashes.

```diff
-###  NGC Docker Image of dev branch
+### NGC Docker Image of dev branch
```
49-50: Define local_model_path before use to avoid confusion

You reference ${local_model_path} in later commands but never show how to set it. Add a short example to make copy/paste flows work.

````diff
 Additionally, the container mounts your user `.cache` directory to save the downloaded model checkpoints, which are stored in `~/.cache/huggingface/hub/` by default. This prevents having to redownload the weights each time you rerun the container. You can also download the weights to a custom location (we assume `${local_model_path}` is the path to the local model weights).
+
+For example:
+
+```bash
+# If you've pre-downloaded weights to a local directory:
+export local_model_path=/data/models/gpt-oss-120b
+```
````
107-107: Typo in bullet: stray “s-” prefix

Remove the stray character to keep list formatting consistent.

```diff
-s- `cuda_graph_config.max_batch_size` is the maximum batch size for CUDA graph.
+- `cuda_graph_config.max_batch_size` is the maximum batch size for CUDA graph.
```
119-123: Clarify model vs model_path precedence in trtllm-bench

You pass both --model and --model_path. If both are supported, a brief note on precedence helps avoid user confusion; otherwise, show two variants (remote vs local).
Example clarity:
- Remote weights: `trtllm-bench ... --model openai/gpt-oss-120b ...`
- Local weights: `trtllm-bench ... --model_path ${local_model_path} ...`
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (4 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md
[grammar] ~3-~3: There might be a mistake here.
Context: ... walk you through how to launch your own high-performance TensorRT-LLM server for...
(QB_NEW_EN)
[grammar] ~12-~12: There might be a mistake here.
Context: ...for lower latency and higher throughput) - Fast SSD storage for model weights - Acc...
(QB_NEW_EN)
[grammar] ~13-~13: There might be a mistake here.
Context: ...ut) - Fast SSD storage for model weights - Access to the gpt-oss-120b model checkpo...
(QB_NEW_EN)
[grammar] ~41-~41: There might be a mistake here.
Context: ...n/bash ``` Explanation of the command: - Automatically removes the container when...
(QB_NEW_EN)
[grammar] ~42-~42: There might be a mistake here.
Context: ...oves the container when stopped (--rm) - Allows container to interact with the ho...
(QB_NEW_EN)
[grammar] ~43-~43: There might be a mistake here.
Context: ...y for optimal performance (--ipc=host) - Runs the container in interactive mode (...
(QB_NEW_EN)
[grammar] ~44-~44: There might be a mistake here.
Context: ...he container in interactive mode (-it) - Sets up shared memory and stack limits f...
(QB_NEW_EN)
[grammar] ~45-~45: There might be a mistake here.
Context: ...and stack limits for optimal performance - Maps port 8000 from the container to the...
(QB_NEW_EN)
[grammar] ~46-~46: There might be a mistake here.
Context: ...port 8000 from the container to the host - Enables PDL for performance optimization...
(QB_NEW_EN)
[grammar] ~106-~106: There might be a mistake here.
Context: ...to use TP instead of DP for attention. s-cuda_graph_config.max_batch_size` is...
(QB_NEW_EN)
[grammar] ~146-~146: There might be a mistake here.
Context: ...ghput that saturates the system's GPUs. Using input sequence length (isl) of 1k and o...
(QB_NEW_EN)
[grammar] ~146-~146: There might be a mistake here.
Context: ...Using input sequence length (isl) of 1k and output sequence length (osl) of 2k, we ...
(QB_NEW_EN)
[grammar] ~171-~171: There might be a mistake here.
Context: ... DP which is better for high throughput. - set stream_interval to 10 to stream re...
(QB_NEW_EN)
[grammar] ~199-~199: There might be a mistake here.
Context: ...for MoE, so we set --ep to num_gpus. - When using enable_attention_dp, `max_b...
(QB_NEW_EN)
[grammar] ~203-~203: Ensure spelling is correct
Context: ...ut the communication implementation for >4GPUs is suboptimal and we are actively worki...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~371-~371: There might be a mistake here.
Context: ...arameters. - Add print_iter_log: true to extra LLM API options YAML file to insp...
(QB_NEW_EN)
[grammar] ~371-~371: There might be a mistake here.
Context: ...L file to inspect the per-iteration log. - Check GPU utilization with nvidia-smi ...
(QB_NEW_EN)
🪛 markdownlint-cli2 (0.17.2)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md
22-22: Multiple spaces after hash on atx style heading
(MD019, no-multiple-space-atx)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (2)
112-112: Good fix: intra-doc link anchor now matches the section title

Anchor fragment `#h200h100-only-using-openai-triton-kernels-for-moe` correctly aligns with the section header.
353-356: Nice clarification on Triton availability and installation link

This addresses the earlier feedback: dev container includes Triton, and external users get a clear installation path.
/bot skip --comment "doc update"

1 similar comment

/bot skip --comment "doc update"

PR_Github #15570 [ skip ] triggered by Bot

PR_Github #15570 [ skip ] completed with state

/bot skip --comment "doc update"

PR_Github #15582 [ skip ] triggered by Bot

PR_Github #15582 [ skip ] completed with state
Summary by CodeRabbit
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message.

See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill

`kill`

Kill all running builds associated with pull request.
skip

`skip --comment COMMENT`

Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
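As a concrete illustration (hypothetical values, combining only options documented above for the `run` subcommand), a PR comment such as the following would launch a pre-merge pipeline restricted to one test stage, with fail-fast disabled:

```
/bot run --stage-list "A10-PyTorch-1" --disable-fail-fast
```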