[TRTLLM-7321][doc] Refine GPT-OSS doc #7180
Conversation
📝 Walkthrough

Updates a single deployment guide to adjust YAML configs and CLI examples: increases max_batch_size, adds performance controls, removes the explicit max_seq_len flag, clarifies parameter descriptions, notes evals compatibility, and adds a benchmarking concurrency guideline.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Actionable comments posted: 0
🧹 Nitpick comments (5)
docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md (5)
68-77: YAML: align indentation under moe_config and document new top-level keys
- Indentation under moe_config is inconsistent (4 spaces vs. 2 under cuda_graph_config). While valid YAML, it’s brittle and easy to drift.
- stream_interval and num_postprocess_workers are introduced but not documented in “Extra LLM API Options.” Add brief descriptions for both.
- Sanity check: confirm both keys are supported by TorchLlmArgs for the pytorch backend path.
Apply the indentation tweak:
```diff
 enable_attention_dp: false
 cuda_graph_config:
   enable_padding: true
   max_batch_size: 720
 moe_config:
-    backend: TRTLLM
+  backend: TRTLLM
 stream_interval: 10
 num_postprocess_workers: 4
```

And add brief docs for the new keys in the “Extra LLM API Options” section (see suggested insertion around Line 175):

```diff
 #### `moe_config`

 * **Description**: Configuration for Mixture-of-Experts (MoE) models.
 * **Options**:
   * `backend`: The backend to use for MoE operations. **Default**: `CUTLASS`
+
+#### `stream_interval`
+
+* **Description**: Frequency (in tokens) to flush streamed partial responses to the client. Lower values improve “liveness” at modest overhead. Tune based on latency vs. throughput needs.
+
+#### `num_postprocess_workers`
+
+* **Description**: Number of worker threads for postprocessing (e.g., detokenization/formatting). Increase to reduce CPU postprocess bottlenecks when GPU throughput is high.
```
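For reference, a sketch of how the TRTLLM-backend profile's extra options file would read after the suggested cleanup (reconstructed from the diff above with two-space indentation throughout; not an authoritative listing of supported keys):

```yaml
# TRTLLM MoE backend profile, indentation normalized to two spaces
enable_attention_dp: false
cuda_graph_config:
  enable_padding: true
  max_batch_size: 720
moe_config:
  backend: TRTLLM
stream_interval: 10
num_postprocess_workers: 4
```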
85-95: Mirror the YAML cleanup for the CUTLASS profile and verify key support
- Same indentation nit under moe_config.
- Confirm stream_interval and num_postprocess_workers are honored for the CUTLASS MoE backend profile as well.
```diff
 enable_attention_dp: true
 cuda_graph_config:
   enable_padding: true
   max_batch_size: 720
 moe_config:
-    backend: CUTLASS
+  backend: CUTLASS
 stream_interval: 10
 num_postprocess_workers: 4
```
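The corrected CUTLASS profile ends up analogous to the sketch under the previous comment, differing only in `enable_attention_dp: true` and `backend: CUTLASS`.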
101-113: Keep CLI max_batch_size and YAML cuda_graph_config.max_batch_size in lockstep; add a guardrail note

`--max_batch_size` is now 720; good. Ensure it always matches `cuda_graph_config.max_batch_size` to avoid falling off the captured graph path. Consider adding a one-line note near the command reminding readers to keep the two in sync.
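A minimal illustration of the pairing such a note could point at (values taken from the examples in this review; the comment is only a reminder, not required syntax):

```yaml
# Must match the trtllm-serve CLI flag --max_batch_size (720 in the guide's example);
# otherwise request batches can fall outside the captured CUDA graph sizes.
cuda_graph_config:
  enable_padding: true
  max_batch_size: 720
```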
139-141: Nice clarification on batch size vs. total sequence length

Clear and accurate. Consider adding a tiny example (e.g., “1024 in + 1024 out → total 2048”) for concreteness.
272-273: Clarify the concurrency heuristic and fix “through-put” typo
- Wording: “throughput” (no hyphen).
- The rule of thumb “concurrency = max_batch_size * num_gpus” can be misleading with attention DP, TP/EP mixes, or heterogeneous participation in attention. Prefer “number of GPUs participating in attention” and call out that EP shards may not multiply attention capacity.
```diff
-To achieve max through-put, with attention DP on, one needs to sweep up to `concurrency = max_batch_size * num_gpus`.
+To achieve max throughput with attention DP enabled, sweep concurrency up to:
+`concurrency = max_batch_size * num_attention_gpus`,
+where `num_attention_gpus` is the number of GPUs participating in attention (often the TP world size). If EP is used, note that expert shards do not necessarily increase attention capacity.
```
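As a worked example under this heuristic: with `max_batch_size: 720` and, say, 8 GPUs participating in attention, the sweep would go up to 720 × 8 = 5760 concurrent requests (the GPU count here is illustrative, not taken from the guide).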
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md (6 hunks)
🔇 Additional comments (2)
docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md (2)
146-149: Double-check behavior when --max_seq_len is omitted

The note that `max_seq_len` will be inferred from the model config is plausible, but this can affect memory planning and scheduling. Please verify for `trtllm-serve` with the pytorch backend that omitting this flag behaves as intended for GPT-OSS-120B, and call out how users can override it if needed.
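One way to answer the override question would be to note that readers can still pass `--max_seq_len` explicitly on the `trtllm-serve` command line (the flag this PR removes from the example) to pin the value, assuming the flag remains supported for the pytorch backend.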
232-236: Tighten punctuation and verify evals compatibility claim

- Move the comma outside the inline code span for clarity.
- Confirm that `gpt_oss.evals` can indeed target TRT-LLM Serve's Chat Completions/Responses API without any shim code in this release.

Suggested diff:
```diff
- With the added support of Chat Completions and Responses API in `trtllm-serve,` `gpt_oss.evals` works directly without any modifications.
+ With the added support of Chat Completions and Responses API in `trtllm-serve`, `gpt_oss.evals` works directly without any modifications.
```
/bot skip --comment "No need to run full CI"

PR_Github #16306 [ skip ] triggered by Bot
PR_Github #16306 [ skip ] completed with state |
Summary by CodeRabbit
Description
Test Coverage
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provides a user-friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
run

`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.
- `--reuse-test (optional)pipeline-id` (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL): Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL): Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL): Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.
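For example, `/bot run --stage-list "A10-PyTorch-1"` would run only that stage and, per the note above, would not update the GitHub check status.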
kill

`kill`

Kill all running builds associated with pull request.
skip
`skip --comment COMMENT`

Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
`reuse-pipeline`

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.