[None][docs] refine docs for accuracy evaluation of gpt-oss models #7252

binghanc · 2025-08-26T08:16:34Z

add docs for accuracy evaluation of gpt-oss models.

Summary by CodeRabbit

Documentation
- Added guidance for configuring and running accuracy evaluations for GPT-OSS on TRT-LLM.
- Clarified required server parameters (attention DP, tensor/experts parallelism, max batch size, max tokens).
- Documented use of the reasoning-effort flag when launching evaluations.
- Included a reference table mapping reasoning-effort levels to parallel settings and token/batch limits.
- Provided an example command for running evaluations (chat_completions, GPQA/AIME25, 120B model, low/medium effort).
- Inserted guidance in the evaluation intro and optional verification sections.

coderabbitai · 2025-08-26T08:16:41Z

Walkthrough

Adds documentation to the GPT‑OSS on TRT‑LLM quick-start guide explaining required TRT‑LLM server flags and gpt‑oss evaluation options, provides a mapping table of reasoning‑effort to parallel config and limits, and includes an example eval command. Changes are documentation-only and inserted in two locations on the page.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Docs: GPT‑OSS on TRT‑LLM Quick Start `docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md` | Added accuracy evaluation guidance: required TRT‑LLM server flags (`enable_attention_dp`, `tp_size`, `ep_size`, `max_batch_size`, `max_num_tokens`), `reasoning-effort` for gpt‑oss evals, a reference table mapping `reasoning-effort` to parallel configuration and batch/token limits (four rows), and an example `gpt_oss.evals` command. Inserted after the evaluation intro and under “Running Evaluations to Verify Accuracy (Optional)”. |

Sequence Diagram(s)

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

[TRTLLM-7321][doc] Refine GPT-OSS doc #7180 — Edits the same quick-start doc with overlapping evaluation and server flag guidance.
[TRTLLM-7321][doc] Add GPT-OSS Deployment Guide into official doc site #7143 — Also updates the same guide, adding similar TRT‑LLM server flags and GPT‑OSS evaluation instructions.

Suggested reviewers

dongfengy
QiJune


coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md (2)

237-237: Capitalize product names and tighten the instruction sentence

Use consistent branding and flag notation. Also add the missing serial comma.

-You need to set `enable_attention_dp`, `tp_size`, `ep_size`, `max_batch_size` and `max_num_tokens` when launching the trtllm server and set `reasoning-effort` when launching evaluation in gpt-oss. Below are some reference configurations for accuracy evaluation on B200. 
+Set `enable_attention_dp`, `tp_size`, `ep_size`, `max_batch_size`, and `max_num_tokens` when launching the TRT-LLM server, and set `--reasoning-effort` when launching evaluations in GPT-OSS. Below are reference configurations for accuracy evaluation on B200.

239-245: Fix table grammar/formatting and clarify what DP/TP/EP mean

Grammar/formatting: header-casing, spacing around operators, thousands separators, and clearer alternatives (“or …”) improve readability and address the grammar hints.
Add a short note to disambiguate DP vs TP/EP and to remind readers to align server flags with the table values.

-| **reasoning-effort** | **parallel configuration** | **max_batch_size** | **max_num_tokens** |
-|:--------------------:|:--------------------------:|:------------------:|:------------------:|
-| low/medium           | DP+TP8+EP8 / DP+TP4+EP4    | 128                | 32768              |
-| high                 | DP+TP8+EP8 / DP+TP4+EP4    | 2                  | 133120             |
-| low/medium           | TP8 / TP4                  | 1024               | 32768              |
-| high                 | TP8 / TP4                  | 16                 | 133120             |
+| Reasoning effort | Parallel config                  | Max batch size | Max num tokens |
+|:----------------:|:---------------------------------|---------------:|---------------:|
+| low, medium      | DP + TP8 + EP8 (or DP + TP4 + EP4) |            128 |        33,120 |
+| high             | DP + TP8 + EP8 (or DP + TP4 + EP4) |              2 |       133,120 |
+| low, medium      | TP8 (or TP4)                       |          1,024 |        32,768 |
+| high             | TP8 (or TP4)                       |             16 |       133,120 |
+
+Note: DP here refers to attention data parallel. Enable it via `enable_attention_dp: true` in the YAML and launch across multiple GPUs; TP maps to `--tp_size`, EP maps to `--ep_size`. Ensure your server `--max_num_tokens` is set to at least the value shown in the table for the chosen configuration.

Note: The first row of the suggested table shows 33,120, while the original row has 32768. If 32,768 (32K) was intended, please correct it accordingly.
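
To make the mapping between the table and the server flags concrete, here is a minimal Python sketch that looks up one row and produces the matching settings. The limits are copied from the original table (the removed lines in the diff above). The function name `serve_settings`, the `--max_batch_size` spelling of the CLI flag, and the choice to omit `--ep_size` for the TP-only rows are assumptions made for illustration, not something this PR confirms.

```python
# Reference limits from the original table in this PR (measured on B200).
# Key: (reasoning effort, attention DP enabled) -> (max_batch_size, max_num_tokens)
REFERENCE_LIMITS = {
    ("low", True): (128, 32768),
    ("medium", True): (128, 32768),
    ("high", True): (2, 133120),
    ("low", False): (1024, 32768),
    ("medium", False): (1024, 32768),
    ("high", False): (16, 133120),
}


def serve_settings(effort, attention_dp, gpus=8):
    """Return (CLI args, YAML text) for one row of the reference table.

    Hypothetical helper: it only formats values, it does not launch a server.
    """
    if gpus not in (4, 8):
        raise ValueError("the reference table only covers 4- and 8-GPU layouts")
    max_batch_size, max_num_tokens = REFERENCE_LIMITS[(effort, attention_dp)]
    args = [
        "--tp_size", str(gpus),
        "--max_batch_size", str(max_batch_size),
        "--max_num_tokens", str(max_num_tokens),
    ]
    if attention_dp:
        # The DP rows pair attention DP with expert parallelism of the same size.
        args += ["--ep_size", str(gpus)]
    # enable_attention_dp is set in the YAML file, not on the command line.
    yaml_text = f"enable_attention_dp: {'true' if attention_dp else 'false'}\n"
    return args, yaml_text


args, yaml_text = serve_settings("high", attention_dp=True, gpus=8)
print(" ".join(args))
print(yaml_text, end="")
```

The returned arguments would be appended to the server launch command and the YAML text written to the options file that the launch command points at; how that file is passed to the server is left out here because it is not shown in this thread.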

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 23ed0c8 and 9bcc11b.

📒 Files selected for processing (1)

docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md (1 hunks)

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-07-28T17:06:08.621Z

Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md

🪛 LanguageTool

docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md

[grammar] ~239-~239: There might be a mistake here.
Context: ...max_batch_size* | max_num_tokens | |:--------------------:|:---------------...

(QB_NEW_EN)

[grammar] ~240-~240: There might be a mistake here.
Context: ...-----------------:|:------------------:| | low/medium | DP+TP8+EP8 / DP...

(QB_NEW_EN)

[grammar] ~241-~241: There might be a mistake here.
Context: ...28 | 32768 | | high | DP+TP8+EP8 / DP...

(QB_NEW_EN)

[grammar] ~242-~242: There might be a mistake here.
Context: ... | 133120 | | low/medium | TP8 / TP4 ...

(QB_NEW_EN)

[grammar] ~243-~243: There might be a mistake here.
Context: ...024 | 32768 | | high | TP8 / TP4 ...

(QB_NEW_EN)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pre-commit Check

docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md

karljang · 2025-08-27T20:25:06Z

@binghanc, thanks for the contribution~

Could you add "[None][docs]" to the title to fix the failing Check PR Title Format (pull_request) check?
Also, could you address the comments above from @dongfengy?

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md (1)

246-255: Point evals to your local server; fix dataset/model naming and hyphenation

Prepend env vars so gpt_oss.evals hits your local TRT‑LLM endpoint (otherwise it may call OpenAI by default).
In prose: “reasoning effort” and “AIME 2025”.
Use the served model id consistently: openai/gpt-oss-120b (server and Basic Test use this).
Keep the dataset flag --eval gpqa,aime25 but align the prose to “AIME 2025”.

-Below is an example command for evaluating the accuracy of gpt-oss-120b with low and medium reasoning-effort on GPQA and AIME2025.
+Below is an example command for evaluating the accuracy of gpt-oss-120b with low and medium reasoning effort on GPQA and AIME 2025.
 
 ```shell
-# execute this command in gpt-oss
-python -m gpt_oss.evals \
+# In the gpt-oss repository, target your local TRT-LLM server:
+export OPENAI_API_KEY="EMPTY"                # Any non-empty value is accepted
+export OPENAI_BASE_URL="http://localhost:8000/v1"
+
+# Then run:
+python -m gpt_oss.evals \
   --sampler chat_completions \
   --eval gpqa,aime25 \
-  --model gpt-oss-120b \
+  --model openai/gpt-oss-120b \
   --reasoning-effort low,medium
 ```
+Reminder: Ensure the TRT-LLM server was launched with a --max_num_tokens value compatible with the table above.
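
Before running the evals, it can help to confirm that the local endpoint actually serves the model id passed to `--model`. The following Python sketch assumes the server exposes the OpenAI-style `GET /v1/models` route at the same base URL used for `OPENAI_BASE_URL` above; the helper names are hypothetical and the route is an assumption about the server, not something stated in this review.

```python
import json
import urllib.request


def served_model_ids(models_payload):
    """Extract model ids from an OpenAI-style model-list response body."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def fetch_model_ids(base_url="http://localhost:8000/v1", timeout=5.0):
    """Query the local server for its model list (requires a running server)."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as response:
        return served_model_ids(json.load(response))


# Offline illustration with a hand-written payload in the OpenAI list shape.
sample = {"object": "list", "data": [{"id": "openai/gpt-oss-120b", "object": "model"}]}
print(served_model_ids(sample))
```

With a server running, `fetch_model_ids()` should return a list containing the id to pass to `--model`; if the id is missing, the eval would likely fail or target the wrong model.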



docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md

binghanc · 2025-09-08T00:52:41Z

/bot skip

github-actions · 2025-09-08T00:52:54Z

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

binghanc · 2025-09-08T01:05:12Z

/bot skip --comments "Modifying the document does not require code testing"

Signed-off-by: binghanc <binghanc@nvidia.com>

nv-guomingz · 2025-09-08T01:26:17Z

/bot skip --comment "docs change only"

tensorrt-cicd · 2025-09-08T01:31:56Z

PR_Github #17959 [ skip ] triggered by Bot

tensorrt-cicd · 2025-09-08T01:55:26Z

PR_Github #17959 [ skip ] completed with state SUCCESS
Skipping testing for commit 0e7fdcd

…VIDIA#7252) Signed-off-by: 176802681+binghanc@users.noreply.github.com

coderabbitai bot reviewed Aug 26, 2025

docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md (resolved)

svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label Aug 26, 2025

dongfengy reviewed Aug 26, 2025

docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md (outdated, resolved)

dongfengy reviewed Aug 26, 2025

docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md (outdated, resolved)

karljang added the waiting for feedback label Aug 27, 2025

binghanc changed the title ~~Binghanc/gpt oss docs~~ [None][docs] Binghanc/gpt oss docs Sep 1, 2025

coderabbitai bot reviewed Sep 1, 2025

docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md (resolved)

binghanc force-pushed the binghanc/gpt-oss-docs branch from 2ac1ef6 to 0a257e8 Compare September 2, 2025 04:26

binghanc marked this pull request as ready for review September 2, 2025 04:48

binghanc requested a review from a team as a code owner September 2, 2025 04:48

binghanc requested review from kaiyux and nv-guomingz September 2, 2025 04:48

binghanc changed the title ~~[None][docs] Binghanc/gpt oss docs~~ [None][docs] binghanc/gpt oss docs Sep 2, 2025

binghanc changed the title ~~[None][docs] binghanc/gpt oss docs~~ [None][docs] refine docs for gpt-oss Sep 2, 2025

binghanc changed the title ~~[None][docs] refine docs for gpt-oss~~ [None][docs] refine docs for accuracy evaluation of gpt-oss models Sep 2, 2025

nv-guomingz approved these changes Sep 8, 2025


binghanc added 2 commits September 8, 2025 09:24

add docs for accuracy evaluation of gpt-oss models.

6b82a5f

Signed-off-by: binghanc <binghanc@nvidia.com>

refine docs for accuracy evaluation of gpt-oss models

0e7fdcd

Signed-off-by: binghanc <binghanc@nvidia.com>

nv-guomingz force-pushed the binghanc/gpt-oss-docs branch from 6a89fc4 to 0e7fdcd Compare September 8, 2025 01:24

nv-guomingz enabled auto-merge (squash) September 8, 2025 01:26

nv-guomingz disabled auto-merge September 8, 2025 01:28

nv-guomingz merged commit 14ee43e into NVIDIA:main Sep 8, 2025
5 checks passed

Wong4j pushed a commit to Wong4j/TensorRT-LLM that referenced this pull request Sep 20, 2025

[None][docs] refine docs for accuracy evaluation of gpt-oss models (N…

8c98f39

…VIDIA#7252) Signed-off-by: 176802681+binghanc@users.noreply.github.com
