[None][docs] refine docs for accuracy evaluation of gpt-oss models by binghanc · Pull Request #7252 · NVIDIA/TensorRT-LLM · GitHub

Conversation

@binghanc
Contributor

@binghanc binghanc commented Aug 26, 2025

add docs for accuracy evaluation of gpt-oss models.

Summary by CodeRabbit

  • Documentation
    • Added guidance for configuring and running accuracy evaluations for GPT-OSS on TRT-LLM.
    • Clarified required server parameters (attention DP, tensor/experts parallelism, max batch size, max tokens).
    • Documented use of the reasoning-effort flag when launching evaluations.
    • Included a reference table mapping reasoning-effort levels to parallel settings and token/batch limits.
    • Provided an example command for running evaluations (chat_completions, GPQA/AIME25, 120B model, low/medium effort).
    • Inserted guidance in the evaluation intro and optional verification sections.

@coderabbitai
Contributor

coderabbitai bot commented Aug 26, 2025

📝 Walkthrough

Walkthrough

Adds documentation to the GPT‑OSS on TRT‑LLM quick-start guide explaining required TRT‑LLM server flags and gpt‑oss evaluation options, provides a mapping table of reasoning‑effort to parallel config and limits, and includes an example eval command. Changes are documentation-only and inserted in two locations on the page.

Changes

Cohort / File(s) Summary
Docs: GPT‑OSS on TRT‑LLM Quick Start
docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md
Added accuracy evaluation guidance: required TRT‑LLM server flags (enable_attention_dp, tp_size, ep_size, max_batch_size, max_num_tokens), reasoning-effort for gpt‑oss evals, a reference table mapping reasoning-effort to parallel configuration and batch/token limits (four rows), and an example gpt_oss.evals command. Inserted after the evaluation intro and under “Running Evaluations to Verify Accuracy (Optional)”.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

  • dongfengy
  • QiJune

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md (2)

237-237: Capitalize product names and tighten the instruction sentence

Use consistent branding and flag notation. Also add the missing serial comma.

```diff
-You need to set `enable_attention_dp`, `tp_size`, `ep_size`, `max_batch_size` and `max_num_tokens` when launching the trtllm server and set `reasoning-effort` when launching evaluation in gpt-oss. Below are some reference configurations for accuracy evaluation on B200.
+Set `enable_attention_dp`, `tp_size`, `ep_size`, `max_batch_size`, and `max_num_tokens` when launching the TRT-LLM server, and set `--reasoning-effort` when launching evaluations in GPT-OSS. Below are reference configurations for accuracy evaluation on B200.
```

239-245: Fix table grammar/formatting and clarify what DP/TP/EP mean

  • Grammar/formatting: header-casing, spacing around operators, thousands separators, and clearer alternatives (“or …”) improve readability and address the grammar hints.
  • Add a short note to disambiguate DP vs TP/EP and to remind readers to align server flags with the table values.
```diff
-| **reasoning-effort** | **parallel configuration** | **max_batch_size** | **max_num_tokens** |
-|:--------------------:|:--------------------------:|:------------------:|:------------------:|
-| low/medium           | DP+TP8+EP8 / DP+TP4+EP4    | 128                | 32768              |
-| high                 | DP+TP8+EP8 / DP+TP4+EP4    | 2                  | 133120             |
-| low/medium           | TP8 / TP4                  | 1024               | 32768              |
-| high                 | TP8 / TP4                  | 16                 | 133120             |
+| Reasoning effort | Parallel config                  | Max batch size | Max num tokens |
+|:----------------:|:---------------------------------|---------------:|---------------:|
+| low, medium      | DP + TP8 + EP8 (or DP + TP4 + EP4) |            128 |        33,120 |
+| high             | DP + TP8 + EP8 (or DP + TP4 + EP4) |              2 |       133,120 |
+| low, medium      | TP8 (or TP4)                       |          1,024 |        32,768 |
+| high             | TP8 (or TP4)                       |             16 |       133,120 |
+
+Note: DP here refers to attention data parallel. Enable it via `enable_attention_dp: true` in the YAML and launch across multiple GPUs; TP maps to `--tp_size`, EP maps to `--ep_size`. Ensure your server `--max_num_tokens` is set to at least the value shown in the table for the chosen configuration.
```

Note: If 33,120 in the first row was intended to be 32,768 (32K), please correct it accordingly.
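The YAML-vs-CLI split that the suggested note describes can be sketched as follows. This is a hedged illustration, not text from the PR: the config filename is hypothetical, and the exact flag spellings should be verified against the `trtllm-serve` CLI of your TRT-LLM version.

```yaml
# extra-llm-api-config.yml (hypothetical filename), passed to the server
# via --extra_llm_api_options. This is the "DP" switch from the table:
enable_attention_dp: true   # omit or set false for the TP-only rows

# The remaining table values map to trtllm-serve CLI flags, e.g. for the
# low/medium DP row:
#   --tp_size 8 --ep_size 8 --max_batch_size 128 --max_num_tokens 32768
```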

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 23ed0c8 and 9bcc11b.

📒 Files selected for processing (1)
  • docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md
🪛 LanguageTool
docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md

[grammar] ~239-~239: There might be a mistake here.
Context: ...max_batch_size* | max_num_tokens | |:--------------------:|:---------------...

(QB_NEW_EN)


[grammar] ~240-~240: There might be a mistake here.
Context: ...-----------------:|:------------------:| | low/medium | DP+TP8+EP8 / DP...

(QB_NEW_EN)


[grammar] ~241-~241: There might be a mistake here.
Context: ...28 | 32768 | | high | DP+TP8+EP8 / DP...

(QB_NEW_EN)


[grammar] ~242-~242: There might be a mistake here.
Context: ... | 133120 | | low/medium | TP8 / TP4 ...

(QB_NEW_EN)


[grammar] ~243-~243: There might be a mistake here.
Context: ...024 | 32768 | | high | TP8 / TP4 ...

(QB_NEW_EN)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label Aug 26, 2025
@karljang
Collaborator

@binghanc , thanks for the contribution~

  • Could you add "[None][docs]" to the title to address the failure of Check PR Title Format (pull_request)?
  • Also, could you check the comments above from @dongfengy?

@binghanc binghanc changed the title Binghanc/gpt oss docs [None][docs] Binghanc/gpt oss docs Sep 1, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md (1)

246-255: Point evals to your local server; fix dataset/model naming and hyphenation

  • Prepend env vars so gpt_oss.evals hits your local TRT‑LLM endpoint (otherwise it may call OpenAI by default).
  • In prose: “reasoning effort” and “AIME 2025”.
  • Use the served model id consistently: openai/gpt-oss-120b (server and Basic Test use this).
  • Keep the dataset flag --eval gpqa,aime25 but align the prose to “AIME 2025”.
````diff
-Below is an example command for evaluating the accuracy of gpt-oss-120b with low and medium reasoning-effort on GPQA and AIME2025.
+Below is an example command for evaluating the accuracy of gpt-oss-120b with low and medium reasoning effort on GPQA and AIME 2025.

 ```shell
-# execute this command in gpt-oss
-python -m gpt_oss.evals \
+# In the gpt-oss repository, target your local TRT-LLM server:
+export OPENAI_API_KEY="EMPTY"                # Any non-empty value is accepted
+export OPENAI_BASE_URL="http://localhost:8000/v1"
+
+# Then run:
+python -m gpt_oss.evals \
   --sampler chat_completions \
   --eval gpqa,aime25 \
-  --model gpt-oss-120b \
+  --model openai/gpt-oss-120b \
   --reasoning-effort low,medium
 ```
+
+Reminder: Ensure the TRT-LLM server was launched with a `--max_num_tokens` value compatible with the table above.
````
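To make the env-var wiring in the suggestion above concrete, here is a hedged sketch (not part of the PR) of the kind of chat-completions request an OpenAI-compatible client would send to the local TRT-LLM endpoint. The helper name and the `reasoning_effort` field are illustrative assumptions, not taken from the `gpt_oss.evals` source.

```python
import os

def build_chat_request(model: str, prompt: str, reasoning_effort: str) -> dict:
    """Assemble the URL and body of a chat-completions request, honoring
    OPENAI_BASE_URL the same way an OpenAI-compatible client would."""
    base_url = os.environ.get("OPENAI_BASE_URL", "http://localhost:8000/v1")
    return {
        "url": f"{base_url}/chat/completions",
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            # Field name is illustrative; evals pass effort via --reasoning-effort.
            "reasoning_effort": reasoning_effort,
        },
    }

req = build_chat_request("openai/gpt-oss-120b", "ping", "low")
print(req["url"])
```

Inspecting the built request (rather than sending it) is a quick way to confirm the eval harness will hit the local server instead of the default OpenAI endpoint.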


📜 Review details

Configuration used: Path: .coderabbit.yaml

📥 Commits

Reviewing files that changed from the base of the PR and between 9bcc11b7a7ccc586705f98ff9cffd16cb3fd63fe and 2ac1ef6200253ec2ced877524ce1b9a79d04e5a6.

📒 Files selected for processing (1)
  • docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md (1 hunks)

@binghanc binghanc force-pushed the binghanc/gpt-oss-docs branch from 2ac1ef6 to 0a257e8 Compare September 2, 2025 04:26
@binghanc binghanc marked this pull request as ready for review September 2, 2025 04:48
@binghanc binghanc requested a review from a team as a code owner September 2, 2025 04:48
@binghanc binghanc changed the title [None][docs] Binghanc/gpt oss docs [None][docs] binghanc/gpt oss docs Sep 2, 2025
@binghanc binghanc changed the title [None][docs] binghanc/gpt oss docs [None][docs] refine docs for gpt-oss Sep 2, 2025
@binghanc binghanc changed the title [None][docs] refine docs for gpt-oss [None][docs] refine docs for accuracy evaluation of gpt-oss models Sep 2, 2025
@binghanc
Contributor Author

binghanc commented Sep 8, 2025

/bot skip

@github-actions

github-actions bot commented Sep 8, 2025

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
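For instance, a typical invocation combining the flags documented above might look like the following. The stage name is the placeholder used in the help text itself, not a real stage:

```text
/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"
```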

@binghanc
Contributor Author

binghanc commented Sep 8, 2025

/bot skip --comments "Modifying the document does not require code testing"

Signed-off-by: binghanc <binghanc@nvidia.com>
Signed-off-by: binghanc <binghanc@nvidia.com>
@nv-guomingz nv-guomingz force-pushed the binghanc/gpt-oss-docs branch from 6a89fc4 to 0e7fdcd Compare September 8, 2025 01:24
@nv-guomingz
Collaborator

/bot skip --comment "docs change only"

@nv-guomingz nv-guomingz enabled auto-merge (squash) September 8, 2025 01:26
@nv-guomingz nv-guomingz disabled auto-merge September 8, 2025 01:28
@tensorrt-cicd
Collaborator

PR_Github #17959 [ skip ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17959 [ skip ] completed with state SUCCESS
Skipping testing for commit 0e7fdcd

@nv-guomingz nv-guomingz merged commit 14ee43e into NVIDIA:main Sep 8, 2025
5 checks passed
Wong4j pushed a commit to Wong4j/TensorRT-LLM that referenced this pull request Sep 20, 2025
…VIDIA#7252)

Signed-off-by: 176802681+binghanc@users.noreply.github.com

6 participants