[TRTLLM-6991][chore] add DeepSeek-R1 FP8 accuracy tests on Blackwell by lfr-0531 · Pull Request #6710 · NVIDIA/TensorRT-LLM · GitHub

Conversation

@lfr-0531
Collaborator

@lfr-0531 lfr-0531 commented Aug 7, 2025

Description

Add DeepSeek-R1 FP8 accuracy tests on Blackwell.

Test Coverage

accuracy/test_llm_api_pytorch.py::TestDeepSeekR1::test_fp8_blockscale[throughput]

Summary by CodeRabbit

  • Tests

    • Added accuracy and throughput tests for DeepSeek‑R1 with FP8_BLOCK_SCALES + FP8 KV-cache; test lists and timeout entries updated.
    • Test runtime and configuration now vary with the hardware platform's SM version.
  • Documentation

    • Added accuracy reference entries reporting results for the new quantization configurations.
  • Chores

    • Expanded CI multi‑node PyTorch test matrix with an additional GB200 multi‑node stage.

@coderabbitai
Contributor

coderabbitai bot commented Aug 7, 2025

📝 Walkthrough

Walkthrough

TestDeepSeekR1.test_fp8_blockscale now selects its MoE and KV-cache initialization based on the detected SM version; the accuracy reference YAMLs gain FP8 kv_cache entries for DeepSeek-R1; the test is registered in the QA lists and in the GB200 multi-node timeout list; and the Jenkins multi-node split count is incremented by one.

Changes

Cohort / File(s) Change Summary
Conditional Test Config Update
tests/integration/defs/accuracy/test_llm_api_pytorch.py
Made moe_config and kv_cache_config conditional on get_sm_version() inside TestDeepSeekR1.test_fp8_blockscale, and added moe_config into pytorch_config.
Accuracy Reference Additions
tests/integration/defs/accuracy/references/gsm8k.yaml, tests/integration/defs/accuracy/references/mmlu.yaml
Added kv_cache_quant_algo: FP8 entries for deepseek-ai/DeepSeek-R1 under FP8_BLOCK_SCALES (retained existing entries); added a new Qwen3 quant entry in MMLU.
Test List Registration
tests/integration/test_lists/qa/llm_function_full.txt
Inserted accuracy/test_llm_api_pytorch.py::TestDeepSeekR1::test_fp8_blockscale[throughput] into the QA LLM function list.
Test Timeout Update
tests/integration/test_lists/test-db/l0_gb200_multi_nodes.yml
Added the new test to the 180s timeout list for GB200 8-GPU multi-node PyTorch post-merge tests.
Jenkins Multi-node Stage
jenkins/L0_Test.groovy
Added a new GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-7 entry and incremented split_count to 7 for GB200 multi-node PyTorch post-merge splits.
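Based on the change summary above, the new accuracy-reference entries in gsm8k.yaml/mmlu.yaml presumably take a shape like the sketch below. The key layout and the accuracy score are illustrative placeholders; only the model name, `quant_algo: FP8_BLOCK_SCALES`, and `kv_cache_quant_algo: FP8` come from the PR summary.

```yaml
deepseek-ai/DeepSeek-R1:
  - quant_algo: FP8_BLOCK_SCALES       # existing entry, retained
    accuracy: 0.0                      # placeholder score, not the real reference value
  - quant_algo: FP8_BLOCK_SCALES       # new entry added by this PR
    kv_cache_quant_algo: FP8
    accuracy: 0.0                      # placeholder score, not the real reference value
```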

Sequence Diagram(s)

sequenceDiagram
    participant Runner
    participant TestDeepSeekR1
    participant SMChecker

    Runner->>TestDeepSeekR1: run test_fp8_blockscale
    TestDeepSeekR1->>SMChecker: get_sm_version()
    SMChecker-->>TestDeepSeekR1: sm_version
    alt sm_version == 100
        TestDeepSeekR1->>TestDeepSeekR1: set moe_config(backend="DEEPGEMM", max_num_tokens=16384)\nset kv_cache_config(free_gpu_memory_fraction=0.6)
    else
        TestDeepSeekR1->>TestDeepSeekR1: set moe_config(default)\nset kv_cache_config(free_gpu_memory_fraction=0.9)
    end
    TestDeepSeekR1->>Runner: proceed with selected configs
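The branch shown in the sequence diagram can be sketched in plain Python. This is an illustrative stand-in, not the actual test code: the `MoeConfig`/`KvCacheConfig` field layout and the "CUTLASS" default backend are assumptions; only the SM-version-100 check, the DEEPGEMM backend, `max_num_tokens=16384`, and the 0.6/0.9 free-GPU-memory fractions come from the diagram above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class MoeConfig:
    """Stand-in for tensorrt_llm.llmapi.llm_args.MoeConfig (fields assumed)."""
    backend: str = "CUTLASS"  # assumed default backend
    max_num_tokens: Optional[int] = None


@dataclass
class KvCacheConfig:
    """Stand-in for the KV-cache config object used by the test."""
    free_gpu_memory_fraction: float = 0.9


def get_sm_version() -> int:
    """Stub for tensorrt_llm._utils.get_sm_version; 100 corresponds to Blackwell."""
    return 100


def build_test_configs() -> Tuple[MoeConfig, KvCacheConfig]:
    # Mirror the conditional in the sequence diagram: Blackwell (SM 100) gets
    # the DEEPGEMM MoE backend and a smaller KV-cache memory fraction.
    if get_sm_version() == 100:
        moe_config = MoeConfig(backend="DEEPGEMM", max_num_tokens=16384)
        kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.6)
    else:
        moe_config = MoeConfig()
        kv_cache_config = KvCacheConfig()  # default fraction 0.9
    return moe_config, kv_cache_config
```

On a Blackwell host the stubbed `get_sm_version()` returns 100, so `build_test_configs()` yields the DEEPGEMM/0.6 pair; on any other SM version it falls back to the defaults.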

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~15 minutes

Suggested labels

Community want to contribute

Suggested reviewers

  • litaotju
  • yizhang-nv
  • brb-nv
  • yuxianq
  • syuoni


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1b9781e and 16f1394.

📒 Files selected for processing (2)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunks)
  • tests/integration/test_lists/test-db/l0_dgx_b200.yml (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/test_lists/test-db/l0_dgx_b200.yml
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
📚 Learning: in tensorrt-llm, test files (files under tests/ directories) do not require nvidia copyright headers...
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tests/integration/test_lists/test-db/l0_dgx_b200.yml
🧬 Code Graph Analysis (1)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (3)
tensorrt_llm/_utils.py (1)
  • get_sm_version (681-683)
tests/integration/defs/conftest.py (1)
  • get_sm_version (1857-1860)
tensorrt_llm/llmapi/llm_args.py (1)
  • MoeConfig (166-188)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)

1634-1642: LGTM: Appropriate hardware-specific MoE backend selection.

The conditional logic correctly selects the DEEPGEMM backend for Blackwell B200 GPUs (SM version 100) while falling back to the default MoeConfig for other hardware. This aligns with the PR objective of adding DeepSeek-R1 FP8 accuracy tests specifically for Blackwell platform.

@lfr-0531
Collaborator Author

lfr-0531 commented Aug 7, 2025

/bot run --post-merge

@tensorrt-cicd
Collaborator

PR_Github #14470 [ run ] triggered by Bot

@lfr-0531
Collaborator Author

lfr-0531 commented Aug 7, 2025

/bot kill

@tensorrt-cicd
Collaborator

PR_Github #14475 [ kill ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #14470 [ run ] completed with state ABORTED

@tensorrt-cicd
Collaborator

PR_Github #14475 [ kill ] completed with state SUCCESS
Successfully killed previous jobs for commit da19e6e

@lfr-0531 force-pushed the user/fanrongl/add_r1_fp8_blackwell_acc_test branch from cda210a to 3563e05 (August 7, 2025 15:33)
@lfr-0531
Collaborator Author

lfr-0531 commented Aug 7, 2025

/bot run --post-merge

@litaotju litaotju changed the title [None][chore] add DeepSeek-R1 FP8 accuracy tests on Blackwell [TRTLLM-6991][chore] add DeepSeek-R1 FP8 accuracy tests on Blackwell Aug 7, 2025
@tensorrt-cicd
Collaborator

PR_Github #14490 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #14490 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #10945 completed with status: 'FAILURE'

@lfr-0531
Collaborator Author

lfr-0531 commented Aug 8, 2025

/bot run --post-merge --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #14545 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #14545 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #10989 completed with status: 'FAILURE'

@lfr-0531
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #14688 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #14688 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11086 completed with status: 'SUCCESS'

@lfr-0531
Collaborator Author

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd
Collaborator

PR_Github #14709 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #14709 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11101 completed with status: 'FAILURE'

@lfr-0531
Collaborator Author

/bot run --disable-fail-fast --post-merge --only-multi-gpu-test

@tensorrt-cicd
Collaborator

PR_Github #14740 [ run ] triggered by Bot

@lfr-0531
Collaborator Author

/bot kill

@tensorrt-cicd
Collaborator

PR_Github #14770 [ kill ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15375 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15375 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #11591 completed with status: 'FAILURE'

@lfr-0531
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15453 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15453 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11643 completed with status: 'FAILURE'

@lfr-0531
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15501 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15501 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11670 completed with status: 'FAILURE'

@lfr-0531
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15507 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15507 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11675 completed with status: 'FAILURE'

@lfr-0531
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15533 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15533 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11698 completed with status: 'FAILURE'

@lfr-0531
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15572 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15572 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11727 completed with status: 'FAILURE'

@lfr-0531
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15603 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15603 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11746 completed with status: 'FAILURE'

@lfr-0531
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15630 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15630 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11767 completed with status: 'SUCCESS'

@lfr-0531 lfr-0531 merged commit 816a120 into NVIDIA:main Aug 19, 2025
4 checks passed
@lfr-0531 lfr-0531 deleted the user/fanrongl/add_r1_fp8_blackwell_acc_test branch September 22, 2025 07:14