[None][feat] Cherry-pick DeepGEMM related commits from release/1.1.0rc2 #7716

Barry-Delaney · 2025-09-15T07:29:12Z

Summary by CodeRabbit

Performance Improvements
- Improved speed and responsiveness for small-matrix FP8 computations by enabling adaptive tactic selection, optimizing kernel choice automatically for smaller workloads.
Chores
- Updated third-party compute library reference to a new source and revision to ensure alignment with the latest optimizations.
Refactor
- Internal execution path now dynamically selects the most suitable compute method based on input size, improving efficiency without changing user-facing APIs or behavior.

Barry-Delaney · 2025-09-15T07:30:28Z

/bot run

tensorrt-cicd · 2025-09-15T07:35:58Z

PR_Github #18593 [ run ] triggered by Bot

coderabbitai · 2025-09-15T07:37:22Z

📝 Walkthrough

The DeepGEMM submodule URL and branch were updated in .gitmodules, and its pointer was advanced. In torch_custom_ops.py, tactic selection was expanded to include an additional tactic for small M, and forward now dynamically dispatches between fp8_gemm_nt and fp8_gemm_ntt based on the selected tactic.

Changes

| Cohort / File(s) | Summary of modifications |
| --- | --- |
| Submodule config `.gitmodules` | Updated DeepGEMM URL to https://github.com/ruoqianguo/DeepGEMM.git and set branch to swapab_sm100. |
| Submodule pointer `3rdparty/DeepGEMM` | Advanced submodule commit from 7b6b556 to 67e3c4d. |
| Torch custom ops GEMM dispatch `tensorrt_llm/_torch/custom_ops/torch_custom_ops.py` | get_valid_tactics now returns [0,1] for input.shape[0] <= 128; otherwise [0]. forward now selects fp8_gemm_nt for tactic 0 and fp8_gemm_ntt for tactic 1, replacing a fixed call with dynamic dispatch. Comments adjusted accordingly. |
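
The tactic rule and dispatch summarized in the table can be sketched in a few lines. This is a minimal illustration, not the code in torch_custom_ops.py: the constant name `SMALL_M_THRESHOLD`, the helper `select_kernel`, and the two stand-in kernels are hypothetical, and only the `M <= 128` rule and the tactic-to-kernel mapping come from the summary.

```python
from typing import Callable, List, Sequence

# From the summary: tactic 1 is only offered when input.shape[0] <= 128.
SMALL_M_THRESHOLD = 128  # hypothetical constant name


def get_valid_tactics(input_shape: Sequence[int]) -> List[int]:
    """Return the candidate tactic ids for an input whose first dimension is M."""
    m = input_shape[0]
    return [0, 1] if m <= SMALL_M_THRESHOLD else [0]


def select_kernel(tactic: int, gemm_nt: Callable, gemm_ntt: Callable) -> Callable:
    """Mirror the described dispatch: tactic 1 -> ntt kernel, anything else -> nt kernel."""
    return gemm_ntt if tactic == 1 else gemm_nt


# Stand-in kernels (hypothetical). The real code calls
# deep_gemm.fp8_gemm_nt and deep_gemm.fp8_gemm_ntt on GPU tensors.
def fake_gemm_nt(*args, **kwargs) -> str:
    return "nt"


def fake_gemm_ntt(*args, **kwargs) -> str:
    return "ntt"
```

Because tactic 0 is always in the list, callers that never pick tactic 1 keep the previous behaviour.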

Sequence Diagram(s)

sequenceDiagram
  participant Caller as PyTorch Op
  participant Runner as fp8SwapABGemmRunner
  participant Tactics as get_valid_tactics
  participant Kernel0 as fp8_gemm_nt
  participant Kernel1 as fp8_gemm_ntt

  Caller->>Runner: forward(inputs, weights, ..., tactic)
  alt tactic unspecified
    Runner->>Tactics: get_valid_tactics(input.shape)
    Tactics-->>Runner: [0,1] if M<=128 else [0]
    note over Runner: Selects a tactic from valid list
  end
  alt tactic == 1
    Runner->>Kernel1: fp8_gemm_ntt(..., disable_ue8m0_cast)
  else tactic != 1
    Runner->>Kernel0: fp8_gemm_nt(..., disable_ue8m0_cast)
  end
  Kernel0-->>Runner: output
  Kernel1-->>Runner: output
  Runner-->>Caller: output
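
The diagram only says that a tactic is selected from the valid list; it does not say how. One common approach, shown here purely as an assumption, is to measure each candidate once per problem shape and cache the winner. The names `pick_fastest_tactic` and `TacticCache` are hypothetical, and `measure` is an injected callable so the sketch runs without a GPU.

```python
from typing import Callable, Dict, Hashable, Iterable


def pick_fastest_tactic(tactics: Iterable[int], measure: Callable[[int], float]) -> int:
    """Return the tactic with the lowest measured cost; ties go to the earliest one."""
    best_tactic, best_cost = None, float("inf")
    for tactic in tactics:
        cost = measure(tactic)
        if cost < best_cost:
            best_tactic, best_cost = tactic, cost
    if best_tactic is None:
        raise ValueError("no valid tactics to choose from")
    return best_tactic


class TacticCache:
    """Remember the winning tactic per problem key (for example the input shape)."""

    def __init__(self) -> None:
        self._best: Dict[Hashable, int] = {}

    def get_or_profile(self, key: Hashable, tactics: Iterable[int],
                       measure: Callable[[int], float]) -> int:
        # Profile only the first time a key is seen; later calls reuse the result.
        if key not in self._best:
            self._best[key] = pick_fastest_tactic(tactics, measure)
        return self._best[key]
```

In a real runner `measure` would time the kernel launch for the given tactic; here it can be any function returning a cost.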

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description Check | ⚠️ Warning | The PR description contains only the placeholder "@coderabbitai summary" and does not follow the repository's required template: it is missing the detailed "Description" of the change and rationale, the "Test Coverage" section listing relevant tests and how to run them, and the completed "PR Checklist" items needed for reviewers to validate the change. | Please replace the placeholder with a full description following the template: add a "Description" explaining what was changed and why (including any behavioral impact of the DeepGEMM/submodule and tactic changes), a "Test Coverage" section listing specific tests and commands to validate the changes, and complete the "PR Checklist" (coding guidelines, tests added, dependency/licensing notes, CODEOWNERS/docs updates) so reviewers have the required context to approve the PR. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |
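
For readers unfamiliar with the docstring coverage figure, a rough way to reproduce such a percentage locally is to walk a file's syntax tree and count definitions that carry a docstring. This is a sketch using Python's standard `ast` module; exactly which nodes the bot counts (for instance, whether the module itself is included) is an assumption here, and the function name is hypothetical.

```python
import ast


def docstring_coverage(source: str) -> float:
    """Return the percentage of the module, classes and functions that have a docstring."""
    tree = ast.parse(source)
    # Assumption: the module counts as one documentable unit alongside
    # every class and (async) function definition found in it.
    nodes = [tree] + [
        node for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]
    documented = sum(1 for node in nodes if ast.get_docstring(node) is not None)
    return 100.0 * documented / len(nodes)
```

A result below the threshold quoted in the table (80.00%) would correspond to the warning shown.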

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The title "[None][feat] Cherry-pick DeepGEMM related commits from release/1.1.0rc2" is concise, follows the repository's bracketed prefix format, and accurately summarizes the primary intent (cherry-picking DeepGEMM-related commits) without extraneous detail, so a reviewer scanning history can quickly understand the main change. |

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (6)

.gitmodules (2)
32-32: Avoid tracking a moving branch for submodules; it hurts reproducibility.

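The branch setting is only consulted by git submodule update --remote; a plain update always checks out the SHA recorded in the superproject. Unless you intentionally use --remote updates, drop branch. Setting it to . does not pin a commit: it makes --remote follow the branch whose name matches the superproject's current branch.

Apply one of:
-	branch = swapab_sm100
+	# On --remote updates, follow the branch named like the superproject's current branch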
+	branch = .
or simply remove the line:
-	branch = swapab_sm100
31-32: Optional: shallow‑clone the submodule to speed up CI.

If full history isn’t needed, enable shallow clone.
 [submodule "3rdparty/DeepGEMM"]
 	path = 3rdparty/DeepGEMM
 	url = https://github.com/ruoqianguo/DeepGEMM.git
+	shallow = true
-	branch = swapab_sm100
+	branch = .
3rdparty/DeepGEMM (3)

1-1: Pin to an org‑owned mirror or upstream; avoid personal forks.

If the fork is required for SM100 work, mirror it into the NVIDIA org (read‑only) and point .gitmodules there. Document divergence and sync policy in THIRD_PARTY.md.

1-1: CI/Build guardrails for submodules.

Ensure CI clones with submodules and fails fast on drift:

Use: git submodule sync --recursive && git submodule update --init --recursive --depth 1

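Add a job to verify the recorded SHA matches workspace (status lines starting with -, + or U mark an uninitialized, mismatched or conflicted submodule): git submodule status --recursive | awk '/^[-+U]/ {bad=1} END {exit bad}'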
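
The same drift check can be written in Python by parsing the text that `git submodule status` prints. The first character of each line is the state flag: a space when the checkout matches the recorded SHA, `-` when the submodule is not initialized, `+` when the checked-out commit differs, and `U` on merge conflicts. The sketch below only parses text passed to it, so it runs without a repository; the type and function names are hypothetical, and paths containing spaces are not handled.

```python
from typing import List, NamedTuple


class SubmoduleState(NamedTuple):
    flag: str  # ' ' in sync, '-' uninitialized, '+' SHA mismatch, 'U' merge conflict
    sha: str
    path: str


def parse_submodule_status(output: str) -> List[SubmoduleState]:
    """Parse the output of `git submodule status --recursive`."""
    states = []
    for line in output.splitlines():
        if not line.strip():
            continue
        # The flag is the first character; SHA and path follow, separated by spaces.
        flag, fields = line[0], line[1:].split()
        states.append(SubmoduleState(flag, fields[0], fields[1]))
    return states


def drifted(states: List[SubmoduleState]) -> List[str]:
    """Return paths of submodules whose state flag is anything other than in-sync."""
    return [state.path for state in states if state.flag != " "]
```

A CI job could feed it the captured command output and fail when `drifted(...)` is non-empty.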

1-1: Runtime compatibility sanity checks (SM100 path).

Given the branch name swapab_sm100 and the AI summary about tactic changes, gate kernels by compute capability and add A/B numeric checks (tactic 0 vs 1) for small‑M shapes in CI.

I can draft a minimal PyTest that exercises fp8_gemm_nt vs fp8_gemm_ntt across M∈{32,64,128}, random seeds, and dtypes to assert close() and perf thresholds. Want me to include it?
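
The shape of such an A/B check can be shown without a GPU. The sketch below uses pure-Python stand-ins, not the DeepGEMM kernels: `gemm_nt` computes A·Bᵀ directly, and `gemm_swap_ab` computes the same product with the operands swapped, (B·Aᵀ)ᵀ, which is an assumption about what "swap-AB" means here. All function names are hypothetical, and a real FP8 test would need a looser tolerance.

```python
import random
from typing import List

Matrix = List[List[float]]


def gemm_nt(a: Matrix, b: Matrix) -> Matrix:
    """C[m][n] = sum_k A[m][k] * B[n][k], i.e. A @ B^T."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b] for row_a in a]


def gemm_swap_ab(a: Matrix, b: Matrix) -> Matrix:
    """Compute the same product with operands swapped, then transpose back."""
    c_transposed = gemm_nt(b, a)  # shape N x M
    return [list(column) for column in zip(*c_transposed)]


def max_abs_diff(x: Matrix, y: Matrix) -> float:
    """Largest element-wise absolute difference between two same-shaped matrices."""
    return max(abs(p - q) for row_x, row_y in zip(x, y) for p, q in zip(row_x, row_y))


def tactics_agree(m: int, n: int = 8, k: int = 16, seed: int = 0, tol: float = 1e-9) -> bool:
    """Check that both stand-in tactics give matching results for an M x K by N x K problem."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(m)]
    b = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(n)]
    return max_abs_diff(gemm_nt(a, b), gemm_swap_ab(a, b)) <= tol
```

Looping `tactics_agree` over M in {32, 64, 128} and a few seeds mirrors the coverage proposed in the comment.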
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
945-945: Remove empty line for consistency.
-
        forward_func = deep_gemm.fp8_gemm_ntt if tactic == 1 else deep_gemm.fp8_gemm_nt

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d5df0af and 9da50c0.

📒 Files selected for processing (3)

.gitmodules (1 hunks)
3rdparty/DeepGEMM (1 hunks)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (2 hunks)

🧰 Additional context used

📓 Path-based instructions (3)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

tensorrt_llm/_torch/custom_ops/torch_custom_ops.py

**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
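
Several of these rules (snake_case names, a Google-style docstring, the most specific exception, a minimal try body with the main logic in else) can be seen together in a short example. The function below is purely illustrative and is not taken from the repository.

```python
def total_length(items):
    """Sum the lengths of the given items, skipping those without a length.

    Args:
        items: Iterable of arbitrary objects.

    Returns:
        int: Combined length of all items that support ``len()``.
    """
    total = 0
    for item in items:
        try:
            # Keep the try body to the single call that may fail.
            item_length = len(item)
        except TypeError:
            # Most specific exception: len() raises TypeError for unsized objects.
            continue
        else:
            # Main logic lives in else, outside the guarded region.
            total += item_length
    return total
```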

Files:

tensorrt_llm/_torch/custom_ops/torch_custom_ops.py

**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

tensorrt_llm/_torch/custom_ops/torch_custom_ops.py

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pre-commit Check

🔇 Additional comments (5)

3rdparty/DeepGEMM (3)

1-1: License continuity check required — manual verification

GitHub API returned 404 for submodule 3rdparty/DeepGEMM; confirm the fork’s LICENSE matches approved terms (no additional clauses) and update NOTICE/THIRD_PARTY artifacts if the license or attribution differs. If the submodule is private/relocated, paste the LICENSE text or provide an accessible link.

1-1: Blocker — submodule points to a personal fork; do not merge until supply‑chain & license review completes.

Location: .gitmodules → submodule 3rdparty/DeepGEMM

Evidence: URL = https://github.com/ruoqianguo/DeepGEMM.git, branch = swapab_sm100, commit = 67e3c4d3d09b59405fd6e7698a33db747ed96533.

Verification: automated GH metadata check failed (gh returned 404 / owner_repo parsed as ruoqianguo/DeepGEMM.git); repo/license/commit not confirmed.

Action: either switch to an org-owned or upstream repo pinned to a specific commit (preferred), or provide an approved supply‑chain & license review for this personal fork before merging.

1-1: Attest determinism: ensure upstream commit 67e3c4d (67e3c4d3d09b59405fd6e7698a33db747ed96533) is tagged and protected

Create an annotated tag pointing at that SHA in the submodule remote and push it.

Protect the branch containing the commit (disable force-push).

Add a CI check that verifies the annotated tag → commit mapping (fail if missing/mismatched).

Automated check returned HTTP 404 — re-run the supplied verification script with GH authentication or confirm manually that the tag and branch protection exist.

Location: 3rdparty/DeepGEMM (submodule)
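
One way to implement the suggested tag-to-commit check is to parse the output of `git ls-remote --tags <remote>`. For an annotated tag that command prints two lines: the tag object itself and a peeled `refs/tags/<name>^{}` entry holding the commit it points at. The sketch below works on captured text only; the function names are hypothetical, and it does not distinguish annotated from lightweight tags.

```python
from typing import Dict


def tag_commits(ls_remote_output: str) -> Dict[str, str]:
    """Map tag name -> commit SHA from `git ls-remote --tags` output.

    When a peeled `^{}` entry exists for a tag, its SHA (the commit) wins
    over the SHA of the tag object.
    """
    commits: Dict[str, str] = {}
    prefix = "refs/tags/"
    for line in ls_remote_output.splitlines():
        if not line.strip():
            continue
        sha, ref = line.split()
        name = ref[len(prefix):] if ref.startswith(prefix) else ref
        if name.endswith("^{}"):
            commits[name[:-3]] = sha
        else:
            commits.setdefault(name, sha)
    return commits


def tag_points_at(ls_remote_output: str, tag: str, expected_sha: str) -> bool:
    """Return True when the tag exists and resolves to the expected commit."""
    return tag_commits(ls_remote_output).get(tag) == expected_sha
```

A CI step could run the command against the submodule remote and fail when `tag_points_at` returns False for the pinned SHA.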

tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (2)

929-931: LGTM! Enhanced tactic selection for small batch sizes.

The logic correctly expands tactic options to include both swap_ab variants (0 and 1) when input batch size is ≤ 128, which aligns with the PR's goal of optimizing small M performance with DeepGEMM.

946-952: LGTM! Dynamic dispatch implementation is correct.

The implementation correctly selects between fp8_gemm_ntt (tactic 1) and fp8_gemm_nt (default) based on the chosen tactic, with all parameters properly forwarded including disable_ue8m0_cast.

.gitmodules

tensorrt-cicd · 2025-09-15T11:12:16Z

PR_Github #18593 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13955 completed with status: 'SUCCESS'

Barry-Delaney · 2025-09-17T02:58:22Z

/bot run

tensorrt-cicd · 2025-09-17T03:03:51Z

PR_Github #18873 [ run ] triggered by Bot

tensorrt-cicd · 2025-09-17T05:03:53Z

PR_Github #18873 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14148 completed with status: 'FAILURE'

Barry-Delaney · 2025-09-17T05:24:47Z

/bot run

tensorrt-cicd · 2025-09-17T05:30:35Z

PR_Github #18898 [ run ] triggered by Bot

tensorrt-cicd · 2025-09-17T07:36:17Z

PR_Github #18898 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14167 completed with status: 'FAILURE'

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>

Barry-Delaney · 2025-09-18T02:22:08Z

/bot run

tensorrt-cicd · 2025-09-18T02:27:05Z

PR_Github #19088 [ run ] triggered by Bot

tensorrt-cicd · 2025-09-18T05:51:46Z

PR_Github #19088 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14319 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

…c2 (NVIDIA#7716) Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>

Barry-Delaney requested a review from a team as a code owner September 15, 2025 07:29

Barry-Delaney requested a review from liji-nv September 15, 2025 07:29

coderabbitai bot reviewed Sep 15, 2025

Barry-Delaney requested a review from litaotju September 15, 2025 12:30

liji-nv approved these changes Sep 16, 2025

Barry-Delaney force-pushed the user/barry/cherry_pick_dg branch 2 times, most recently from 7ea492e to 7d65bf6 Compare September 17, 2025 02:58

Barry-Delaney enabled auto-merge (squash) September 17, 2025 03:01

Barry-Delaney force-pushed the user/barry/cherry_pick_dg branch from 7d65bf6 to 6c8f0c3 Compare September 17, 2025 05:24

Barry-Delaney added 3 commits September 18, 2025 10:21

[None][feat] Support DeepGEMM swap-AB on sm100 (NVIDIA#7355)

e87e69b

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>

[None][fix] Update DG side branch name (NVIDIA#7491)

bd7dbba

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>

[None][fix] Update DG commit (NVIDIA#7534)

d6e1515

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>

Barry-Delaney force-pushed the user/barry/cherry_pick_dg branch from 6c8f0c3 to d6e1515 Compare September 18, 2025 02:21

Barry-Delaney merged commit 4f0e6b5 into NVIDIA:main Sep 18, 2025
5 checks passed

Wong4j pushed a commit to Wong4j/TensorRT-LLM that referenced this pull request Sep 20, 2025

[None][feat] Cherry-pick DeepGEMM related commits from release/1.1.0r…

a452a46

…c2 (NVIDIA#7716) Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>

MrGeva pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request Sep 21, 2025

[None][feat] Cherry-pick DeepGEMM related commits from release/1.1.0r…

37cc52b

…c2 (NVIDIA#7716) Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>

coderabbitai bot mentioned this pull request Oct 20, 2025

[None][feat] Update 3rdparty/DeepGEMM to latest commit #8488

Merged

1 task
