[None][feat] Cherry-pick DeepGEMM related commits from release/1.1.0rc2 by Barry-Delaney · Pull Request #7716 · NVIDIA/TensorRT-LLM · GitHub

Conversation

@Barry-Delaney
Collaborator

@Barry-Delaney Barry-Delaney commented Sep 15, 2025

Summary by CodeRabbit

  • Performance Improvements

    • Faster FP8 GEMM for small inputs: a second tactic is now offered when the first input dimension is at most 128, so kernel choice can adapt to smaller workloads.
  • Chores

    • Updated the DeepGEMM submodule to a new source URL and revision to pick up the latest optimizations.
  • Refactor

    • Internal execution path now dynamically selects the most suitable compute method based on input size, improving efficiency without changing user-facing APIs or behavior.

@Barry-Delaney Barry-Delaney requested a review from a team as a code owner September 15, 2025 07:29
@Barry-Delaney
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #18593 [ run ] triggered by Bot

@coderabbitai
Contributor

coderabbitai bot commented Sep 15, 2025

📝 Walkthrough


The DeepGEMM submodule URL and branch were updated in .gitmodules, and its pointer was advanced. In torch_custom_ops.py, tactic selection was expanded to include an additional tactic for small M, and forward now dynamically dispatches between fp8_gemm_nt and fp8_gemm_ntt based on the selected tactic.

Changes

| Cohort / File(s) | Summary of modifications |
| --- | --- |
| Submodule config: .gitmodules | Updated DeepGEMM URL to https://github.com/ruoqianguo/DeepGEMM.git and set branch to swapab_sm100. |
| Submodule pointer: 3rdparty/DeepGEMM | Advanced submodule commit from 7b6b556 to 67e3c4d. |
| Torch custom ops GEMM dispatch: tensorrt_llm/_torch/custom_ops/torch_custom_ops.py | get_valid_tactics now returns [0,1] for input.shape[0] <= 128; otherwise [0]. forward now selects fp8_gemm_nt for tactic 0 and fp8_gemm_ntt for tactic 1, replacing a fixed call with dynamic dispatch. Comments adjusted accordingly. |
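
The tactic selection and dispatch summarized above can be sketched in a few lines. This is an illustration only, not the code in torch_custom_ops.py: the two kernel functions are hypothetical stand-ins that just report which path ran, and the function signatures are assumptions. Only the M <= 128 threshold and the tactic-to-kernel mapping come from the summary.

```python
from typing import Callable, List, Sequence


# Hypothetical stand-ins for deep_gemm.fp8_gemm_nt / deep_gemm.fp8_gemm_ntt.
# The real kernels operate on FP8 tensors; these only report which path ran.
def fp8_gemm_nt(m: int) -> str:
    return f"fp8_gemm_nt(M={m})"


def fp8_gemm_ntt(m: int) -> str:
    return f"fp8_gemm_ntt(M={m})"


# From the summary: tactic 1 is offered only when input.shape[0] <= 128.
SMALL_M_THRESHOLD = 128


def get_valid_tactics(input_shape: Sequence[int]) -> List[int]:
    """Return candidate tactics for an input whose first dimension is M."""
    return [0, 1] if input_shape[0] <= SMALL_M_THRESHOLD else [0]


def forward(input_shape: Sequence[int], tactic: int = 0) -> str:
    """Dispatch on tactic; any value other than 1 falls back to fp8_gemm_nt."""
    kernel: Callable[[int], str] = fp8_gemm_ntt if tactic == 1 else fp8_gemm_nt
    return kernel(input_shape[0])


print(get_valid_tactics((64, 4096)), forward((64, 4096), tactic=1))
```

Keeping tactic 0 as the fallback means a caller that passes no tactic, or an unknown one, still gets the kernel that was used before this change.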

Sequence Diagram(s)

sequenceDiagram
  participant Caller as PyTorch Op
  participant Runner as fp8SwapABGemmRunner
  participant Tactics as get_valid_tactics
  participant Kernel0 as fp8_gemm_nt
  participant Kernel1 as fp8_gemm_ntt

  Caller->>Runner: forward(inputs, weights, ..., tactic)
  alt tactic unspecified
    Runner->>Tactics: get_valid_tactics(input.shape)
    Tactics-->>Runner: [0,1] if M<=128 else [0]
    note over Runner: Selects a tactic from valid list
  end
  alt tactic == 1
    Runner->>Kernel1: fp8_gemm_ntt(..., disable_ue8m0_cast)
  else tactic != 1
    Runner->>Kernel0: fp8_gemm_nt(..., disable_ue8m0_cast)
  end
  Kernel0-->>Runner: output
  Kernel1-->>Runner: output
  Runner-->>Caller: output

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description Check | ⚠️ Warning | The PR description contains only the placeholder "@coderabbitai summary" and does not follow the repository's required template: it is missing the detailed "Description" of the change and rationale, the "Test Coverage" section listing relevant tests and how to run them, and the completed "PR Checklist" items needed for reviewers to validate the change. | Please replace the placeholder with a full description following the template: add a "Description" explaining what was changed and why (including any behavioral impact of the DeepGEMM/submodule and tactic changes), a "Test Coverage" section listing specific tests and commands to validate the changes, and complete the "PR Checklist" (coding guidelines, tests added, dependency/licensing notes, CODEOWNERS/docs updates) so reviewers have the required context to approve the PR. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The title "[None][feat] Cherry-pick DeepGEMM related commits from release/1.1.0rc2" is concise, follows the repository's bracketed prefix format, and accurately summarizes the primary intent (cherry-picking DeepGEMM-related commits) without extraneous detail, so a reviewer scanning history can quickly understand the main change. |


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (6)
.gitmodules (2)

32-32: Avoid tracking a moving branch for submodules; it hurts reproducibility.

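Unless you intentionally update with git submodule update --remote, drop the branch setting so that nothing follows the moving branch. A plain git submodule update always checks out the SHA recorded in the superproject, whatever branch is configured. Setting branch to . is also possible, but it only changes which branch --remote follows (the one with the same name as the superproject's current branch); it does not pin a commit.

Apply one of:

-	branch = swapab_sm100
+	# "." makes --remote follow the branch named like the superproject's current branch
+	branch = .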

or simply remove the line:

-	branch = swapab_sm100

31-32: Optional: shallow‑clone the submodule to speed up CI.

If full history isn’t needed, enable shallow clone.

 [submodule "3rdparty/DeepGEMM"]
 	path = 3rdparty/DeepGEMM
 	url = https://github.com/ruoqianguo/DeepGEMM.git
+	shallow = true
-	branch = swapab_sm100
+	branch = .
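
A check for the branch-tracking concern above could look like the following sketch. It is an assumption about how one might lint .gitmodules, not part of this PR: the helper name tracked_branches is hypothetical, and the sample text mirrors the entry shown in this review. Each line is stripped before parsing so that tab-indented keys cannot be mistaken for continuation lines by configparser.

```python
import configparser
from typing import Dict


def tracked_branches(gitmodules_text: str) -> Dict[str, str]:
    """Map each submodule section name to its configured branch, if it sets one."""
    # Strip indentation so configparser sees one "key = value" pair per line.
    normalized = "\n".join(line.strip() for line in gitmodules_text.splitlines())
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(normalized)
    return {
        section: parser[section]["branch"]
        for section in parser.sections()
        if parser.has_option(section, "branch")
    }


SAMPLE = """
[submodule "3rdparty/DeepGEMM"]
\tpath = 3rdparty/DeepGEMM
\turl = https://github.com/ruoqianguo/DeepGEMM.git
\tbranch = swapab_sm100
"""

# Any entry reported here tracks a named branch and deserves a second look.
print(tracked_branches(SAMPLE))
```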
3rdparty/DeepGEMM (3)

1-1: Pin to an org‑owned mirror or upstream; avoid personal forks.

If the fork is required for SM100 work, mirror it into the NVIDIA org (read‑only) and point .gitmodules there. Document divergence and sync policy in THIRD_PARTY.md.


1-1: CI/Build guardrails for submodules.

Ensure CI clones with submodules and fails fast on drift:

  • Use: git submodule sync --recursive && git submodule update --init --recursive --depth 1
  • Add a job to verify that the recorded SHA matches the workspace: git submodule status --recursive | awk '/^[-+U]/ {bad=1} END {exit bad}'
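
The drift check in the list above can also be written as a small script that reports which submodule is out of sync. This is a sketch under assumptions: find_drift is a hypothetical helper, the second and third sample paths are invented for illustration, and the input is the text printed by git submodule status, whose first column is a space when the submodule is in sync, "-" when it is not initialized, "+" when the checked-out commit differs from the recorded one, and "U" on merge conflicts.

```python
from typing import List

# Prefix characters printed by `git submodule status` for out-of-sync entries.
DRIFT_PREFIXES = {
    "-": "not initialized",
    "+": "commit differs from recorded SHA",
    "U": "merge conflict",
}


def find_drift(status_output: str) -> List[str]:
    """Return one message per submodule that is not in sync."""
    problems = []
    for line in status_output.splitlines():
        if not line.strip():
            continue
        reason = DRIFT_PREFIXES.get(line[0])
        if reason is not None:
            # After the prefix the fields are: SHA, path, optional description.
            path = line[1:].split()[1]
            problems.append(f"{path}: {reason}")
    return problems


SAMPLE_STATUS = (
    " 67e3c4d3d09b59405fd6e7698a33db747ed96533 3rdparty/DeepGEMM (heads/swapab_sm100)\n"
    "+abc123 3rdparty/other (v1)\n"  # hypothetical entry
    "-def456 3rdparty/third\n"  # hypothetical entry
)

for message in find_drift(SAMPLE_STATUS):
    print(message)
```

In CI the script would exit non-zero when the returned list is not empty.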

1-1: Runtime compatibility sanity checks (SM100 path).

Given the branch name swapab_sm100 and the AI summary about tactic changes, gate kernels by compute capability and add A/B numeric checks (tactic 0 vs 1) for small‑M shapes in CI.

I can draft a minimal PyTest that exercises fp8_gemm_nt vs fp8_gemm_ntt across M∈{32,64,128}, random seeds, and dtypes, asserting that the outputs are numerically close and that performance stays within thresholds. Want me to include it?
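
The shape of such an A/B check can be sketched without a GPU. The code below uses pure-Python stand-ins rather than the DeepGEMM kernels: gemm_nt computes A·Bᵀ directly and gemm_swapped computes (B·Aᵀ)ᵀ. Treating the second tactic as an operand swap is an assumption made for illustration, based only on the "swap_ab" wording in this review. A real test would call the two kernels on FP8 inputs and would need tolerances chosen for FP8.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def gemm_nt(a: Matrix, b: Matrix) -> Matrix:
    """C[i][j] = sum_k a[i][k] * b[j][k], i.e. A @ B^T."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b] for row_a in a]


def gemm_swapped(a: Matrix, b: Matrix) -> Matrix:
    """Same product computed as (B @ A^T)^T; a stand-in for a swapped-operand tactic."""
    c_t = gemm_nt(b, a)  # shape N x M
    return [list(col) for col in zip(*c_t)]  # transpose back to M x N


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def tactics_agree(m: int, n: int = 8, k: int = 16, seed: int = 0) -> bool:
    """Compare both paths element by element for one (M, seed) combination."""
    rng = random.Random(seed)
    a = random_matrix(m, k, rng)
    b = random_matrix(n, k, rng)
    ref, alt = gemm_nt(a, b), gemm_swapped(a, b)
    return all(
        math.isclose(x, y, rel_tol=1e-9, abs_tol=1e-12)
        for ref_row, alt_row in zip(ref, alt)
        for x, y in zip(ref_row, alt_row)
    )


# Sweep the small-M shapes named in the comment above over a few seeds.
print(all(tactics_agree(m, seed=s) for m in (32, 64, 128) for s in range(3)))
```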

tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)

945-945: Remove empty line for consistency.

-
        forward_func = deep_gemm.fp8_gemm_ntt if tactic == 1 else deep_gemm.fp8_gemm_nt
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d5df0af and 9da50c0.

📒 Files selected for processing (3)
  • .gitmodules (1 hunks)
  • 3rdparty/DeepGEMM (1 hunks)
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (5)
3rdparty/DeepGEMM (3)

1-1: License continuity check required — manual verification

GitHub API returned 404 for submodule 3rdparty/DeepGEMM; confirm the fork’s LICENSE matches approved terms (no additional clauses) and update NOTICE/THIRD_PARTY artifacts if the license or attribution differs. If the submodule is private/relocated, paste the LICENSE text or provide an accessible link.


1-1: Blocker — submodule points to a personal fork; do not merge until supply‑chain & license review completes.

  • Location: .gitmodules → submodule 3rdparty/DeepGEMM
  • Evidence: URL = https://github.com/ruoqianguo/DeepGEMM.git, branch = swapab_sm100, commit = 67e3c4d3d09b59405fd6e7698a33db747ed96533.
  • Verification: automated GH metadata check failed (gh returned 404 / owner_repo parsed as ruoqianguo/DeepGEMM.git); repo/license/commit not confirmed.
  • Action: either switch to an org-owned or upstream repo pinned to a specific commit (preferred), or provide an approved supply‑chain & license review for this personal fork before merging.
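
One way to enforce the action above automatically is an owner allowlist for submodule URLs. The following is a minimal sketch, assuming https GitHub URLs; the APPROVED_OWNERS set and both helper names are hypothetical, and the real list of approved owners would come from project policy.

```python
from urllib.parse import urlparse

# Hypothetical allowlist for illustration only.
APPROVED_OWNERS = {"NVIDIA"}


def repo_owner(url: str) -> str:
    """Return the owner segment of an https GitHub repository URL."""
    path = urlparse(url).path.strip("/")
    return path.split("/")[0]


def is_approved(url: str) -> bool:
    """True when the repository owner is on the allowlist."""
    return repo_owner(url) in APPROVED_OWNERS


# The URL flagged in this review is owned by a personal account.
print(repo_owner("https://github.com/ruoqianguo/DeepGEMM.git"))
print(is_approved("https://github.com/ruoqianguo/DeepGEMM.git"))
```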

1-1: Attest determinism: ensure upstream commit 67e3c4d (67e3c4d3d09b59405fd6e7698a33db747ed96533) is tagged and protected

  • Create an annotated tag pointing at that SHA in the submodule remote and push it.
  • Protect the branch containing the commit (disable force-push).
  • Add a CI check that verifies the annotated tag → commit mapping (fail if missing/mismatched).

Automated check returned HTTP 404 — re-run the supplied verification script with GH authentication or confirm manually that the tag and branch protection exist.

Location: 3rdparty/DeepGEMM (submodule)
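
The tag-to-commit check suggested above could be sketched as follows. It parses the text printed by git ls-remote --tags, where each line is a SHA, a tab, and a ref name, and where an annotated tag has an extra "peeled" entry ending in ^{} that carries the commit it points at. The tag name pinned-sm100 and the tag object SHA in the sample are hypothetical; the commit SHA is the one quoted in this review.

```python
from typing import Optional

PINNED_COMMIT = "67e3c4d3d09b59405fd6e7698a33db747ed96533"


def peeled_commit(ls_remote_output: str, tag: str) -> Optional[str]:
    """Return the commit an annotated tag points at, or None if no peeled entry exists."""
    wanted = f"refs/tags/{tag}^{{}}"
    for line in ls_remote_output.splitlines():
        sha, _, ref = line.partition("\t")
        if ref == wanted:
            return sha
    return None


def tag_matches(ls_remote_output: str, tag: str, expected_sha: str) -> bool:
    """True only when the annotated tag exists and peels to the expected commit."""
    return peeled_commit(ls_remote_output, tag) == expected_sha


# Hypothetical ls-remote output: the tag object first, then its peeled commit.
SAMPLE_REFS = (
    "1111111111111111111111111111111111111111\trefs/tags/pinned-sm100\n"
    f"{PINNED_COMMIT}\trefs/tags/pinned-sm100^{{}}\n"
)

print(tag_matches(SAMPLE_REFS, "pinned-sm100", PINNED_COMMIT))
```

A lightweight tag has no peeled entry, so this check treats it as missing, which matches the request for an annotated tag.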

tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (2)

929-931: LGTM! Enhanced tactic selection for small batch sizes.

The logic correctly expands tactic options to include both swap_ab variants (0 and 1) when input batch size is ≤ 128, which aligns with the PR's goal of optimizing small M performance with DeepGEMM.


946-952: LGTM! Dynamic dispatch implementation is correct.

The implementation correctly selects between fp8_gemm_ntt (tactic 1) and fp8_gemm_nt (default) based on the chosen tactic, with all parameters properly forwarded including disable_ue8m0_cast.

@tensorrt-cicd
Collaborator

PR_Github #18593 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13955 completed with status: 'SUCCESS'

@Barry-Delaney Barry-Delaney force-pushed the user/barry/cherry_pick_dg branch 2 times, most recently from 7ea492e to 7d65bf6 Compare September 17, 2025 02:58
@Barry-Delaney
Collaborator Author

/bot run

@Barry-Delaney Barry-Delaney enabled auto-merge (squash) September 17, 2025 03:01
@tensorrt-cicd
Collaborator

PR_Github #18873 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #18873 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14148 completed with status: 'FAILURE'

@Barry-Delaney Barry-Delaney force-pushed the user/barry/cherry_pick_dg branch from 7d65bf6 to 6c8f0c3 Compare September 17, 2025 05:24
@Barry-Delaney
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #18898 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #18898 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14167 completed with status: 'FAILURE'

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
@Barry-Delaney Barry-Delaney force-pushed the user/barry/cherry_pick_dg branch from 6c8f0c3 to d6e1515 Compare September 18, 2025 02:21
@Barry-Delaney
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #19088 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19088 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14319 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@Barry-Delaney Barry-Delaney merged commit 4f0e6b5 into NVIDIA:main Sep 18, 2025
5 checks passed
Wong4j pushed a commit to Wong4j/TensorRT-LLM that referenced this pull request Sep 20, 2025
…c2 (NVIDIA#7716)

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
MrGeva pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request Sep 21, 2025
…c2 (NVIDIA#7716)

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>