[None][feat] Cherry-pick DeepGEMM related commits from release/1.1.0rc2 #7716
/bot run
PR_Github #18593 [ run ] triggered by Bot
📝 Walkthrough
The DeepGEMM submodule URL and branch were updated in .gitmodules, and its pointer was advanced. In torch_custom_ops.py, tactic selection was expanded to include an additional tactic for small M, and forward now dynamically dispatches between fp8_gemm_nt and fp8_gemm_ntt based on the selected tactic.
Sequence Diagram(s)
sequenceDiagram
    participant Caller as PyTorch Op
    participant Runner as fp8SwapABGemmRunner
    participant Tactics as get_valid_tactics
    participant Kernel0 as fp8_gemm_nt
    participant Kernel1 as fp8_gemm_ntt
    Caller->>Runner: forward(inputs, weights, ..., tactic)
    alt tactic unspecified
        Runner->>Tactics: get_valid_tactics(input.shape)
        Tactics-->>Runner: [0,1] if M<=128 else [0]
        note over Runner: Selects a tactic from valid list
    end
    alt tactic == 1
        Runner->>Kernel1: fp8_gemm_ntt(..., disable_ue8m0_cast)
    else tactic != 1
        Runner->>Kernel0: fp8_gemm_nt(..., disable_ue8m0_cast)
    end
    Kernel0-->>Runner: output
    Kernel1-->>Runner: output
    Runner-->>Caller: output
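The dispatch in the diagram can be sketched in plain Python. This is only an illustration under stated assumptions, not the code from torch_custom_ops.py: the two kernels are stand-in stubs for deep_gemm.fp8_gemm_nt and deep_gemm.fp8_gemm_ntt, the threshold of 128 is taken from the walkthrough, and picking the first valid tactic when none is given is an assumption (the diagram only says a tactic is selected from the valid list).

```python
from typing import List, Optional, Tuple

# From the walkthrough: tactic 1 is only offered when M <= 128.
SMALL_M_THRESHOLD = 128


def fp8_gemm_nt_stub(m: int) -> str:
    """Hypothetical stand-in for deep_gemm.fp8_gemm_nt; performs no real GEMM."""
    return f"fp8_gemm_nt(M={m})"


def fp8_gemm_ntt_stub(m: int) -> str:
    """Hypothetical stand-in for deep_gemm.fp8_gemm_ntt; performs no real GEMM."""
    return f"fp8_gemm_ntt(M={m})"


def get_valid_tactics(input_shape: Tuple[int, int]) -> List[int]:
    """Return [0, 1] for small M, otherwise only the default tactic 0."""
    m = input_shape[0]
    return [0, 1] if m <= SMALL_M_THRESHOLD else [0]


def forward(input_shape: Tuple[int, int], tactic: Optional[int] = None) -> str:
    """Dispatch to one of the two kernels based on the tactic."""
    if tactic is None:
        # Assumption: fall back to the first valid tactic when unspecified.
        tactic = get_valid_tactics(input_shape)[0]
    kernel = fp8_gemm_ntt_stub if tactic == 1 else fp8_gemm_nt_stub
    return kernel(input_shape[0])
```

With these stubs, forward((64, 512), tactic=1) takes the fp8_gemm_ntt path, while any shape with M above 128 only ever sees tactic 0.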
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 1
🧹 Nitpick comments (6)
.gitmodules (2)
32-32: Avoid tracking a moving branch for submodules; it hurts reproducibility.
Unless you intentionally use `--remote` updates, drop `branch` or set it to `.` to ensure builds always use the recorded SHA.
Apply one of:
- branch = swapab_sm100
+ # Track the exact superproject-pinned commit for reproducible builds
+ branch = .
or simply remove the line:
- branch = swapab_sm100
31-32: Optional: shallow-clone the submodule to speed up CI.
If full history isn't needed, enable shallow clone.
  [submodule "3rdparty/DeepGEMM"]
      path = 3rdparty/DeepGEMM
      url = https://github.com/ruoqianguo/DeepGEMM.git
+     shallow = true
-     branch = swapab_sm100
+     branch = .
3rdparty/DeepGEMM (3)
1-1: Pin to an org-owned mirror or upstream; avoid personal forks.
If the fork is required for SM100 work, mirror it into the NVIDIA org (read-only) and point .gitmodules there. Document divergence and sync policy in THIRD_PARTY.md.
1-1: CI/Build guardrails for submodules.
Ensure CI clones with submodules and fails fast on drift:
- Use: git submodule sync --recursive && git submodule update --init --recursive --depth 1
- Add a job to verify the recorded SHA matches workspace: git submodule status --recursive | awk '$1 !~ /^-/{exit 0} {exit 1}'
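A drift check along these lines could also be written as a small Python helper that inspects the output of git submodule status. The sketch below relies on the documented prefix characters of that command ("-" for an uninitialized submodule, "+" when the checked-out commit differs from the recorded SHA, "U" for merge conflicts). The function name and the second sample entry are hypothetical; the first SHA and path are the ones quoted in this review.

```python
from typing import List, Tuple

# Prefix characters documented for `git submodule status`.
DRIFT_REASONS = {
    "-": "not initialized",
    "+": "checked-out commit differs from recorded SHA",
    "U": "merge conflict",
}


def find_submodule_drift(status_output: str) -> List[Tuple[str, str]]:
    """Return (path, reason) for every submodule line that signals drift."""
    problems = []
    for line in status_output.splitlines():
        if not line.strip():
            continue
        prefix = line[0]
        if prefix in DRIFT_REASONS:
            # After the prefix come the SHA, the path, and optionally a
            # describe string. Paths containing spaces are not handled here.
            path = line[1:].split()[1]
            problems.append((path, DRIFT_REASONS[prefix]))
    return problems


# Made-up sample output for illustration only.
SAMPLE_STATUS = (
    " 67e3c4d3d09b59405fd6e7698a33db747ed96533 3rdparty/DeepGEMM (heads/swapab_sm100)\n"
    "+0123456789abcdef0123456789abcdef01234567 3rdparty/example (v1.0)\n"
)
```

In a CI job, the input could come from subprocess.run(["git", "submodule", "status", "--recursive"], capture_output=True, text=True, check=True).stdout, with the job failing whenever the returned list is non-empty.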
1-1: Runtime compatibility sanity checks (SM100 path).
Given the branch name swapab_sm100 and the AI summary about tactic changes, gate kernels by compute capability and add A/B numeric checks (tactic 0 vs 1) for small-M shapes in CI.
I can draft a minimal PyTest that exercises fp8_gemm_nt vs fp8_gemm_ntt across M∈{32,64,128}, random seeds, and dtypes to assert close() and perf thresholds. Want me to include it?
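The shape of such an A/B check can be sketched without a GPU. The example below is not the proposed PyTest and does not call DeepGEMM: both GEMM functions are pure-Python stand-ins (one computes A @ B^T directly, the other computes B @ A^T and transposes back), and the sizes n and k and the tolerances are arbitrary assumptions. It only illustrates comparing two code paths element-wise across M in {32, 64, 128} and several seeds.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def gemm_nt(a: Matrix, b: Matrix) -> Matrix:
    """Stand-in for path 0: C[i][j] = sum_k a[i][k] * b[j][k], i.e. A @ B^T."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b] for row_a in a]


def gemm_swapped(a: Matrix, b: Matrix) -> Matrix:
    """Stand-in for path 1: compute B @ A^T, then transpose the result back."""
    c_t = gemm_nt(b, a)
    return [list(col) for col in zip(*c_t)]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def outputs_match(m: int, n: int = 8, k: int = 16, seed: int = 0,
                  rel_tol: float = 1e-9, abs_tol: float = 1e-12) -> bool:
    """Run both paths on the same random inputs and compare element-wise."""
    rng = random.Random(seed)
    a = random_matrix(m, k, rng)
    b = random_matrix(n, k, rng)
    out0 = gemm_nt(a, b)
    out1 = gemm_swapped(a, b)
    if len(out0) != len(out1) or any(len(r0) != len(r1) for r0, r1 in zip(out0, out1)):
        return False
    return all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
        for r0, r1 in zip(out0, out1)
        for x, y in zip(r0, r1)
    )
```

A real test would replace the stand-ins with the two kernel calls on GPU tensors and would likely use torch.testing.assert_close with tolerances suited to FP8, which are looser than the ones assumed here.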
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
945-945: Remove empty line for consistency.
-
  forward_func = deep_gemm.fp8_gemm_ntt if tactic == 1 else deep_gemm.fp8_gemm_nt
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- .gitmodules (1 hunk)
- 3rdparty/DeepGEMM (1 hunk)
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (5)
3rdparty/DeepGEMM (3)
1-1: License continuity check required: manual verification
GitHub API returned 404 for submodule 3rdparty/DeepGEMM; confirm the fork's LICENSE matches approved terms (no additional clauses) and update NOTICE/THIRD_PARTY artifacts if the license or attribution differs. If the submodule is private/relocated, paste the LICENSE text or provide an accessible link.
1-1: Blocker — submodule points to a personal fork; do not merge until supply‑chain & license review completes.
- Location: .gitmodules → submodule 3rdparty/DeepGEMM
- Evidence: URL = https://github.com/ruoqianguo/DeepGEMM.git, branch = swapab_sm100, commit = 67e3c4d3d09b59405fd6e7698a33db747ed96533.
- Verification: automated GH metadata check failed (gh returned 404 / owner_repo parsed as ruoqianguo/DeepGEMM.git); repo/license/commit not confirmed.
- Action: either switch to an org-owned or upstream repo pinned to a specific commit (preferred), or provide an approved supply‑chain & license review for this personal fork before merging.
1-1: Attest determinism: ensure upstream commit 67e3c4d (67e3c4d3d09b59405fd6e7698a33db747ed96533) is tagged and protected
- Create an annotated tag pointing at that SHA in the submodule remote and push it.
- Protect the branch containing the commit (disable force-push).
- Add a CI check that verifies the annotated tag → commit mapping (fail if missing/mismatched).
Automated check returned HTTP 404 — re-run the supplied verification script with GH authentication or confirm manually that the tag and branch protection exist.
Location: 3rdparty/DeepGEMM (submodule)
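One way to sketch the tag-to-commit check is to parse the output of git ls-remote --tags, where an annotated tag appears twice: once as the tag object and once with a ^{} suffix giving the commit it points to. The tag name deepgemm-pin, the tag-object SHA, and both function names below are hypothetical; only the commit SHA is taken from this review.

```python
from typing import Dict


def peeled_tag_commits(ls_remote_output: str) -> Dict[str, str]:
    """Map annotated tag names to the commits they point at.

    Only lines ending in ^{} are considered, so lightweight tags
    (which have no peeled entry) are deliberately ignored.
    """
    commits = {}
    for line in ls_remote_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        sha, ref = parts
        if ref.startswith("refs/tags/") and ref.endswith("^{}"):
            name = ref[len("refs/tags/"):-len("^{}")]
            commits[name] = sha
    return commits


def tag_matches(ls_remote_output: str, tag: str, expected_sha: str) -> bool:
    """True only if an annotated tag exists and peels to expected_sha."""
    return peeled_tag_commits(ls_remote_output).get(tag) == expected_sha


# Made-up sample: the first SHA is a placeholder for the tag object.
SAMPLE_LS_REMOTE = (
    "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\trefs/tags/deepgemm-pin\n"
    "67e3c4d3d09b59405fd6e7698a33db747ed96533\trefs/tags/deepgemm-pin^{}\n"
)
```

A CI step could feed the real ls-remote output into tag_matches and fail the build when it returns False, which covers both a missing tag and a tag that points at a different commit.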
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (2)
929-931: LGTM! Enhanced tactic selection for small batch sizes.
The logic correctly expands tactic options to include both swap_ab variants (0 and 1) when input batch size is ≤ 128, which aligns with the PR's goal of optimizing small M performance with DeepGEMM.
946-952: LGTM! Dynamic dispatch implementation is correct.
The implementation correctly selects between `fp8_gemm_ntt` (tactic 1) and `fp8_gemm_nt` (default) based on the chosen tactic, with all parameters properly forwarded, including `disable_ue8m0_cast`.
PR_Github #18593 [ run ] completed with state
7ea492e to 7d65bf6
/bot run
PR_Github #18873 [ run ] triggered by Bot
PR_Github #18873 [ run ] completed with state
7d65bf6 to 6c8f0c3
/bot run
PR_Github #18898 [ run ] triggered by Bot
PR_Github #18898 [ run ] completed with state
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
6c8f0c3 to d6e1515
/bot run
PR_Github #19088 [ run ] triggered by Bot
PR_Github #19088 [ run ] completed with state
…c2 (NVIDIA#7716) Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Summary by CodeRabbit
Performance Improvements
Chores
Refactor