-
Notifications
You must be signed in to change notification settings - Fork 1.8k
[TRTLLM-6743][feat] Optimize and refactor alltoall in WideEP #6973
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
📝 WalkthroughWalkthroughAdds a new fused MoE communication implementation (kernels, headers, host APIs, workspace management, and tests), removes the legacy moeComm kernels, refactors prepare kernels to per-slot statics, updates Torch bindings and Python utilities to a fused multi-tensor workflow, and adds comprehensive CUDA unit tests. Changes
Sequence Diagram(s)sequenceDiagram
participant Py as Python
participant TH as Torch C++ binding
participant FM as FusedMoE C++ APIs
participant K as CUDA Kernels
Py->>TH: moe_initialize_workspace(all_workspaces, ep_rank, ep_size)
TH->>FM: initializeFusedMoeLocalWorkspace(...)
FM->>K: workspace init kernel(s)
K-->>FM: done
FM-->>TH: return
TH-->>Py: return
Py->>TH: moe_comm([inputs], send_cumsum, send_idx, recv_cumsum, recv_idx, all_workspaces, output_alloc_count, ep_rank, ep_size)
TH->>FM: moeAllToAll(params, workspace, stream)
FM->>K: moeAllToAllKernel<<<...>>>(...)
K-->>FM: transfers complete
FM-->>TH: return
TH-->>Py: return
Py->>TH: memset_expert_ids(expert_ids, recv_cumsum, max_tokens, top_k, slot_count, ep_size)
TH->>K: memsetExpertIds kernel
K-->>TH: done
TH-->>Py: return
sequenceDiagram
participant Mod as WideEPMoE
participant Util as _mnnvl_utils
participant TH as Torch C++ binding
participant FM as FusedMoE C++ APIs
participant K as CUDA Kernels
Mod->>Util: mnnvl_moe_alltoallv_prepare_without_allgather(...)
Util-->>Mod: alltoall_info, gathered_statics
Mod->>Util: mnnvl_moe_alltoallv([x, x_sf, slots, scales], alltoall_info, workspace, ep_rank, ep_size)
Util->>TH: moe_comm([...], ..., all_workspaces, output_alloc_count, ep_rank, ep_size)
TH->>FM: moeAllToAll(...)
FM->>K: moeAllToAllKernel<<<...>>>(...)
K-->>FM: done
FM-->>TH: return
TH-->>Util: return
Util-->>Mod: [x', x_sf', slots', scales']
Mod->>TH: memset_expert_ids(...)
TH-->>Mod: return
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related PRs
Suggested labels
Tip 🔌 Remote MCP (Model Context Protocol) integration is now available!Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats. ✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
Status, Documentation and Community
|
/bot run |
PR_Github #15561 [ run ] triggered by Bot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 11
🔭 Outside diff range comments (2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)
195-198
: Inconsistent top_k guard removal.The comment on line 194 indicates "alltoall without allgather only supports top_k % 4 == 0", but the condition was not fully removed. This creates inconsistency between the comment and actual implementation. The enable_alltoall property now checks if
experts_per_token % 4 == 0
rather than supporting generic top_k values as the broader changes suggest.Either update the comment to reflect the actual constraint or fully remove the
% 4 == 0
check:- and self.routing_method.experts_per_token % 4 == - 0 # alltoall without allgather only supports top_k % 4 == 0 + # Generic top_k values now supportedcpp/tensorrt_llm/thop/moeCommOp.cpp (1)
141-209
: Missing validation for expert count divisibility.Line 150-151 checks that
expertCount + 1 <= 512
, but this check seems incomplete. The error message is also not descriptive enough about why this constraint exists.CHECK_INPUT(expertsIds, torch::kInt32); TORCH_CHECK(expertCount % 4 == 0, "expertCount must be divisible by 4"); TORCH_CHECK(slotCount % 4 == 0, "slotCount must be divisible by 4"); - TORCH_CHECK(expertCount + 1 <= 512, - "expertCount + 1 is larger than 512"); + TORCH_CHECK(expertCount + 1 <= 512, + "expertCount + 1 (" + std::to_string(expertCount + 1) + + ") exceeds maximum supported value of 512");
🧹 Nitpick comments (9)
cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h (2)
18-18
: Prefer C++ headers: use instead of <stdint.h>.Use C++ headers to align with coding standards and avoid pulling symbols into the global namespace unexpectedly.
-#include <stdint.h> +#include <cstdint>
31-33
: Potential duplication/confusion of WARP constants across the codebase.
WARP_SIZE
andWARP_MASK
are also defined/used elsewhere (e.g., FusedMoeCommunicator, kernels .cu). Consider centralizing these definitions in this common header and consuming them via this header only to avoid divergence.cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)
1322-1326
: Magic number for 48 KiB dynamic shared memory threshold.Prefer a named constexpr (e.g., kDEFAULT_SMEM_LIMIT_PER_BLOCK) instead of
48 * 1024
to document intent. It also eases future adjustments.cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (2)
109-121
: Variable shadowing inside getFieldUncompactSize; use a distinct name or the constant directly.
alignedUnitBytes
is re-declared as a constexpr within theif
block, shadowing the outer variable. This is confusing and error-prone. Use a different name or the constant directly.- if (alignedUnitBytes != 16) + if (alignedUnitBytes != 16) { - constexpr int alignedUnitBytes = BYTES_PER_16B_BLOCK; - currentFieldSize = currentFieldSize / alignedUnitBytes * alignedUnitBytes; - currentFieldSize += alignedUnitBytes * 2; + currentFieldSize = (currentFieldSize / BYTES_PER_16B_BLOCK) * BYTES_PER_16B_BLOCK; + currentFieldSize += BYTES_PER_16B_BLOCK * 2; }
260-318
: Duplicate WARP_SIZE definition vs common header.
FusedMoeCommunicator
defines its ownWARP_SIZE = 32
, whilemoeCommKernelsCommon.h
also definesWARP_SIZE
andWARP_MASK
. Prefer a single source of truth (e.g., the common header) to reduce risk of divergence.cpp/tensorrt_llm/thop/moeCommOp.cpp (2)
31-36
: Consider adding alignment checks for field info.The function accepts a 2D tensor and extracts element size and stride information. Consider validating that the tensor's memory layout meets any alignment requirements that the fused MOE kernels might expect.
void setMoeCommFieldInfo(tensorrt_llm::kernels::MoeCommFieldInfo& fieldInfo, torch::Tensor const& tensor) { TORCH_CHECK(tensor.dim() == 2, "tensor must be a 2D tensor"); int eltSize = tensor.dtype().itemsize(); + // Validate that the tensor is contiguous or meets alignment requirements + TORCH_CHECK(tensor.is_contiguous(), "tensor must be contiguous for optimal MOE communication"); fieldInfo.fillFieldInfo(static_cast<uint8_t*>(tensor.data_ptr()), eltSize, tensor.size(1), tensor.stride(0)); }
231-231
: Add namespace closing comment.According to the coding guidelines, namespaces should be closed with a comment naming the namespace.
-} // namespace torch_ext +} // namespace torch_extcpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (2)
314-314
: Use EXPECT_FLOAT_EQ for float comparisons.For floating-point comparisons, prefer
EXPECT_FLOAT_EQ
overEXPECT_NEAR
with a fixed epsilon when comparing exact values.- EXPECT_NEAR(expected, actual, 1e-6f) << "Scale mismatch at warp=" << tokenId << ", k=" << k; + EXPECT_FLOAT_EQ(expected, actual) << "Scale mismatch at warp=" << tokenId << ", k=" << k;
1280-1289
: Consistent cleanup pattern needed.The cleanup section has inconsistent patterns for checking null pointers before deletion. Array deletion is safe with null pointers in C++.
// Cleanup temporary arrays - if (resultRecvTokenSlots) - delete[] resultRecvTokenSlots; - if (resultRecvScales) - delete[] resultRecvScales; + delete[] resultRecvTokenSlots; // Safe even if null + delete[] resultRecvScales; // Safe even if null for (int i = 0; i < fieldCount; i++) { - if (resultRecvFields[i]) - delete[] resultRecvFields[i]; + delete[] resultRecvFields[i]; // Safe even if null }
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (16)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
(1 hunks)cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
(1 hunks)cpp/tensorrt_llm/kernels/moeCommKernels.cu
(0 hunks)cpp/tensorrt_llm/kernels/moeCommKernels.h
(0 hunks)cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h
(1 hunks)cpp/tensorrt_llm/kernels/moePrepareKernels.cu
(7 hunks)cpp/tensorrt_llm/kernels/moePrepareKernels.h
(3 hunks)cpp/tensorrt_llm/thop/moeCommOp.cpp
(6 hunks)cpp/tensorrt_llm/thop/moeLoadBalanceOp.cpp
(0 hunks)cpp/tests/unit_tests/kernels/CMakeLists.txt
(1 hunks)cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
(1 hunks)tensorrt_llm/_mnnvl_utils.py
(5 hunks)tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
(0 hunks)tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
(1 hunks)tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
(5 hunks)tests/unittest/_torch/thop/test_moe_alltoall.py
(8 hunks)
💤 Files with no reviewable changes (4)
- cpp/tensorrt_llm/thop/moeLoadBalanceOp.cpp
- cpp/tensorrt_llm/kernels/moeCommKernels.cu
- tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
- cpp/tensorrt_llm/kernels/moeCommKernels.h
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
cpp/tensorrt_llm/kernels/moePrepareKernels.h
cpp/tensorrt_llm/thop/moeCommOp.cpp
cpp/tensorrt_llm/kernels/moePrepareKernels.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
cpp/tensorrt_llm/kernels/moePrepareKernels.h
cpp/tensorrt_llm/thop/moeCommOp.cpp
cpp/tensorrt_llm/kernels/moePrepareKernels.cu
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
cpp/tensorrt_llm/kernels/moePrepareKernels.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
tests/unittest/_torch/thop/test_moe_alltoall.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
cpp/tensorrt_llm/kernels/moePrepareKernels.h
cpp/tensorrt_llm/thop/moeCommOp.cpp
tensorrt_llm/_mnnvl_utils.py
cpp/tensorrt_llm/kernels/moePrepareKernels.cu
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
cpp/tensorrt_llm/thop/moeCommOp.cpp
cpp/tensorrt_llm/kernels/moePrepareKernels.cu
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tests/unittest/_torch/thop/test_moe_alltoall.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_mnnvl_utils.py
🧠 Learnings (2)
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
🪛 Ruff (0.12.2)
tests/unittest/_torch/thop/test_moe_alltoall.py
919-919: Redefinition of unused TestMoeAlltoAllFP8SingleGPU
from line 319
(F811)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
465-465: Line too long (124 > 120)
(E501)
🔇 Additional comments (22)
tensorrt_llm/_mnnvl_utils.py (3)
369-373
: Workspace initialization + barrier looks correct.Initializing the workspace via
torch.ops.trtllm.moe_initialize_workspace(...)
and then hitting a communicator barrier ensures ranks see a consistent FIFO state before first use.
541-571
: Dimension validation and preallocation path looks solid.
- Validates 2D tensors and consistent first dimension for non-None entries.
- Pre-allocates outputs with correct sizes and dtypes/devices.
- Executes a single fused
moe_comm
call across the valid subset.
616-625
: Updated moe_comm invocation in combine() matches fused list-based API.Wrapping inputs/outputs in 1-element lists is the right adaptation for the fused path here.
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (1)
50-67
: Host-side helpers inside SendRecvIndices conditionally compiled only under CUDACC.If these helpers are intended for host code in non-CUDA unit tests or CPU stubs, consider moving
__host__
versions outside the#ifdef __CUDACC__
guard. If device-only by design, this is fine.cpp/tests/unit_tests/kernels/CMakeLists.txt (1)
45-46
: Nice addition: fused MoE comm kernel test registered.Registering
fusedMoeCommKernelTest
improves coverage for the new fused path. Ensure the target links the correct CUDA archs (SM80/SM90) depending on CI matrix so the inline PTX guards are exercised.tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (3)
372-374
: Good addition of expert ID population.The explicit call to
memset_expert_ids
ensures proper initialization of expert IDs using the alltoall_info metadata. This is a crucial step that was previously handled implicitly.
376-379
: FP4 scaling factor shape handling looks correct.The code properly reshapes the scaling factor tensor after the alltoall dispatch and applies swizzle_sf transformation when needed. The conditional check ensures this only happens when x_sf exists.
357-360
: No cleanup needed: API still requiresexpert_statics
argument
The signature ofmnnvl_moe_alltoallv_prepare_without_allgather
in_mnnvl_utils.py
still declares the second parameter asexpert_statics: Optional[torch.Tensor]and several call sites (e.g. in
fused_moe_wide_ep.py
and the unit tests) pass a real tensor for this argument. Removing theNone
infused_moe_cutlass.py
would break the API consistency. Leave theNone
so that the wrapper continues to satisfy the current signature.Likely an incorrect or invalid review comment.
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)
299-300
: Simplified control flow with MNNVL alltoall.The
can_use_alltoall
method now unconditionally returnsTrue
for MNNVL, enforcing alltoall usage. This simplification is consistent with the two-stage prepare/dispatch flow but removes flexibility for runtime decision-making.
867-893
: Well-structured two-stage alltoall API.The new
alltoall_prepare
andalltoall_dispatch
methods provide a clean separation of concerns. The prepare phase handles metadata computation while dispatch performs the actual data transfer. This design allows for potential optimizations between the two phases.
465-474
: Ensure correct handling of local statistics across repeated callsPlease verify that when
repeat_count > 1
, the local‐statistic tensor is only gathered on the final invocation and that earlier calls correctly passNone
intoalltoall_prepare
. In particular:
- Confirm that
self.layer_load_balancer.get_local_statistic_tensor()
is invoked only when
is_last_call
isTrue
self.layer_load_balancer
exists and reports non‐static routing.- Verify that for all preceding calls (
is_last_call == False
),loadbalancer_local_statistic_info
is indeedNone
, and thatgathered_loadbalancer_local_statistic_info
produced byalltoall_prepare
is handled correctly (i.e., remainsNone
or merges properly).- Add or extend unit/integration tests to simulate
repeat_count > 1
and inspect the sequence of tensors through these branches.Affected block (tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py, ~lines 465–474):
- if is_last_call and self.layer_load_balancer is not None and not self.layer_load_balancer.is_static_routing(): - loadbalancer_local_statistic_info = self.layer_load_balancer.get_local_statistic_tensor() - else: - loadbalancer_local_statistic_info = None + # Only gather local stats on the final call when dynamic routing is enabled + if is_last_call + and self.layer_load_balancer is not None + and not self.layer_load_balancer.is_static_routing(): + loadbalancer_local_statistic_info = ( + self.layer_load_balancer.get_local_statistic_tensor() + ) + else: + loadbalancer_local_statistic_info = Nonetests/unittest/_torch/thop/test_moe_alltoall.py (3)
81-84
: Workspace initialization update looks correct.The workspace size is now divided by 8 (uint64 elements) and initialized using the new
moe_initialize_workspace
API. This aligns with the broader workspace management changes in the PR.
160-180
: Good multi-tensor test coverage.The test now properly handles multiple tensor dimensions by creating separate input/output tensors for each dimension in vector_dims. The per-tensor orchestration and verification logic is well-structured.
394-399
: Inconsistent workspace size handling in FP8 test.The FP8 test uses
workspace_size
directly as the second dimension of the workspace tensor, while other tests divide it by 8. This inconsistency could cause issues.Apply the same workspace size handling:
workspace_size = torch.ops.trtllm.get_moe_commworkspace_size_per_rank(1) all_workspaces = torch.zeros(1, - workspace_size, + workspace_size // 8, dtype=torch.uint64, device='cuda') + torch.ops.trtllm.moe_initialize_workspace(all_workspaces, 0, 1)⛔ Skipped due to learnings
Learnt from: sklevtsov-nvidia PR: NVIDIA/TensorRT-LLM#3294 File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127 Timestamp: 2025-08-09T20:57:04.084Z Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
cpp/tensorrt_llm/kernels/moePrepareKernels.cu (5)
56-78
: Per-slot counter implementation looks correct.The refactored
CounterCommunicator
now operates on a per-slot values array with proper synchronization using acquire/release semantics. The +1/-1 offset ensures 0 is never used as a valid value, avoiding busy-wait on uninitialized memory.
85-88
: Function signature properly extended for statistics.The
computeCountAndSendStatics
function now acceptsexpertStatics
,slotCount
, andexpertCount
parameters to support the new per-slot statistics-based workflow. The naming accurately reflects the extended functionality.
126-137
: Statistics communication logic is well-structured.The code correctly handles both the count (i==0) and expert statistics (i>0) in a unified loop. The conditional logic ensures backward compatibility when expertStatics is nullptr.
328-330
: Simplified workspace size calculation.The workspace size is now simply
sizeof(MoeCommFifoConnInfo) * epSize
, which is much cleaner than the previous FIFO-based formula. This aligns with the per-slot counter approach.
318-325
: memsetExpertIds wrapper is correctThe host function passes the
expertIds
device pointer directly to the CUDA kernel, and the kernel’s implementation writes intoexpertIds
exactly as intended—there is no stray typo or missing offset in the wrapper.– The kernel signature (
int* expertIds, …
) matches the argument passed by the host.
– InsidememsetExpertIdsDevice
, writes toexpertIds
occur at the computed indices.No changes needed.
cpp/tensorrt_llm/kernels/moePrepareKernels.h (3)
43-48
: FIFO structure properly adapted for multi-value storage.The
MoeCommFifoConnInfo
structure now uses a 512-element values array instead of a single count, supporting the per-slot statistics workflow. The volatile qualifier ensures proper memory visibility across threads.
55-66
: Simplified pointer arithmetic in getFifoConnInfo.The method now uses
workspacePtr
directly as the base pointer, simplifying the memory layout. The calculation correctly handles sender/receiver perspective and peer ranking.
71-73
: Updated function signature aligns with implementation.The
computeCountAndIndice
declaration properly includes the newexpertStatics
,gatheredExpertStatics
,slotCount
, andexpertCount
parameters, matching the implementation in the .cu file.
PR_Github #15561 [ run ] completed with state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
♻️ Duplicate comments (2)
tests/unittest/_torch/thop/test_moe_alltoall.py (1)
319-435
: Duplicate FP8 test class — remove the second definition
TestMoeAlltoAllFP8SingleGPU
is defined twice. This will shadow the first definition and can lead to confusion and flakiness in test discovery. Keep only one copy.Apply this diff to remove the duplicate block starting at Line 919:
- class TestMoeAlltoAllFP8SingleGPU(unittest.TestCase): - - def setUp(self): - torch.manual_seed(0x1234) - tllm.logger.set_level('error') - - def test_moe_alltoall_fp8_with_indices(self): - """Test fp8 alltoall with properly constructed indices""" - torch.cuda.set_device(0) - - # Match dimensions from the error - input_entry_count = 16384 - output_entry_count = 16384 - vector_dim = 2944 - sf_vector_dim = 92 # Scaling factor dimension from error - send_recv_count = 1000 # Number of entries to send/receive - - # Create input tensors - first as float16, then convert - input_tensor_fp16 = torch.randn(input_entry_count, - vector_dim, - dtype=torch.float16, - device='cuda') - input_tensor_fp8 = input_tensor_fp16.to(torch.float8_e4m3fn) - - # Scaling factor tensor - input_sf_tensor = torch.randint(1, - 255, (input_entry_count, sf_vector_dim), - dtype=torch.uint8, - device='cuda') - - # Expert selection tensors - input_experts = torch.randint(0, - 64, (input_entry_count, 4), - dtype=torch.int32, - device='cuda') - input_scales = torch.rand(input_entry_count, - 4, - dtype=torch.float32, - device='cuda') - - # Output tensors - output_tensor_fp8 = torch.zeros(output_entry_count, - vector_dim, - dtype=torch.float8_e4m3fn, - device='cuda') - output_sf_tensor = torch.zeros(output_entry_count, - sf_vector_dim, - dtype=torch.uint8, - device='cuda') - output_experts = torch.zeros(output_entry_count, - 4, - dtype=torch.int32, - device='cuda') - output_scales = torch.zeros(output_entry_count, - 4, - dtype=torch.float32, - device='cuda') - - # Construct send/recv indices - send_cumsum = torch.tensor([send_recv_count], - dtype=torch.int32, - device='cuda') - recv_cumsum = torch.tensor([send_recv_count], - dtype=torch.int32, - device='cuda') - - # Random indices for sending - send_indices = torch.randperm(input_entry_count, - dtype=torch.int32, - device='cuda')[:send_recv_count] - recv_indices = torch.randperm(output_entry_count, - dtype=torch.int32, - device='cuda')[:send_recv_count] - - # Create workspace - workspace_size = torch.ops.trtllm.get_moe_commworkspace_size_per_rank(1) - all_workspaces = torch.zeros(1, - workspace_size, - dtype=torch.uint64, - device='cuda') - - print(f"Test configuration:") - print(f" Input entries: {input_entry_count}") - print(f" Vector dim: {vector_dim}") - print(f" SF vector dim: {sf_vector_dim}") - print(f" Send/recv count: {send_recv_count}") - print(f" FP8 tensor shape: {input_tensor_fp8.shape}") - print(f" SF tensor shape: {input_sf_tensor.shape}") - - try: - # Test with all 4 tensors - torch.ops.trtllm.moe_comm([ - input_tensor_fp8, input_sf_tensor, input_experts, input_scales - ], send_cumsum, send_indices, [ - output_tensor_fp8, output_sf_tensor, output_experts, - output_scales - ], recv_cumsum, recv_indices, all_workspaces, 0, 1) - - torch.cuda.synchronize() - print("FP8 alltoall test PASSED!") - - # Verify outputs - print(f"\nOutput verification:") - print(f" Output FP8 shape: {output_tensor_fp8.shape}") - print(f" Output SF shape: {output_sf_tensor.shape}") - print( - f" Non-zero FP8 elements: {(output_tensor_fp8 != 0).sum().item()}" - ) - print( - f" Non-zero SF elements: {(output_sf_tensor != 0).sum().item()}" - ) - - except Exception as e: - print(f"FP8 alltoall test FAILED: {e}") - print(f"Error type: {type(e)}") - raiseAlso applies to: 919-1035
cpp/tensorrt_llm/thop/moeCommOp.cpp (1)
70-85
: Bound field count to compile-time maximum to avoid OOBBefore filling
fieldsInfo
, enforcefieldCount <= MOE_COMM_FIELD_MAX_COUNT
to avoid writing past the fixed-size array.int fieldCount = inputs.size(); +tensorrt_llm::kernels::FusedMoeFieldInfo sendFieldInfo, recvFieldInfo; +TORCH_CHECK( + fieldCount <= tensorrt_llm::kernels::MOE_COMM_FIELD_MAX_COUNT, + "Number of fields (", fieldCount, ") exceeds maximum allowed (", + tensorrt_llm::kernels::MOE_COMM_FIELD_MAX_COUNT, ")"); -sendFieldInfo.isBasicInterleaved = false; +sendFieldInfo.isBasicInterleaved = false;Note: include or ensure
fusedMoeCommKernels.h
exposesMOE_COMM_FIELD_MAX_COUNT
(already included at Line 19).
🧹 Nitpick comments (5)
tests/unittest/_torch/thop/test_moe_alltoall.py (2)
117-126
: Warmup path uses an uninitialized workspace
do_warmup
allocates a freshall_workspaces
but does not callmoe_initialize_workspace
beforemoe_comm
. Initialize it to mirror the main test path.all_workspaces = torch.zeros(1, workspace_size // 8, dtype=torch.uint64, device=torch.device('cuda')) +torch.ops.trtllm.moe_initialize_workspace(all_workspaces, 0, 1) torch.ops.trtllm.moe_comm(input_tensors, send_cumsum, send_indices, output_tensors, recv_cumsum, recv_indices, all_workspaces, 0, 1)
251-255
: Reference assembly may be fragile across ranks; verify offset usage
output_indice_offset
is accumulated across nested loops overother_rank
. This is correct if each rank’s recv counts are appended contiguously per source rank, but it couples the reference construction to the same concatenation order you use later. If you change concatenation order, this will silently miscompare.If you expect to change concatenation order in future, consider computing per-rank-per-source slices and then concatenating both actual and reference with the same order variable to avoid hidden coupling.
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (2)
190-198
: Top-k divisibility guard contradicts fused path capability
enable_alltoall
still enforcesexperts_per_token % 4 == 0
, but the fused multi-field alltoall path is intended to support arbitrary top_k. This condition will prevent the new path from engaging for non-multiples-of-4.If the fused kernels support arbitrary top_k, drop this guard:
@cached_property def enable_alltoall(self): - return (self.mapping.moe_ep_size > self.routing_method.experts_per_token - and self.routing_method.experts_per_token % 4 == 0 # alltoall without allgather only supports top_k % 4 == 0 + return (self.mapping.moe_ep_size > self.routing_method.experts_per_token and self.mapping.enable_attention_dp and self.mapping.tp_size > 1 and os.environ.get("TRTLLM_MOE_DISABLE_ALLTOALLV", "0") != "1" and MnnvlMemory.supports_mnnvl())If there are known exceptions (e.g., legacy/non-fused kernels), gate them by kernel selection rather than a global condition here.
245-249
: No-op assignment for output_dtype
output_dtype = output_dtype
in theisinstance(x, Fp4QuantizedTensor)
branch is redundant.if isinstance(x, Fp4QuantizedTensor): - assert output_dtype is not None - output_dtype = output_dtype + assert output_dtype is not None else: output_dtype = x.dtypecpp/tensorrt_llm/thop/moeCommOp.cpp (1)
141-207
: moePrepareOp shape/flow is sound; minor cleanupsThe new API returns the cumsums and indices plus optional gathered statics, which aligns with the kernel changes. Two small cleanups:
- Remove unused
preparedLocalExpertIds
allocation.- Consider asserting
slotCount % epSize == 0
to matchexpertCountPerRank = slotCount / epSize
assumption in the kernel.-torch::Tensor preparedLocalExpertIds - = torch::empty({maxTokenCountPerRank * epSize, topK}, expertsIds.options().dtype(torch::kInt32)); +// preparedLocalExpertIds no longer used in the new API; removed. +TORCH_CHECK(slotCount % epSize == 0, "slotCount must be divisible by epSize");
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (5)
cpp/tensorrt_llm/kernels/moePrepareKernels.cu
(7 hunks)cpp/tensorrt_llm/kernels/moePrepareKernels.h
(3 hunks)cpp/tensorrt_llm/thop/moeCommOp.cpp
(6 hunks)tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
(1 hunks)tests/unittest/_torch/thop/test_moe_alltoall.py
(8 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tests/unittest/_torch/thop/test_moe_alltoall.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tests/unittest/_torch/thop/test_moe_alltoall.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
cpp/tensorrt_llm/kernels/moePrepareKernels.cu
cpp/tensorrt_llm/kernels/moePrepareKernels.h
cpp/tensorrt_llm/thop/moeCommOp.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/kernels/moePrepareKernels.cu
cpp/tensorrt_llm/kernels/moePrepareKernels.h
cpp/tensorrt_llm/thop/moeCommOp.cpp
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/kernels/moePrepareKernels.cu
cpp/tensorrt_llm/thop/moeCommOp.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/kernels/moePrepareKernels.cu
cpp/tensorrt_llm/kernels/moePrepareKernels.h
cpp/tensorrt_llm/thop/moeCommOp.cpp
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/tensorrt_llm/kernels/moePrepareKernels.h
🧠 Learnings (2)
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
tests/unittest/_torch/thop/test_moe_alltoall.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
cpp/tensorrt_llm/kernels/moePrepareKernels.cu
cpp/tensorrt_llm/thop/moeCommOp.cpp
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
🪛 Ruff (0.12.2)
tests/unittest/_torch/thop/test_moe_alltoall.py
919-919: Redefinition of unused TestMoeAlltoAllFP8SingleGPU
from line 319
(F811)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (8)
tests/unittest/_torch/thop/test_moe_alltoall.py (1)
79-85
: Unify workspace_size units across all FP8 testsThe allocation for
all_workspaces
must consistently use either the rawworkspace_size
(if it already represents the number of uint64 entries) orworkspace_size // 8
(if it returns bytes). In the current file, four blocks divide by 8, but the last block (lines 994–996) does not, risking under/over‐allocation.Please confirm the return units of:
torch.ops.trtllm.get_moe_commworkspace_size_per_rank
torch.ops.trtllm.get_moe_prepare_workspace_size_per_rank
Then standardize allocations in:
- tests/unittest/_torch/thop/test_moe_alltoall.py
• Lines 79–81
• Lines 119–121
• Lines 276–278
• Lines 394–396
• Lines 994–996Suggested diff for the final block (lines 994–996):
- all_workspaces = torch.zeros(1, - workspace_size, + all_workspaces = torch.zeros(1, + workspace_size // 8, dtype=torch.uint64, device='cuda')Apply the same convention to the
get_moe_prepare_workspace_size_per_rank
tests once confirmed.tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)
355-370
: Alltoall prepare and fused dispatch look correctThe move to a single fused alltoallv for
[x, x_sf, token_selected_experts, token_final_scales]
with precomputedalltoall_info
and post-comm memset of expert IDs matches the new API and reduces redundant traffic.cpp/tensorrt_llm/kernels/moePrepareKernels.h (2)
43-48
: Per-slot counter array size ties to runtime; check is enforced upstreamThe change to
int volatile values[512]
matches the upstream TORCH_CHECK (expertCount + 1 <= 512
) in moeCommOp.cpp. This is good to prevent OOB indexing.
55-66
: Workspace base pointer arithmetic flattened to workspacePtr; OK but relies on correct stride units
getFifoConnInfo
now derives the FIFO base usingworkspacePtr
andrankStrideInU64
. This assumes the Python side provides the stride in uint64 units. Ensure all callers allocateallWorkspaces
second dimension in units of uint64s per rank, not bytes.If any Python tests are dividing by 8 (bytes→u64), standardize all to the same convention and document it in the op’s Python docstring.
cpp/tensorrt_llm/kernels/moePrepareKernels.cu (3)
126-137
: Sender-side stat emission is clear and boundedThe loop emits the send count (index 0) and, if provided, expert statics (indices 1..expertCount). This aligns with the 512-slot buffer and the upstream check.
160-174
: Receiver-side stat acquisition: relies on proper acquire/release semanticsWith the proposed
CounterCommunicator
fix, therecvCountAndStatics
loop is robust. Without it, this spinning loop may not observe producer writes.Please adopt the release/acquire diff above to prevent rare hangs under load.
332-335
: Prepare workspace size returns bytes; ensure Python allocates in uint64 units
getMoePrepareWorkspaceSize
returnssizeof(MoeCommFifoConnInfo) * epSize
(bytes). Python currently passesallWorkspaces.stride(0)
in units of uint64s. Allocate the second dimension asworkspace_size / sizeof(uint64_t)
to keep stride consistent.I can provide a repo-wide patch to normalize prepare/comm workspace sizing once you confirm intended units for both APIs.
cpp/tensorrt_llm/thop/moeCommOp.cpp (1)
126-139
: Workspace initialization path LGTM
moe_initialize_workspace
buildsFusedMoeWorkspace
and initializes it with world info. This aligns with the fused comm path expectations.
/bot run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (2)
cpp/tensorrt_llm/kernels/moePrepareKernels.cu (1)
60-77
: Fix cross-CTA visibility and type mismatch in CounterCommunicator (use s32 acquire/release).Plain volatile stores/loads across CTAs are not sufficient; readers can spin on stale zeros. Also,
value
isuint64_t
whilevalues[]
is typicallyint
(s32). Use release/acquire semantics on s32 and keep types consistent.Apply this diff:
- __forceinline__ __device__ void releaseValue(uint64_t value, int index) + __forceinline__ __device__ void releaseValue(int value, int index) { // Avoid block on 0 - fifoConnInfo->values[index] = value + 1; + int v = value + 1; + // Release store to make writes visible across CTAs/SMs + asm volatile("st.release.sys.global.s32 [%0], %1;" :: "l"(&fifoConnInfo->values[index]), "r"(v) : "memory"); } - __forceinline__ __device__ uint64_t acquireValue(int index) + __forceinline__ __device__ int acquireValue(int index) { - uint64_t localValue = 0; + int localValue = 0; do { - localValue = fifoConnInfo->values[index]; + // Acquire load to observe prior release + asm volatile("ld.acquire.sys.global.s32 %0, [%1];" : "=r"(localValue) : "l"(&fifoConnInfo->values[index]) : "memory"); } while (localValue == 0); fifoConnInfo->values[index] = 0; // reset the value return localValue - 1; }Note: Call sites already operate on ints; this change aligns types and ordering.
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)
46-51
: Tighten architecture guards: require CUDA_ARCH presence and use parentheses.You changed from the prior incorrect
||
to&&
(good), but the recommended guard is#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= X)
rather thandefined(__CUDACC__) && __CUDA_ARCH__ >= X
. The latter is redundant and risks subtle host-compilation parsing.Apply representative diffs (repeat for all guarded blocks):
-#if defined(__CUDACC__) && __CUDA_ARCH__ >= 800 +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800) asm("mbarrier.init.shared.b64 [%0], %1;" : : "r"(__as_ptr_smem(addr)), "r"(count) : "memory"); #endif @@ -#if defined(__CUDACC__) && __CUDA_ARCH__ >= 900 +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900) asm("mbarrier.expect_tx.relaxed.cta.shared::cta.b64 [%0], %1;" : : "r"(__as_ptr_smem(addr)), "r"(txCount) : "memory"); #endif @@ -#if defined(__CUDACC__) && __CUDA_ARCH__ >= 800 +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800) uint64_t state; asm("mbarrier.arrive.shared.b64 %0, [%1];" : "=l"(state) : "r"(__as_ptr_smem(addr)) : "memory"); return state; #else return 0; #endif @@ -#if defined(__CUDACC__) && __CUDA_ARCH__ >= 900 +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900) uint64_t state; asm("mbarrier.arrive.expect_tx.release.cta.shared::cta.b64 %0, [%1], %2;" : "=l"(state) : "r"(__as_ptr_smem(addr)), "r"(txCount) : "memory"); return state; #else return 0; #endif @@ -#if defined(__CUDACC__) && __CUDA_ARCH__ >= 900 +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900) uint32_t waitComplete; asm("{\n\t .reg .pred P_OUT; \n\t" "mbarrier.try_wait.parity.shared::cta.b64 P_OUT, [%1], %2;\n\t" "selp.b32 %0, 1, 0, P_OUT; \n" "}" : "=r"(waitComplete) : "r"(__as_ptr_smem(addr)), "r"(phaseParity) : "memory"); return static_cast<bool>(waitComplete); #else return false; #endif @@ -#if defined(__CUDACC__) && __CUDA_ARCH__ >= 800 +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800) asm volatile( "{\n" " .reg .pred p;\n" " setp.ne.b32 p, %0, 0;\n" " @p cp.async.ca.shared.global [%1], [%2], %3;\n" "}\n" ::"r"((int) predGuard), "r"(__as_ptr_smem(dstShm)), "l"(__as_ptr_gmem(srcMem)), "n"(COPY_SIZE)); #endif @@ -#if defined(__CUDACC__) && __CUDA_ARCH__ >= 800 +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800) asm volatile("cp.async.commit_group;" : : :); #endif @@ -#if defined(__CUDACC__) && __CUDA_ARCH__ >= 800 +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800) asm volatile("cp.async.wait_group %0;" : : "n"(N) : "memory"); #endif @@ -#if defined(__CUDACC__) && __CUDA_ARCH__ >= 900 +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900) asm("cp.async.bulk.shared::cta.global.mbarrier::complete_tx::bytes [%0], [%1], %2, [%3];" : : "r"(__as_ptr_smem(dstMem)), "l"(__as_ptr_gmem(srcMem)), "r"(copySize), "r"( __as_ptr_smem(smemBar)) : "memory"); #endif @@ -#if defined(__CUDACC__) && __CUDA_ARCH__ >= 900 +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900) asm("cp.async.bulk.global.shared::cta.bulk_group [%0], [%1], %2;" : : "l"(__as_ptr_gmem(dstMem)), "r"(__as_ptr_smem(srcMem)), "r"(copySize) : "memory"); #endif @@ -#if defined(__CUDACC__) && __CUDA_ARCH__ >= 900 +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900) asm volatile("cp.async.bulk.commit_group;" : : :); #endif @@ -#if defined(__CUDACC__) && __CUDA_ARCH__ >= 900 +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900) asm volatile("cp.async.bulk.wait_group %0;" : : "n"(N) : "memory"); #endif @@ -#if defined(__CUDACC__) && __CUDA_ARCH__ >= 900 +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900) asm volatile("cp.async.bulk.wait_group.read %0;" : : "n"(N) : "memory"); #endifAlso applies to: 53-61, 63-73, 74-86, 88-103, 105-117, 119-124, 126-132, 134-142, 144-152, 154-159, 161-167, 169-175
🧹 Nitpick comments (3)
cpp/tensorrt_llm/kernels/moePrepareKernels.cu (1)
93-94
: Confirm slotCount divisibility by epSize.
expertCountPerRank = slotCount / epSize
assumes even partitioning; any remainder will corrupt rank mapping when doingexperts[...] / expertCountPerRank
. Validate this at the host before launching the kernel.Add a host-side precondition in
computeCountAndIndice
:void computeCountAndIndice( /* args... */ ) { + TLLM_CHECK_WITH_INFO(slotCount % rankCount == 0, "slotCount (%d) must be divisible by epSize (%d).", + slotCount, rankCount); // existing code ... }cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (2)
1298-1303
: Static caching of max dynamic SMEM may be device-dependent.
static int maxDynamicShmSize = computeMoeAlltoallMaxDynamicSharedMemorySize();
will cache the first device’s value. If the active device can change, recompute per call or per device.Apply this diff:
- static int maxDynamicShmSize = fused_moe_impl::computeMoeAlltoallMaxDynamicSharedMemorySize(); - int groupCountPerCta = std::min(maxGroupCountPerCta, maxDynamicShmSize / warpShmSize); + int maxDynamicShmSize = fused_moe_impl::computeMoeAlltoallMaxDynamicSharedMemorySize(); + int groupCountPerCta = std::min(maxGroupCountPerCta, maxDynamicShmSize / warpShmSize);
1322-1326
: Optional: avoid magic number 48*1024.Use a named constexpr (e.g.,
kDEFAULT_SMEM_PER_BLOCK = 48 * 1024
) or query device attribute to decide when to opt-in.- if (groupCountPerCta * warpShmSize > 48 * 1024) + constexpr int kDEFAULT_SMEM_PER_BLOCK = 48 * 1024; + if (groupCountPerCta * warpShmSize > kDEFAULT_SMEM_PER_BLOCK) { TLLM_CUDA_CHECK(cudaFuncSetAttribute( kernelFn, cudaFuncAttributeMaxDynamicSharedMemorySize, groupCountPerCta * warpShmSize)); }
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
(1 hunks)cpp/tensorrt_llm/kernels/moePrepareKernels.cu
(7 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
cpp/tensorrt_llm/kernels/moePrepareKernels.cu
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
cpp/tensorrt_llm/kernels/moePrepareKernels.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
cpp/tensorrt_llm/kernels/moePrepareKernels.cu
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
cpp/tensorrt_llm/kernels/moePrepareKernels.cu
🧠 Learnings (2)
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
📚 Learning: 2025-08-14T15:36:37.576Z
Learnt from: MatthiasKohl
PR: NVIDIA/TensorRT-LLM#6904
File: cpp/tensorrt_llm/kernels/mlaKernels.cu:436-439
Timestamp: 2025-08-14T15:36:37.576Z
Learning: CUDA kernels prioritize performance and should avoid runtime bounds checking or conditional operations that cause branching/warp divergence. Input validation should be done at the host level before kernel launch, not per-thread in the kernel.
Applied to files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (8)
cpp/tensorrt_llm/kernels/moePrepareKernels.cu (5)
126-137
: Use consistent per-slot send-count semantics and avoid data races.This producer writes both the per-target
sendCounts[targetRankId]
and publishes values through the FIFO. OnceCounterCommunicator
uses release semantics (previous comment), the consumer will see consistent values. Without those semantics, the consumer may read zero or stale values. Ensure the acquire/release fix is applied before relying on these values.
140-143
: Consumer needs acquire semantics and explicit reset ordering.
acquireValue
currently does plain loads and resets the slot. Once changed to acquire-load (previous comment), therecvValue
path is fine. Keep the reset as a relaxed store after consumption as you have it. No additional barriers needed here.Also applies to: 160-174
267-271
: Kernel specialization and launch selection look good.TopK tiling and function pointer selection is clear and bounded by
1 <= topK <= 32
. Grid sizing for senders vs receivers is coherent.Also applies to: 300-302
322-329
: memsetExpertIds now writes to expertIds (not recvExperts) — verify downstream assumptions.Switching the destination to
expertIds
looks intentional for padding withslotCount
. Ensure downstream kernels and host code treatslotCount
as an out-of-range sentinel and not a valid slot.
331-334
: Verify MoeCommFifoConnInfo layout before sizing workspaceFile: cpp/tensorrt_llm/kernels/moePrepareKernels.cu
Lines: 331-334size_t getMoePrepareWorkspaceSize(int epSize) { return sizeof(MoeCommFifoConnInfo) * epSize; }Manual validation required:
Sizing the workspace assizeof(MoeCommFifoConnInfo) * epSize
only covers the struct’s header. It assumes that:
MoeCommFifoConnInfo::values
is an in-place, fixed-size array large enough forcommunicationCount = expertCount + 1
slots, and- All calls to
getFifoConnInfo(..., slotIndex, slotCount)
useslotIndex
within that array’s bounds.I wasn’t able to locate a local definition of
MoeCommFifoConnInfo
—please confirm that both conditions hold. Ifvalues
is instead a flexible array member or externally allocated, update this function to include the full per-slot storage, for example:sizeof(MoeCommFifoConnInfo) + sizeof(ValueType) * (expertCount + 1)cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (3)
352-361
: Verify LL128 finish-sentinel logic.
finish = (value == (step & (1ULL << 63ULL)))
compares against either 0 or 0x8000... depending on step’s MSB, but I don’t see a write path that stores such a sentinel. If finish packets are intended, clarify the encoding (e.g.,value == (step | (1ULL << 63))
or a dedicated tag) and add the matching write in protoPack or the sender path.Would you like me to draft a consistent finish-sentinel scheme for LL128 and wire it through pack/unpack?
892-901
: Receiver updates both sender and receiver tails — confirm intended ownership.
updateReadEntry()
writesmReceiverSideFifoInfo->tail
andmSenderSideFifoInfo->tail
. If sender-side tail is supposed to be updated only by the sender, this can cause contention or invariants violations in multi-SM/multi-CTA settings. If both infos are colocated in the same address space (loopback/local), this may be deliberate. Please confirm.
1616-1693
: Local FIFO test harness is clear and useful.The local send/recv harness parameterization (HAS_BASIC_FIELD template, dynamic shared memory sizing, index mappings) looks solid and should aid regression testing.
PR_Github #15571 [ run ] triggered by Bot |
PR_Github #15571 [ run ] completed with state |
b47166c
to
c9360c9
Compare
/bot run |
PR_Github #15609 [ run ] triggered by Bot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🔭 Outside diff range comments (1)
cpp/tensorrt_llm/thop/moeCommOp.cpp (1)
174-183
: Validate expertsStatics when providedIf expertsStatics is present, ensure dtype and shape are as expected before passing raw pointers into kernels.
- if (expertsStatics.has_value()) + if (expertsStatics.has_value()) { + CHECK_INPUT(expertsStatics.value(), torch::kInt32); + TORCH_CHECK( + expertsStatics.value().dim() == 1 && expertsStatics.value().size(0) == expertCount, + "experts_statics must be a 1D int32 tensor of length expert_count (", expertCount, ")"); localExpertStaticsPtr = expertsStatics.value().data_ptr<int>(); gatheredExpertStatics = torch::empty({epSize, expertCount}, expertsIds.options().dtype(torch::kInt32)); gatheredExpertStaticsPtr = gatheredExpertStatics.value().data_ptr<int>(); }
♻️ Duplicate comments (9)
tests/unittest/_torch/thop/test_moe_alltoall.py (1)
319-435
: Remove duplicate test class definition.The
TestMoeAlltoAllFP8SingleGPU
class is defined twice (lines 319 and 919), which will cause a redefinition error. The second definition should be removed.Remove the duplicate class definition starting at line 919:
-class TestMoeAlltoAllFP8SingleGPU(unittest.TestCase): - - def setUp(self): - torch.manual_seed(0x1234) - tllm.logger.set_level('error') - - def test_moe_alltoall_fp8_with_indices(self): - """Test fp8 alltoall with properly constructed indices""" - torch.cuda.set_device(0) - - # Match dimensions from the error - input_entry_count = 16384 - output_entry_count = 16384 - vector_dim = 2944 - sf_vector_dim = 92 # Scaling factor dimension from error - send_recv_count = 1000 # Number of entries to send/receive - - # ... rest of the duplicate implementation ...Also applies to: 919-1035
tensorrt_llm/_mnnvl_utils.py (1)
529-599
: Fix return type inconsistency.When
x
is a single tensor input, the function converts it to a list for unified handling. However, at line 597, when returning for a single tensor case, it returns[result[0]]
which is still a list containing one element. According to the function signature-> Union[torch.Tensor, List[torch.Tensor]]
, it should return a single tensor when the input was a single tensor.Apply this diff to fix the return type:
- # If input was a single tensor, return a single tensor + # If input was a single tensor, return a single tensor if is_single_tensor: - result = [result[0]] + result = result[0] return resultAdditionally, update the type hints to reflect support for None entries:
-def mnnvl_moe_alltoallv( - x: Union[torch.Tensor, List[torch.Tensor]], +def mnnvl_moe_alltoallv( + x: Union[torch.Tensor, List[Optional[torch.Tensor]]], alltoall_info: MoEAlltoallInfo, workspace: torch.Tensor, ep_rank: int, ep_size: int, - ) -> Union[torch.Tensor, List[torch.Tensor]]: + ) -> Union[torch.Tensor, List[Optional[torch.Tensor]]]:cpp/tensorrt_llm/kernels/moePrepareKernels.cu (1)
60-77
: Memory ordering concerns with volatile operations.The
CounterCommunicator
class uses plain volatile stores/loads for inter-CTA communication. While the diff shows type consistency withuint64_t
values andint
storage, the lack of proper memory ordering could lead to visibility issues where CTAs spin on stale values.Consider using proper release/acquire semantics:
__forceinline__ __device__ void releaseValue(uint64_t value, int index) { // Avoid block on 0 - fifoConnInfo->values[index] = value + 1; + int v = static_cast<int>(value + 1); + // Release store to make writes visible across CTAs + asm volatile("st.release.sys.global.s32 [%0], %1;" :: "l"(&fifoConnInfo->values[index]), "r"(v) : "memory"); } __forceinline__ __device__ uint64_t acquireValue(int index) { - uint64_t localValue = 0; + int localValue = 0; do { - localValue = fifoConnInfo->values[index]; + // Acquire load to observe prior release + asm volatile("ld.acquire.sys.global.s32 %0, [%1];" : "=r"(localValue) : "l"(&fifoConnInfo->values[index])); } while (localValue == 0); fifoConnInfo->values[index] = 0; // reset the value - return localValue - 1; + return static_cast<uint64_t>(localValue - 1); }cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)
46-175
: Fix architecture guards to prevent unsupported PTX on lower compute capabilities.The current preprocessor guards use
#if defined(__CUDACC__) && __CUDA_ARCH__ >= XXX
, but this is incorrect. Thedefined(__CUDACC__)
check is for host-side compilation detection, not device-side. This could allow unsupported PTX instructions to be compiled for lower compute capabilities.Apply these fixes to ensure proper architecture guards:
__device__ __forceinline__ void mbarrier_init(uint64_t* addr, uint32_t const& count) { -#if defined(__CUDACC__) && __CUDA_ARCH__ >= 800 +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800) asm("mbarrier.init.shared.b64 [%0], %1;" : : "r"(__as_ptr_smem(addr)), "r"(count) : "memory"); #endif } __device__ __forceinline__ void mbarrier_expect_tx(uint64_t* addr, const uint32_t txCount) { -#if defined(__CUDACC__) && __CUDA_ARCH__ >= 900 +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900) asm("mbarrier.expect_tx.relaxed.cta.shared::cta.b64 [%0], %1;" : : "r"(__as_ptr_smem(addr)), "r"(txCount) : "memory"); #endif }Apply similar fixes to all other architecture-gated functions (mbarrier_arrive, mbarrier_arrive_expect_tx, mbarrier_try_wait_parity, ldgsts, cp_async_commit_group, cp_async_wait_group, cp_async_bulk_g2s, cp_async_bulk_s2g, cp_async_bulk_commit_group, cp_async_bulk_wait_group, cp_async_bulk_wait_group_read).
cpp/tensorrt_llm/thop/moeCommOp.cpp (3)
209-227
: Fix misleading error text and strengthen guards in memsetExpertIdsThe dim check says “1D” while enforcing 2D. Also add basic param guards.
void memsetExpertIds(torch::Tensor expertsIds, torch::Tensor recvRankCountCumSum, int64_t maxTokenCountPerRank, int64_t topK, int64_t slotCount, int64_t epSize) { CHECK_INPUT(expertsIds, torch::kInt32); - TORCH_CHECK(expertsIds.dim() == 2, "expertsIds must be a 1D tensor"); + TORCH_CHECK(expertsIds.dim() == 2, "expertsIds must be a 2D tensor [max_token_count_per_rank*ep_size, top_k]"); TORCH_CHECK( expertsIds.size(0) == maxTokenCountPerRank * epSize, "expertsIds must have maxTokenCountPerRank * epSize rows"); TORCH_CHECK(expertsIds.size(1) == topK, "expertsIds must have topK columns"); CHECK_INPUT(recvRankCountCumSum, torch::kInt32); TORCH_CHECK(recvRankCountCumSum.dim() == 1, "recvRankCountCumSum must be a 1D tensor"); TORCH_CHECK(recvRankCountCumSum.size(0) == epSize, "recvRankCountCumSum must have epSize elements"); + TORCH_CHECK(maxTokenCountPerRank > 0 && topK > 0 && slotCount > 0 && epSize > 0, + "max_token_count_per_rank, top_k, slot_count, ep_size must be positive"); auto stream = at::cuda::getCurrentCUDAStream();
70-76
: Enforce fieldCount ≤ MOE_COMM_FIELD_MAX_COUNT to avoid out-of-bounds on fieldsInfo[]fieldsInfo has a compile-time max (8). Guard fieldCount to prevent OOB writes.
- int fieldCount = inputs.size(); + int fieldCount = inputs.size(); + TORCH_CHECK( + fieldCount <= tensorrt_llm::kernels::MOE_COMM_FIELD_MAX_COUNT, + "Number of fields (", fieldCount, ") exceeds maximum allowed (", + tensorrt_llm::kernels::MOE_COMM_FIELD_MAX_COUNT, ")");
95-99
: topK is not initialized; pass 0 explicitly (no basic fields) to fillMetaInfoparams.expertParallelInfo.topK is never set. Passing its default to fillMetaInfo is fragile; for fused comm here we don’t use basic fields. Pass 0 explicitly and set the struct for clarity.
params.recvFieldInfo = recvFieldInfo; // Do not need expertParallelInfo for fused moe comm now + params.expertParallelInfo.topK = 0; - params.sendFieldInfo.fillMetaInfo(&(params.sendCommMeta), params.expertParallelInfo.topK, false, false); - params.recvFieldInfo.fillMetaInfo(&(params.recvCommMeta), params.expertParallelInfo.topK, false, false); + params.sendFieldInfo.fillMetaInfo(&(params.sendCommMeta), /*topK=*/0, false, false); + params.recvFieldInfo.fillMetaInfo(&(params.recvCommMeta), /*topK=*/0, false, false);cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (2)
1-15
: Switch to SPDX header format (consistent with project headers)Replace the legacy Apache header with SPDX lines as used elsewhere.
-/* - * Copyright (c) 2022-2025, NVIDIA CORPORATION. All rights reserved. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ +/* + * SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: Apache-2.0 + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */
148-151
: random_shuffle removed in C++17; use std::shuffle with a real RNGThis will not compile under C++17. Replace with std::shuffle and drop the extra std::srand.
// Shuffle unused indices for random assignment - std::srand(42); // Fixed seed for reproducible tests - std::random_shuffle(unusedIndices.begin(), unusedIndices.end()); + std::mt19937 gen(42); // Fixed seed for reproducible tests + std::shuffle(unusedIndices.begin(), unusedIndices.end(), gen);
🧹 Nitpick comments (10)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
465-475
: Clarify statistics gathering on the last callThe implementation accumulates local expert statistics on every chunk (and repeat) via
update_local_statistic
(whenupdates_enabled
), storing counts inlocal_statistic_tensor
, and only callsget_local_statistic_tensor
whenis_last_call
to retrieve the fully aggregated tensor for the subsequent AlltoAll. This avoids redundant communications while ensuring a complete view across all repeats. Static routing (is_static_routing == True
) correctly skips both updates and gathering.• File
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
, around line 465:
Add a comment above theif is_last_call and …
block explaining that
local_statistic_tensor
has accumulated stats from all prior calls, and- gathering is deferred to the final call by design to minimize communication overhead.
No changes to the logic itself are required.
cpp/tensorrt_llm/thop/moeCommOp.cpp (3)
81-88
: Optional: Validate each input/output tensor is CUDA and contiguoussetMoeCommFieldInfo asserts 2D shape, but neither CUDA placement nor contiguity are asserted. A bad placement/stride can miscompute element sizes and lead to memory faults at kernel time.
for (int i = 0; i < fieldCount; i++) { - setMoeCommFieldInfo(sendFieldInfo.fieldsInfo[i], inputs[i]); - setMoeCommFieldInfo(recvFieldInfo.fieldsInfo[i], outputs[i]); + TORCH_CHECK(inputs[i].is_cuda() && outputs[i].is_cuda(), "inputs/outputs must be CUDA tensors"); + TORCH_CHECK(inputs[i].is_contiguous() && outputs[i].is_contiguous(), "inputs/outputs must be contiguous"); + setMoeCommFieldInfo(sendFieldInfo.fieldsInfo[i], inputs[i]); + setMoeCommFieldInfo(recvFieldInfo.fieldsInfo[i], outputs[i]); }
153-155
: Remove unused preparedLocalExpertIds allocationThis tensor is allocated but never used. Drop it to save memory and avoid confusion.
- torch::Tensor preparedLocalExpertIds - = torch::empty({maxTokenCountPerRank * epSize, topK}, expertsIds.options().dtype(torch::kInt32)); + // preparedLocalExpertIds removed: not used in fused prepare workflow
157-158
: Nit: Use lowerCamelCase for local variable namesRecvRankCountCumSum uses uppercased initial letter, while other locals are lowerCamelCase. Consider recvRankCountCumSum for consistency.
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (6)
49-58
: Avoid C RNG in tests; don’t seed global stateUse a local RNG to avoid global state and improve reproducibility. Remove std::srand here.
void SetUp() override { if (shouldSkip()) { skipped = true; GTEST_SKIP() << "Skipping due to no/unsupported GPU"; } TLLM_CUDA_CHECK(cudaStreamCreate(&stream)); - std::srand(42); // Initialize random seed }
68-70
: Add a per-fixture RNGUse a member engine to power all random data generation paths.
bool skipped = false; cudaStream_t stream = nullptr; + std::mt19937 rng{42}; // Fixed seed for reproducible tests
88-104
: Replace rand() with distributionsrand() is non-deterministic across platforms and discouraged. Use the fixture RNG with type-appropriate distributions.
else { // Default initialization with random values - for (size_t i = 0; i < count; i++) - { - if constexpr (std::is_same_v<T, float>) - { - (*hostPtr)[i] = static_cast<float>(rand()) / RAND_MAX * 10.0f; - } - else if constexpr (std::is_same_v<T, int>) - { - (*hostPtr)[i] = rand() % 1000; - } - else - { - (*hostPtr)[i] = static_cast<T>(rand() % 100); - } - } + std::uniform_real_distribution<float> floatDist(0.0f, 10.0f); + std::uniform_int_distribution<int> intDist(0, 999); + std::uniform_int_distribution<int> byteDist(0, 99); + for (size_t i = 0; i < count; i++) + { + if constexpr (std::is_same_v<T, float>) + { + (*hostPtr)[i] = floatDist(rng); + } + else if constexpr (std::is_same_v<T, int>) + { + (*hostPtr)[i] = intDist(rng); + } + else + { + (*hostPtr)[i] = static_cast<T>(byteDist(rng)); + } + } }
153-163
: Handle exhausted unusedIndices and fix signed/unsigned mismatchPrevent leaving -1 in fullMapping if unusedIndices runs out; also use size_t for the index to avoid signed/unsigned warnings.
- int unusedIdx = 0; + size_t unusedIdx = 0; for (int i = 0; i < totalSize; i++) { if (i < providedSize && fullMapping[i] == -1) { // Fix invalid mapping - if (unusedIdx < unusedIndices.size()) + if (unusedIdx < unusedIndices.size()) { fullMapping[i] = unusedIndices[unusedIdx++]; } + else + { + // No more unused indices available, use identity as fallback + fullMapping[i] = i; + } } else if (i >= providedSize) { // Extend mapping if (unusedIdx < unusedIndices.size()) { fullMapping[i] = unusedIndices[unusedIdx++]; } else { // Fallback: identity mapping for remaining fullMapping[i] = i; } }
473-476
: Avoid strict-aliasing UB when packing float into intUse memcpy instead of pointer punning.
else if (hasScales && offsetInWarp < topK * 2) { // Scales area - float scale - = 1.0f + static_cast<float>(warpIdx) * 0.1f + static_cast<float>(offsetInWarp - topK) * 0.01f; - return *reinterpret_cast<int*>(&scale); + float scale = 1.0f + static_cast<float>(warpIdx) * 0.1f + + static_cast<float>(offsetInWarp - topK) * 0.01f; + int packed; + std::memcpy(&packed, &scale, sizeof(packed)); + return packed; }
757-767
: Optional: Prefer RAII for host buffers to simplify cleanupUsing std::vector or smart pointers prevents leaks on early returns and is clearer.
- if (resultRecvTokenSlots) - delete[] resultRecvTokenSlots; - if (resultRecvScales) - delete[] resultRecvScales; - for (int i = 0; i < fieldCount; i++) - { - if (resultRecvFields[i]) - delete[] resultRecvFields[i]; - } + std::unique_ptr<int[]> resultRecvTokenSlotsPtr(resultRecvTokenSlots); + std::unique_ptr<float[]> resultRecvScalesPtr(resultRecvScales); + for (int i = 0; i < fieldCount; i++) + { + std::unique_ptr<uint8_t[]> fieldGuard(resultRecvFields[i]); + }
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (16)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
(1 hunks)cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
(1 hunks)cpp/tensorrt_llm/kernels/moeCommKernels.cu
(0 hunks)cpp/tensorrt_llm/kernels/moeCommKernels.h
(0 hunks)cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h
(1 hunks)cpp/tensorrt_llm/kernels/moePrepareKernels.cu
(7 hunks)cpp/tensorrt_llm/kernels/moePrepareKernels.h
(3 hunks)cpp/tensorrt_llm/thop/moeCommOp.cpp
(6 hunks)cpp/tensorrt_llm/thop/moeLoadBalanceOp.cpp
(0 hunks)cpp/tests/unit_tests/kernels/CMakeLists.txt
(1 hunks)cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
(1 hunks)tensorrt_llm/_mnnvl_utils.py
(5 hunks)tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
(0 hunks)tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
(1 hunks)tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
(5 hunks)tests/unittest/_torch/thop/test_moe_alltoall.py
(8 hunks)
💤 Files with no reviewable changes (4)
- cpp/tensorrt_llm/thop/moeLoadBalanceOp.cpp
- tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
- cpp/tensorrt_llm/kernels/moeCommKernels.cu
- cpp/tensorrt_llm/kernels/moeCommKernels.h
🚧 Files skipped from review as they are similar to previous changes (5)
- cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h
- cpp/tests/unit_tests/kernels/CMakeLists.txt
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
- cpp/tensorrt_llm/kernels/moePrepareKernels.h
- cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tests/unittest/_torch/thop/test_moe_alltoall.py
tensorrt_llm/_mnnvl_utils.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tests/unittest/_torch/thop/test_moe_alltoall.py
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
cpp/tensorrt_llm/thop/moeCommOp.cpp
tensorrt_llm/_mnnvl_utils.py
cpp/tensorrt_llm/kernels/moePrepareKernels.cu
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
cpp/tensorrt_llm/thop/moeCommOp.cpp
cpp/tensorrt_llm/kernels/moePrepareKernels.cu
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
cpp/tensorrt_llm/thop/moeCommOp.cpp
cpp/tensorrt_llm/kernels/moePrepareKernels.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
cpp/tensorrt_llm/thop/moeCommOp.cpp
cpp/tensorrt_llm/kernels/moePrepareKernels.cu
🧠 Learnings (4)
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
tests/unittest/_torch/thop/test_moe_alltoall.py
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
cpp/tensorrt_llm/thop/moeCommOp.cpp
📚 Learning: 2025-08-14T15:36:37.576Z
Learnt from: MatthiasKohl
PR: NVIDIA/TensorRT-LLM#6904
File: cpp/tensorrt_llm/kernels/mlaKernels.cu:436-439
Timestamp: 2025-08-14T15:36:37.576Z
Learning: CUDA kernels prioritize performance and should avoid runtime bounds checking or conditional operations that cause branching/warp divergence. Input validation should be done at the host level before kernel launch, not per-thread in the kernel.
Applied to files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
📚 Learning: 2025-08-12T10:28:57.320Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-12T10:28:57.320Z
Learning: Applies to **/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py} : Prepend NVIDIA copyright header (current year) to all source files
Applied to files:
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
cpp/tensorrt_llm/thop/moeCommOp.cpp
🧬 Code Graph Analysis (6)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (10)
uint64_t
(494-499)int
(364-384)BYTES_PER_128B_BLOCK
(69-107)SenderSideFifoInfo
(30-34)ReceiverSideFifoInfo
(36-40)getLaunchBlockDim
(315-318)getLaunchGridDim
(320-329)setMaxUsableSmCount
(265-277)getMoeCommChannelCount
(302-313)computeWorkspaceSizePreRank
(520-526)
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (3)
cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h (1)
kernels
(22-46)cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (2)
kernels
(27-207)computeWorkspaceSizePreRank
(520-526)cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (12)
tokenCount
(919-1003)tokenCount
(919-919)tokenCount
(1015-1126)tokenCount
(1015-1015)launchSingleG2S
(1439-1454)launchSingleG2S
(1439-1440)launchSingleS2G
(1492-1507)launchSingleS2G
(1492-1493)launchLoopback
(1599-1620)launchLoopback
(1599-1601)launchLocalFifoSendRecv
(1661-1693)launchLocalFifoSendRecv
(1661-1664)
cpp/tensorrt_llm/thop/moeCommOp.cpp (2)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (8)
constructWorkspace
(1369-1374)constructWorkspace
(1369-1369)moeAllToAll
(1290-1352)moeAllToAll
(1290-1290)getFusedMoeCommWorkspaceSize
(1362-1367)getFusedMoeCommWorkspaceSize
(1362-1362)initializeFusedMoeLocalWorkspace
(1376-1379)initializeFusedMoeLocalWorkspace
(1376-1376)cpp/tensorrt_llm/kernels/moePrepareKernels.cu (2)
memsetExpertIds
(322-329)memsetExpertIds
(322-323)
tensorrt_llm/_mnnvl_utils.py (2)
tensorrt_llm/_torch/distributed/communicator.py (1)
tp_size
(46-47)tensorrt_llm/python_plugin.py (4)
workspace
(331-333)workspace
(336-338)shape
(246-247)shape
(250-255)
cpp/tensorrt_llm/kernels/moePrepareKernels.cu (3)
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (8)
topK
(187-286)topK
(187-189)topK
(334-455)topK
(334-336)topK
(533-658)topK
(533-535)topK
(987-1139)topK
(987-990)cpp/include/tensorrt_llm/common/cudaUtils.h (1)
getMultiProcessorCount
(393-400)cpp/tensorrt_llm/kernels/moePrepareKernels.h (1)
MoeCommFifoConnInfo
(43-48)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
tensorrt_llm/_mnnvl_utils.py (8)
MnnvlMemory
(53-336)initialize
(91-100)MnnvlMoe
(350-628)get_moe_workspaces
(358-373)get_moe_prepare_workspace
(376-387)mnnvl_moe_alltoallv_prepare_without_allgather
(399-443)MoEAlltoallInfo
(340-347)mnnvl_moe_alltoallv
(528-599)tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (3)
is_static_routing
(350-351)is_static_routing
(759-761)get_local_statistic_tensor
(563-577)
🪛 Ruff (0.12.2)
tests/unittest/_torch/thop/test_moe_alltoall.py
919-919: Redefinition of unused TestMoeAlltoAllFP8SingleGPU
from line 319
(F811)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
465-465: Line too long (124 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (25)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (5)
196-201
: LGTM! Workspace initialization logic is well-structured.The workspace initialization for MNNVL mode now unconditionally retrieves both the main workspace and prepare workspace, which aligns with the two-stage alltoall flow introduced in this PR.
298-301
: LGTM! Simplified MNNVL alltoall enablement.The method now always returns
True
for MNNVL mode, which is consistent with the new requirement that MNNVL always uses the alltoall path.
580-585
: LGTM! Clean two-stage alltoall dispatch implementation.The dispatch phase correctly passes the required parameters including the
alltoall_info
from the prepare stage, maintaining the two-stage flow architecture.
867-878
: LGTM! Well-structured alltoall_prepare method.The method correctly delegates to the MNNVL prepare function and returns the required tuple of
(token_selected_slots, gathered_local_statistic_tensor, alltoall_info)
for the two-stage flow.
879-895
: LGTM! Complete alltoall_dispatch implementation.The dispatch method properly handles the alltoall operation and updates expert IDs using the new
memset_expert_ids
operation, completing the two-stage flow.tests/unittest/_torch/thop/test_moe_alltoall.py (5)
81-84
: Consistent workspace sizing across test cases.The workspace initialization now uses
workspace_size // 8
for uint64 allocation, which is consistent with the new multi-tensor support. The subsequent call tomoe_initialize_workspace
properly initializes the workspace for rank 0.
86-88
: LGTM! Correct multi-tensor API usage.The test correctly wraps input and output tensors in lists to match the new multi-tensor API, ensuring backward compatibility for single-tensor cases.
117-126
: Consistent multi-tensor handling in warmup.The warmup function correctly uses the list-based API for tensors, maintaining consistency with the main test logic.
160-180
: Well-structured multi-tensor test setup.The test properly creates separate tensors for each dimension in
vector_dims
, preparing input, output, and reference tensors for each, which thoroughly exercises the multi-tensor path.
282-284
: Proper per-rank workspace initialization.Each rank's workspace is initialized with
moe_initialize_workspace
, ensuring proper setup for multi-rank communication.tensorrt_llm/_mnnvl_utils.py (2)
369-372
: LGTM! Proper workspace initialization sequence.The workspace initialization correctly calls
moe_initialize_workspace
with the appropriate rank and size parameters, followed by a barrier to ensure all ranks complete initialization before proceeding.
616-616
: LGTM! Consistent list-based API for combine operation.The moe_comm calls now correctly wrap tensors in lists, aligning with the multi-tensor API pattern introduced throughout the codebase.
Also applies to: 619-619
cpp/tensorrt_llm/kernels/moePrepareKernels.cu (4)
84-138
: LGTM! Clean per-slot statistics implementation.The refactored
computeCountAndSendStatics
function properly handles per-slot statistics distribution, with clear separation between count and expert statistics communication.
140-182
: LGTM! Consistent receiver-side statistics handling.The
recvCountAndStatics
function correctly mirrors the sender-side implementation, properly handling both count and expert statistics reception.
322-329
: LGTM! Simplified memset operation.The
memsetExpertIds
function correctly uses the expert IDs directly as the destination, aligning with the simplified per-slot workflow.
333-334
: LGTM! Clean workspace size calculation.The workspace size calculation is now simplified to
sizeof(MoeCommFifoConnInfo) * epSize
, which is cleaner and more maintainable than the previous META_SIZE/FIFO-based formula.cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (6)
219-327
: LGTM! Well-designed Lamport protocol implementation.The
LamportProto
class provides a clean implementation of the Lamport protocol with proper invalid value handling, pack/unpack operations, and rearm functionality. The in-place nature of the protocol (no additional overhead) is efficiently implemented.
329-477
: LGTM! Efficient LL128 protocol implementation.The
Ll128Proto
class provides an efficient 128-byte aligned protocol with proper tail handling. The 15:16 packing ratio is well-implemented with clear logic for flag placement and tail value management.
807-1156
: LGTM! Comprehensive SingleChannelCommunicator implementation.The
SingleChannelCommunicator
template class provides a complete implementation for managing per-channel FIFO transfers with proper synchronization, entry management, and protocol handling. The separation of sender and receiver logic is clean and maintainable.
1158-1223
: LGTM! Well-structured kernel with proper field count specialization.The
moeAllToAllKernel
properly handles variable field counts through template specialization and correctly manages sender/receiver differentiation.
1276-1288
: LGTM! Proper workspace initialization.The
initializeLocalWorkspace
method correctly initializes the FIFO with the protocol's initialized value and zeros out the info structures.
1290-1352
: LGTM! Robust kernel launch configuration.The
moeAllToAll
function properly determines the optimal group count per CTA based on available shared memory and occupancy constraints, with appropriate fallback logic.cpp/tensorrt_llm/thop/moeCommOp.cpp (3)
100-103
: Confirm rankStrideInU64 is in 64-bit element unitsconstructWorkspace expects rankStrideInU64 measured in uint64 units. With the dtype check added earlier, allWorkspaces.stride(0) now correctly represents u64 elements. No change needed beyond the dtype check.
190-195
: Good parameter threading into computeCountAndIndiceArguments passed match kernel signature and earlier allocations. With the added shape/dtype checks, this call site looks solid.
205-207
: Tuple return order is clear and aligns with the new fused workflowThe new order foregrounds send/recv counts and gathered indices. LGTM.
PR_Github #15609 [ run ] completed with state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (6)
cpp/tensorrt_llm/thop/moeCommOp.cpp (6)
38-46
: Validate dtype forallWorkspaces
beforedata_ptr<uint64_t>()
.We dereference
allWorkspaces
as 64-bit elements later. Add a dtype check to prevent UB if callers pass a different dtype.Apply this diff:
CHECK_INPUT(sendRankCumSum, torch::kInt32); CHECK_INPUT(sendIndiceTensor, torch::kInt32); CHECK_INPUT(recvRankCumSum, torch::kInt32); CHECK_INPUT(recvIndiceTensor, torch::kInt32); + CHECK_INPUT(allWorkspaces, torch::kInt64);
126-139
: Initialize workspace: validateallWorkspaces
dtype.Same precondition as moeCommOp; ensure
allWorkspaces
holds 64-bit words before takingdata_ptr<uint64_t>()
.Apply this diff:
void initializeMoeWorkspace(torch::Tensor allWorkspaces, int64_t epRank, int64_t epSize) { TORCH_CHECK(allWorkspaces.dim() == 2, "allWorkspaces must be a 2D tensor"); TORCH_CHECK(epRank >= 0 && epRank < epSize, "epRank must be in the range [0, epSize)"); + CHECK_INPUT(allWorkspaces, torch::kInt64);
141-151
: Add shape/dtype guards forexpertsIds
andallWorkspaces
.
computeCountAndIndice
expects[tokenCount, topK]
expertsIds and 64-bit workspaces.Apply this diff:
CHECK_INPUT(expertsIds, torch::kInt32); + TORCH_CHECK(expertsIds.dim() == 2, "experts_ids must be a 2D tensor [token_count, top_k]"); + TORCH_CHECK( + expertsIds.size(1) == topK, + "experts_ids second dimension must equal top_k"); + CHECK_INPUT(allWorkspaces, torch::kInt64);
209-221
: Fix misleading tensor-dim error message.The check enforces 2D but the message says 1D. Correct the message to avoid confusion.
Apply this diff:
- TORCH_CHECK(expertsIds.dim() == 2, "expertsIds must be a 1D tensor"); + TORCH_CHECK(expertsIds.dim() == 2, "expertsIds must be a 2D tensor [maxTokenCountPerRank*epSize, topK]");
70-76
: Guard against overflowing thefieldsInfo
array.
fieldCount
must not exceed the compile-time capacity ofFusedMoeFieldInfo::fieldsInfo
. Add a check againstMOE_COMM_FIELD_MAX_COUNT
.Apply this diff:
- int fieldCount = inputs.size(); + int fieldCount = static_cast<int>(inputs.size()); + TORCH_CHECK( + fieldCount <= tensorrt_llm::kernels::MOE_COMM_FIELD_MAX_COUNT, + "Number of fields exceeds MOE_COMM_FIELD_MAX_COUNT");
97-99
:params.expertParallelInfo.topK
is uninitialized before use.You’re passing an indeterminate value into
fillMetaInfo
. Either set it or pass a known constant (0 if unused in fused comm meta).Apply this diff:
- params.sendFieldInfo.fillMetaInfo(&(params.sendCommMeta), params.expertParallelInfo.topK, false, false); - params.recvFieldInfo.fillMetaInfo(&(params.recvCommMeta), params.expertParallelInfo.topK, false, false); + params.sendFieldInfo.fillMetaInfo(&(params.sendCommMeta), /*topK=*/0, false, false); + params.recvFieldInfo.fillMetaInfo(&(params.recvCommMeta), /*topK=*/0, false, false);Optionally also initialize the struct field explicitly before the calls:
+ params.expertParallelInfo.topK = 0;
🧹 Nitpick comments (5)
cpp/tensorrt_llm/thop/moeCommOp.cpp (5)
2-2
: Update copyright year to include 2025.Repo guideline says to prepend the current year; header still shows 1993-2024.
Apply this diff:
- * SPDX-FileCopyrightText: Copyright (c) 1993-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
86-87
: Clarify magic arguments with inline comments.Use inline C comments to document the meaning of positional booleans/integers per style guide.
Apply this diff:
- sendFieldInfo.fillFieldPlacementInfo(0, false); - recvFieldInfo.fillFieldPlacementInfo(0, false); + sendFieldInfo.fillFieldPlacementInfo(/*groupOffset=*/0, /*isBroadcast=*/false); + recvFieldInfo.fillFieldPlacementInfo(/*groupOffset=*/0, /*isBroadcast=*/false);
157-205
: Naming consistency: renameRecvRankCountCumSum
to lowerCamelCase.Local variables should be lowerCamelCase. Rename and update the uses in kernel calls and return tuple.
Apply this diff:
- torch::Tensor RecvRankCountCumSum = torch::empty({epSize}, expertsIds.options().dtype(torch::kInt32)); + torch::Tensor recvRankCountCumSum = torch::empty({epSize}, expertsIds.options().dtype(torch::kInt32)); @@ - tensorrt_llm::kernels::moe_prepare::computeCountAndIndice(expertsIds.data_ptr<int>(), - sendRankCountCumSum.data_ptr<int>(), RecvRankCountCumSum.data_ptr<int>(), sendRankIndices.data_ptr<int>(), + tensorrt_llm::kernels::moe_prepare::computeCountAndIndice(expertsIds.data_ptr<int>(), + sendRankCountCumSum.data_ptr<int>(), recvRankCountCumSum.data_ptr<int>(), sendRankIndices.data_ptr<int>(), @@ - tensorrt_llm::kernels::moe_prepare::computeCumsum( - sendRankCountCumSum.data_ptr<int>(), RecvRankCountCumSum.data_ptr<int>(), epRank, epSize, stream); + tensorrt_llm::kernels::moe_prepare::computeCumsum( + sendRankCountCumSum.data_ptr<int>(), recvRankCountCumSum.data_ptr<int>(), epRank, epSize, stream); @@ - tensorrt_llm::kernels::moe_prepare::moveIndice(sendRankCountCumSum.data_ptr<int>(), - RecvRankCountCumSum.data_ptr<int>(), sendRankIndices.data_ptr<int>(), gatherSendRankIndices.data_ptr<int>(), + tensorrt_llm::kernels::moe_prepare::moveIndice(sendRankCountCumSum.data_ptr<int>(), + recvRankCountCumSum.data_ptr<int>(), sendRankIndices.data_ptr<int>(), gatherSendRankIndices.data_ptr<int>(), @@ - return std::make_tuple(sendRankCountCumSum, gatherSendRankIndices, RecvRankCountCumSum, gatherRecvRankIndices, + return std::make_tuple(sendRankCountCumSum, gatherSendRankIndices, recvRankCountCumSum, gatherRecvRankIndices, gatherBackwardRecvRankIndices, gatheredExpertStatics);
276-279
: Unify parameter naming style in schema (allWorkspace
→all_workspaces
).Elsewhere you use snake_case; keep it consistent across op schemas.
Apply this diff:
- "mnnvl_moe_alltoallv_prepare_without_allgather(Tensor experts_ids, Tensor? experts_statics, " - "Tensor allWorkspace, int max_token_count_per_rank, int ep_rank, int ep_size, int expert_count, int " + "mnnvl_moe_alltoallv_prepare_without_allgather(Tensor experts_ids, Tensor? experts_statics, " + "Tensor all_workspaces, int max_token_count_per_rank, int ep_rank, int ep_size, int expert_count, int "
146-149
: Avoid magic numbers for expert/slot divisibility and max bound.Replace 4 and 512 with named
constexpr
constants to document intent and ease maintenance.You can add near the top of this file (namespace scope):
namespace { constexpr int kEXPERT_SLOT_ALIGNMENT = 4; constexpr int kMAX_EXPERT_PLUS_ONE = 512; } // anonymous namespaceThen update:
- TORCH_CHECK(expertCount % 4 == 0, "expertCount must be divisible by 4"); - TORCH_CHECK(slotCount % 4 == 0, "slotCount must be divisible by 4"); - TORCH_CHECK(expertCount + 1 <= 512, "expertCount + 1 is larger than 512"); + TORCH_CHECK(expertCount % kEXPERT_SLOT_ALIGNMENT == 0, "expertCount must be divisible by 4"); + TORCH_CHECK(slotCount % kEXPERT_SLOT_ALIGNMENT == 0, "slotCount must be divisible by 4"); + TORCH_CHECK(expertCount + 1 <= kMAX_EXPERT_PLUS_ONE, "expertCount + 1 is larger than 512");
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
cpp/tensorrt_llm/thop/moeCommOp.cpp
(6 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/thop/moeCommOp.cpp
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/thop/moeCommOp.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/thop/moeCommOp.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tensorrt_llm/thop/moeCommOp.cpp
🧠 Learnings (2)
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
cpp/tensorrt_llm/thop/moeCommOp.cpp
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
cpp/tensorrt_llm/thop/moeCommOp.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
403a4a9
to
e197aa5
Compare
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
8d7c3a5
to
b507f72
Compare
/bot run --disable-fail-fast |
PR_Github #16268 [ run ] triggered by Bot |
PR_Github #16257 [ run ] completed with state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (10)
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (9)
49-58
: Stop using global C RNG; use a local C++ RNG.Remove std::srand(42) and add a deterministic engine as a member. This also supports switching rand() to distributions below.
void SetUp() override { if (shouldSkip()) { skipped = true; GTEST_SKIP() << "Skipping due to no/unsupported GPU"; } TLLM_CUDA_CHECK(cudaStreamCreate(&stream)); - std::srand(42); // Initialize random seed } @@ - bool skipped = false; - cudaStream_t stream = nullptr; + bool skipped = false; + cudaStream_t stream = nullptr; + std::mt19937 rng{42}; // deterministic engine for testsAlso applies to: 68-70
71-107
: Replace rand() with distributions (deterministic, guideline-compliant).Prefer C++ RNGs over C APIs and avoid implicit RAND_MAX scaling.
else { // Default initialization with random values - for (size_t i = 0; i < count; i++) - { - if constexpr (std::is_same_v<T, float>) - { - (*hostPtr)[i] = static_cast<float>(rand()) / RAND_MAX * 10.0f; - } - else if constexpr (std::is_same_v<T, int>) - { - (*hostPtr)[i] = rand() % 1000; - } - else - { - (*hostPtr)[i] = static_cast<T>(rand() % 100); - } - } + std::uniform_real_distribution<float> floatDist(0.0f, 10.0f); + std::uniform_int_distribution<int> intDist(0, 999); + std::uniform_int_distribution<int> byteDist(0, 99); + for (size_t i = 0; i < count; i++) + { + if constexpr (std::is_same_v<T, float>) + { + (*hostPtr)[i] = floatDist(rng); + } + else if constexpr (std::is_same_v<T, int>) + { + (*hostPtr)[i] = intDist(rng); + } + else + { + (*hostPtr)[i] = static_cast<T>(byteDist(rng)); + } + } }
758-767
: Prefer RAII for temporary buffers to simplify cleanup.Use std::vector or std::unique_ptr for resultRecv* arrays to avoid manual delete[] and reduce error surface.
Example with std::vector:
- int* resultRecvTokenSlots = nullptr; + std::vector<int> resultRecvTokenSlots; ... - resultRecvTokenSlots = new int[tokenCount * topK]; - TLLM_CUDA_CHECK(cudaMemcpy( - resultRecvTokenSlots, deviceRecvTokenSlots, tokenCount * topK * sizeof(int), cudaMemcpyDeviceToHost)); + resultRecvTokenSlots.resize(tokenCount * topK); + TLLM_CUDA_CHECK(cudaMemcpy( + resultRecvTokenSlots.data(), deviceRecvTokenSlots, tokenCount * topK * sizeof(int), cudaMemcpyDeviceToHost));Also applies to: 1285-1294
1-15
: Switch to SPDX header (project standard).Replace the legacy Apache header block with the SPDX-style header used elsewhere in this repo.
-/* - * Copyright (c) 2022-2025, NVIDIA CORPORATION. All rights reserved. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ +// SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +// SPDX-License-Identifier: Apache-2.0
17-25
: Missing headers for used APIs (compile issues + UB fix prerequisites).
- You call std::random_shuffle and std::shuffle replacements need .
- You use std::memcpy in recommended fixes; add .
- You use std::is_same_v but <type_traits> is not included (this is a hard compile error).
Add these includes.
#include <atomic> #include <chrono> #include <functional> #include <gtest/gtest.h> #include <memory> #include <random> +#include <algorithm> +#include <cstring> +#include <type_traits> #include <thread> #include <vector>
109-113
: UB: deleting T[] via delete[] on a char; make cleanup type-aware and fix call sites.*Current delete[] static_cast<char*>(hostPtr) is undefined behavior for non-char arrays. Provide a templated deleter and use the correct element type everywhere.
- void cleanup(void* hostPtr, void* devicePtr) - { - delete[] static_cast<char*>(hostPtr); - TLLM_CUDA_CHECK(cudaFree(devicePtr)); - } + template <typename T> + void cleanupArray(T* hostPtr, void* devicePtr) + { + delete[] hostPtr; + TLLM_CUDA_CHECK(cudaFree(devicePtr)); + }Apply across call sites (examples):
- cleanup(hostTokenSlots, deviceTokenSlots); + cleanupArray<int>(hostTokenSlots, deviceTokenSlots); - cleanup(hostScales, deviceScales); + cleanupArray<float>(hostScales, deviceScales); - cleanup(hostFieldPtrs[i], deviceFieldPtrs[i]); + cleanupArray<uint8_t>(static_cast<uint8_t*>(hostFieldPtrs[i]), deviceFieldPtrs[i]); - cleanup(hostShmDump, deviceShmDump); + cleanupArray<int>(hostShmDump, deviceShmDump);Run this to list remaining old calls that need updating:
#!/bin/bash rg -nP 'cleanup\(' cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
148-151
: std::random_shuffle was removed in C++17; use std::shuffle with a seeded engine.This is a compile error with -std=c++17. Replace with std::shuffle.
- // Shuffle unused indices for random assignment - std::srand(42); // Fixed seed for reproducible tests - std::random_shuffle(unusedIndices.begin(), unusedIndices.end()); + // Shuffle unused indices for random assignment + std::mt19937 gen(42); // Fixed seed for reproducible tests + std::shuffle(unusedIndices.begin(), unusedIndices.end(), gen);
156-176
: Fill invalid mappings when unusedIndices are exhausted to avoid -1 entries.Prevent leaving -1 in fullMapping, which propagates invalid indices downstream.
if (i < providedSize && fullMapping[i] == -1) { // Fix invalid mapping - if (unusedIdx < unusedIndices.size()) + if (unusedIdx < static_cast<int>(unusedIndices.size())) { fullMapping[i] = unusedIndices[unusedIdx++]; } + else + { + // Fallback: identity mapping + fullMapping[i] = i; + } }
472-476
: Fix strict-aliasing UB when bit-packing float into int.Replace reinterpret_cast-based type punning with std::memcpy.
- float scale - = 1.0f + static_cast<float>(warpIdx) * 0.1f + static_cast<float>(offsetInWarp - topK) * 0.01f; - return *reinterpret_cast<int*>(&scale); + float scale = 1.0f + static_cast<float>(warpIdx) * 0.1f + + static_cast<float>(offsetInWarp - topK) * 0.01f; + int bits; + static_assert(sizeof(bits) == sizeof(scale), "size mismatch"); + std::memcpy(&bits, &scale, sizeof(bits)); + return bits;cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (1)
16-16
: Use project include guard instead of #pragma once.Per coding guidelines, replace #pragma once with TRTLLM_FUSEDMOECOMMKERNELS_H include guard.
-#pragma once +#ifndef TRTLLM_FUSEDMOECOMMKERNELS_H +#define TRTLLM_FUSEDMOECOMMKERNELS_H @@ -} // namespace kernels -} // namespace tensorrt_llm - +#endif // TRTLLM_FUSEDMOECOMMKERNELS_HAlso applies to: 561-563
🧹 Nitpick comments (19)
tests/unittest/_torch/thop/test_moe_alltoall.py (4)
75-81
: Avoid silent truncation when sizing workspaces; compute u64 words with ceiling.
get_moe_commworkspace_size_per_rank()
returns bytes. Using// 8
truncates if the size isn’t 8-byte aligned (future changes could break this). Compute words with a ceiling division in all three places and use a clearer variable name.Apply this diff in the three blocks:
- workspace_size = torch.ops.trtllm.get_moe_commworkspace_size_per_rank(1) - all_workspaces = torch.zeros(1, - workspace_size // 8, + workspace_size_bytes = torch.ops.trtllm.get_moe_commworkspace_size_per_rank(1) + workspace_words = (workspace_size_bytes + 7) // 8 + all_workspaces = torch.zeros(1, + workspace_words, dtype=torch.uint64, device=torch.device('cuda'))And similarly in
do_warmup()
and the multi-rank section:- workspace_size = torch.ops.trtllm.get_moe_commworkspace_size_per_rank( - world_size) - all_workspaces = torch.zeros(world_size, - workspace_size // 8, + workspace_size_bytes = torch.ops.trtllm.get_moe_commworkspace_size_per_rank(world_size) + workspace_words = (workspace_size_bytes + 7) // 8 + all_workspaces = torch.zeros(world_size, + workspace_words, dtype=torch.uint64, device=torch.device('cuda'))Also applies to: 113-119, 254-260
82-89
: Clarify/standardize the optional boolean list arg tomoe_comm
.Here you pass
[True]
, while the multi-rank calls below omit the flag. If the last argument toggles “combine” or “gathered” mode, keep usage consistent in tests (or add a short comment on its semantics) to reduce reader confusion.
95-113
: Warmup defeats stream isolation due to global synchronizes.
do_warmup()
callstorch.cuda.synchronize()
at entry/exit, so wrapping it in per-rank streams won’t avoid first-launch syncs. Either drop the leading sync or allow an optionalstream
arg and run the ops on that stream to preserve isolation.Minimal tweak:
- def do_warmup(self): - torch.cuda.synchronize() + def do_warmup(self): input_tensor = torch.randn(1, 8,Also applies to: 119-122
526-526
: Wrap overlong lines flagged by Ruff (E501).A few lines exceed 120 chars. Break them across lines (parentheses) to satisfy style without disabling linting.
Also applies to: 528-528, 606-607
cpp/tensorrt_llm/kernels/moePrepareKernels.h (4)
43-48
: Avoid magic-numbered FIFO array size; introduce a named constant.
values[512]
is opaque and hard to change. Prefer aconstexpr
and reuse it for the member.namespace moe_prepare { -#define UNIT_PER_PIPELINE 128 +#define UNIT_PER_PIPELINE 128 #define PIPELINE_PER_CTA 4 #define CUMSUM_THREADS_PER_BLOCK 128 static constexpr int THREADS_PER_PIPELINE = UNIT_PER_PIPELINE; +static constexpr int kFIFO_VALUES_SIZE = 512; @@ struct ALIGN_256 MoeCommFifoConnInfo { volatile uint64_t head; // write position volatile uint64_t tail; // read position - int volatile values[512]; // for values + int volatile values[kFIFO_VALUES_SIZE]; // for values };
59-66
: Document/validate alignment ofworkspacePtr
before casting toMoeCommFifoConnInfo*
.Given
ALIGN_256
onMoeCommFifoConnInfo
, consider adding an assertion or a brief comment guaranteeingworkspacePtr
’s base andrankStrideInU64
keep each instance 256B-aligned to avoid misaligned accesses on device.
1-15
: License years may need updating to include 2025.Repo-wide policy often bumps to the current year on touched files. If that applies here, extend
2022-2024
to2022-2025
.
71-75
: AlignmemsetExpertIds
parameter namesThe declaration in
moePrepareKernels.h
usesepSize
, while the definition inmoePrepareKernels.cu
usesrankCount
. The other kernels (computeCumsum
,moveIndice
) already match parameter names between header and implementation.To resolve, pick one of the following:
- Rename the header parameter to
rankCount
- Rename the
.cu
parameter toepSize
Example header change (rename to
rankCount
):--- a/cpp/tensorrt_llm/kernels/moePrepareKernels.h +++ b/cpp/tensorrt_llm/kernels/moePrepareKernels.h @@ -80,7 +80,7 @@ void memsetExpertIds(int* expertIds, int* recvCountsCumsum, int maxTokenCountPe - int epSize, cudaStream_t stream); + int rankCount, cudaStream_t stream);Files to update:
cpp/tensorrt_llm/kernels/moePrepareKernels.h
(declaration)cpp/tensorrt_llm/kernels/moePrepareKernels.cu
(definition)Verification script (already run) confirms only
memsetExpertIds
is mismatched:rg -nP 'void\s+memsetExpertIds\s*\(' -C2 cpp/tensorrt_llm/kernels rg -nP 'void\s+computeCumsum\s*\(' -C2 cpp/tensorrt_llm/kernels rg -nP 'void\s+moveIndice\s*\(' -C2 cpp/tensorrt_llm/kernelstensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (2)
353-358
: Useself.num_slots
for slot_count for clarity (even if equal to num_experts).In this module
num_slots == num_experts
, but conceptually they differ in other backends. Passingself.num_slots
improves consistency with WideEP and reduces future diff churn.- max_num_token, self.ep_rank, self.ep_size, self.num_experts, - self.num_experts, top_k) + max_num_token, self.ep_rank, self.ep_size, self.num_experts, + self.num_slots, top_k)
373-376
: Minor: maintainx_row
only once.You set
x_row
earlier and reassign fromx_sf.shape[0]
here. Both should match; consider asserting equality or dropping the reassignment to avoid hidden skew if upstream changes.tensorrt_llm/_mnnvl_utils.py (1)
339-347
: Type annotation mismatch:local_gather_indices
can be None.You assign
None
but the dataclass types it astorch.Tensor
. Make it Optional to reflect reality and help static analyzers.-from dataclasses import dataclass -from typing import List, Optional, Union +from dataclasses import dataclass +from typing import List, Optional, Union @@ -@dataclass -class MoEAlltoallInfo: - local_gather_indices: torch.Tensor +@dataclass +class MoEAlltoallInfo: + local_gather_indices: Optional[torch.Tensor]Also applies to: 433-444
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)
465-476
: Two-stage prepare/dispatch API looks clean; minor naming nit.
alltoall_prepare
returnstoken_selected_slots
unchanged; consider returning only the new artifacts (gathered_*
,alltoall_info
) to reduce redundancy, or rename toprepare_slots_and_info
if the intent is to emphasize slot-based routing.Also applies to: 867-876
879-895
: Dispatch +memset_expert_ids
is correct; ensure scales can be None.Downstream
fused_moe
acceptstoken_final_scales=None
for the “apply-on-input” path. Your dispatch preserves that — good. Consider a brief comment here to clarify thattoken_final_scales
may be None by design.
465-465
: Ruff E501: one overlong line; wrap to meet 120 cols.Wrap line 465 to satisfy linting.
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (1)
296-321
: Optional: avoid reinterpret_cast<float const>(int) when reading back scales.**Reading floats through an int* via reinterpret_cast violates aliasing rules; consider memcpy per element for safety in both verifyG2SResults and verifyS2GResults.
Also applies to: 518-526
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (4)
38-40
: Clarify or remove “do we use this?” in public header.Question-style comments in public headers age poorly. Either justify the member or remove the aside.
109-120
: Avoid shadowing ‘alignedUnitBytes’ inside getFieldUncompactSize.Shadowing the name with a constexpr of the same identifier harms readability. Use a distinct constant.
- if (alignedUnitBytes != 16) - { - constexpr int alignedUnitBytes = BYTES_PER_16B_BLOCK; - currentFieldSize = currentFieldSize / alignedUnitBytes * alignedUnitBytes; - currentFieldSize += alignedUnitBytes * 2; - } + if (alignedUnitBytes != BYTES_PER_16B_BLOCK) + { + constexpr int kAlignedBytes = BYTES_PER_16B_BLOCK; + currentFieldSize = currentFieldSize / kAlignedBytes * kAlignedBytes; + currentFieldSize += kAlignedBytes * 2; + }
485-491
: Typo: mappedMemoryrank → mappedMemoryRank.Minor readability polish in pointer arithmetic helper.
- int mappedMemoryrank = isSenderSideBuffer ? pairInfo.senderRank : pairInfo.receiverRank; + int mappedMemoryRank = isSenderSideBuffer ? pairInfo.senderRank : pairInfo.receiverRank; - auto* mappedMemory = reinterpret_cast<uint8_t*>(workspacePtr + mappedMemoryrank * rankStrideInU64); + auto* mappedMemory = reinterpret_cast<uint8_t*>(workspacePtr + mappedMemoryRank * rankStrideInU64);
292-300
: Fix typos in diagnostics and variable name.“at lease” → “at least”; “perferredChannel” → “preferredChannel”.
- TLLM_CHECK_WITH_INFO( - blockCountPerChannel <= smCount, "GPU should support at lease one channel, usableSmCount=%d", smCount); - int perferredChannel = smCount / 2 / blockCountPerChannel; // use half SMs for communication - int channelCount = std::max(perferredChannel, 1); // at lease one channel + TLLM_CHECK_WITH_INFO( + blockCountPerChannel <= smCount, "GPU should support at least one channel, usableSmCount=%d", smCount); + int preferredChannel = smCount / 2 / blockCountPerChannel; // use half SMs for communication + int channelCount = std::max(preferredChannel, 1); // at least one channel
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (19)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
(1 hunks)cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
(1 hunks)cpp/tensorrt_llm/kernels/moeCommKernels.cu
(0 hunks)cpp/tensorrt_llm/kernels/moeCommKernels.h
(0 hunks)cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h
(1 hunks)cpp/tensorrt_llm/kernels/moePrepareKernels.cu
(7 hunks)cpp/tensorrt_llm/kernels/moePrepareKernels.h
(3 hunks)cpp/tensorrt_llm/thop/moeCommOp.cpp
(6 hunks)cpp/tensorrt_llm/thop/moeLoadBalanceOp.cpp
(0 hunks)cpp/tests/unit_tests/kernels/CMakeLists.txt
(1 hunks)cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
(1 hunks)tensorrt_llm/_mnnvl_utils.py
(5 hunks)tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
(2 hunks)tensorrt_llm/_torch/models/modeling_deepseekv3.py
(2 hunks)tensorrt_llm/_torch/models/modeling_speculative.py
(2 hunks)tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
(1 hunks)tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
(5 hunks)tests/integration/test_lists/waives.txt
(0 hunks)tests/unittest/_torch/thop/test_moe_alltoall.py
(9 hunks)
💤 Files with no reviewable changes (4)
- cpp/tensorrt_llm/thop/moeLoadBalanceOp.cpp
- tests/integration/test_lists/waives.txt
- cpp/tensorrt_llm/kernels/moeCommKernels.h
- cpp/tensorrt_llm/kernels/moeCommKernels.cu
🚧 Files skipped from review as they are similar to previous changes (8)
- tensorrt_llm/_torch/models/modeling_speculative.py
- cpp/tests/unit_tests/kernels/CMakeLists.txt
- cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h
- tensorrt_llm/_torch/models/modeling_deepseekv3.py
- cpp/tensorrt_llm/kernels/moePrepareKernels.cu
- cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
- tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
- cpp/tensorrt_llm/thop/moeCommOp.cpp
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
tensorrt_llm/_mnnvl_utils.py
tests/unittest/_torch/thop/test_moe_alltoall.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
tensorrt_llm/_mnnvl_utils.py
tests/unittest/_torch/thop/test_moe_alltoall.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
cpp/tensorrt_llm/kernels/moePrepareKernels.h
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/kernels/moePrepareKernels.h
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/kernels/moePrepareKernels.h
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/tensorrt_llm/kernels/moePrepareKernels.h
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
🧠 Learnings (6)
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
tests/unittest/_torch/thop/test_moe_alltoall.py
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Applied to files:
cpp/tensorrt_llm/kernels/moePrepareKernels.h
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
📚 Learning: 2025-08-12T10:28:57.320Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-12T10:28:57.320Z
Learning: Applies to **/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py} : Prepend NVIDIA copyright header (current year) to all source files
Applied to files:
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
📚 Learning: 2025-08-12T10:28:57.320Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-12T10:28:57.320Z
Learning: Applies to **/*.{h,hpp,hxx,hh,cuh} : Header files must use include guards named TRTLLM_<FILENAME>_H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Applied to files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
📚 Learning: 2025-08-08T05:06:31.596Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.596Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.
Applied to files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
🧬 Code graph analysis (7)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (3)
tensorrt_llm/_torch/utils.py (2)
_
(190-196)shape
(103-104)tensorrt_llm/_mnnvl_utils.py (3)
MnnvlMoe
(350-617)mnnvl_moe_alltoallv_prepare_without_allgather
(399-443)mnnvl_moe_alltoallv
(528-589)tensorrt_llm/quantization/utils/fp8_utils.py (1)
ceil_div
(10-21)
tensorrt_llm/_mnnvl_utils.py (1)
tensorrt_llm/python_plugin.py (2)
workspace
(331-333)workspace
(336-338)
tests/unittest/_torch/thop/test_moe_alltoall.py (2)
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (16)
_
(13-56)_
(60-62)_
(65-68)_
(71-76)_
(79-84)_
(87-99)_
(102-107)_
(110-115)_
(118-123)_
(126-138)_
(141-147)_
(150-151)_
(154-157)_
(161-162)_
(165-166)_
(169-180)tensorrt_llm/_mnnvl_utils.py (1)
mnnvl_moe_alltoallv_prepare_without_allgather
(399-443)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
tensorrt_llm/_mnnvl_utils.py (8)
MnnvlMemory
(53-336)initialize
(91-100)MnnvlMoe
(350-617)get_moe_workspaces
(358-373)get_moe_prepare_workspace
(376-387)mnnvl_moe_alltoallv_prepare_without_allgather
(399-443)MoEAlltoallInfo
(340-347)mnnvl_moe_alltoallv
(528-589)tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (3)
is_static_routing
(350-351)is_static_routing
(759-761)get_local_statistic_tensor
(563-577)
cpp/tensorrt_llm/kernels/moePrepareKernels.h (2)
cpp/tensorrt_llm/kernels/moePrepareKernels.cu (6)
computeCumsum
(304-310)computeCumsum
(304-304)moveIndice
(312-320)moveIndice
(312-314)memsetExpertIds
(322-329)memsetExpertIds
(322-323)cpp/tensorrt_llm/thop/moeCommOp.cpp (2)
memsetExpertIds
(225-243)memsetExpertIds
(225-226)
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (3)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (2)
kernels
(27-207)computeWorkspaceSizePreRank
(520-526)cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h (1)
kernels
(22-46)cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (12)
tokenCount
(736-776)tokenCount
(736-736)tokenCount
(788-839)tokenCount
(788-788)launchSingleG2S
(1137-1152)launchSingleG2S
(1137-1138)launchSingleS2G
(1190-1205)launchSingleS2G
(1190-1191)launchLoopback
(1273-1294)launchLoopback
(1273-1275)launchLocalFifoSendRecv
(1335-1367)launchLocalFifoSendRecv
(1335-1338)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (3)
cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h (2)
tensorrt_llm
(20-47)kernels
(22-46)cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (24)
void
(177-215)void
(872-936)void
(1097-1135)void
(1154-1188)void
(1207-1270)void
(1297-1333)fillFieldInfo
(177-177)setMaxUsableSmCount
(1070-1073)setMaxUsableSmCount
(1070-1070)int
(344-350)getFusedMoeCommWorkspaceSize
(1075-1080)getFusedMoeCommWorkspaceSize
(1075-1075)fillMetaInfo
(952-959)fillMetaInfo
(952-953)fillFieldPlacementInfo
(961-987)fillFieldPlacementInfo
(961-961)initializeLocalWorkspace
(989-1001)initializeLocalWorkspace
(989-989)moeAllToAll
(1003-1065)moeAllToAll
(1003-1003)constructWorkspace
(1082-1087)constructWorkspace
(1082-1082)initializeFusedMoeLocalWorkspace
(1089-1092)initializeFusedMoeLocalWorkspace
(1089-1089)cpp/include/tensorrt_llm/common/cudaUtils.h (1)
getMultiProcessorCount
(393-400)
🪛 Ruff (0.12.2)
tests/unittest/_torch/thop/test_moe_alltoall.py
526-526: Line too long (236 > 120)
(E501)
528-528: Line too long (248 > 120)
(E501)
606-606: Line too long (186 > 120)
(E501)
607-607: Line too long (136 > 120)
(E501)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
465-465: Line too long (124 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (12)
tests/unittest/_torch/thop/test_moe_alltoall.py (3)
125-139
: Good parameter coverage for multi-rank smoke/large tests.
236-239
: Reference reconstruction logic is correct but fragile to future shape changes.The global
output_indice_offset
assumes each token contributes exactly once across ranks. If future tests exercise multi-lane routes (top-k > 1), this needs adjustment to accumulate per-rank-per-tensor counts.Would you like me to extend the test to a top-k>1 variant and validate the reconstruction order?
271-286
: Nice parallel a2a invocation and reconstruction with per-rank streams.Also applies to: 290-306
cpp/tensorrt_llm/kernels/moePrepareKernels.h (1)
35-36
: Nice: replaced macro with a constexpr for threads-per-pipeline.tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)
363-371
: Good: unified multi-tensor alltoallv dispatch and post-population of expert IDs.This matches the fused path and removes the padding hacks.
tensorrt_llm/_mnnvl_utils.py (3)
370-373
: Nice: explicit workspace init + barrier avoids first-use races.
528-590
: Multi-tensor a2a API: solid handling of None entries and single-tensor round-trip.The validation and restoration logic look correct.
602-615
: Combine path correctly switches to list-basedmoe_comm
and unpacks.The reshape-and-sum pattern matches the expected top-k combine semantics.
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
196-202
: Good: allocate both a2a and prepare workspaces under MNNVL.This mirrors the fused path expectations.
299-301
: Always-on alltoall gating for MNNVL is reasonable.This simplifies control flow for the fused path.
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (2)
315-329
: Confirm launch dimension policy.The 2 in grid.z encodes both directions; ensure kernels assume grid.z==2 and no divergence when hasBasicFields=false. If not guaranteed, guard by template parameter or assert.
Would you like a quick audit script to grep kernel launch sites and ensure consistent interpretation of grid.z across kernels?
1-15
: Adopt SPDX header.Align with repository convention: replace the block header with SPDX lines.
-/* - * Copyright (c) 2019-2025, NVIDIA CORPORATION. All rights reserved. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ +// SPDX-FileCopyrightText: Copyright (c) 2019-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +// SPDX-License-Identifier: Apache-2.0⛔ Skipped due to learnings
Learnt from: CR PR: NVIDIA/TensorRT-LLM#0 File: CODING_GUIDELINES.md:0-0 Timestamp: 2025-08-12T10:28:57.320Z Learning: Applies to **/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py} : Prepend NVIDIA copyright header (current year) to all source files
/bot run --disable-fail-fast --reuse-test |
PR_Github #16282 [ run ] triggered by Bot |
PR_Github #16268 [ run ] completed with state |
PR_Github #16282 [ run ] completed with state |
/bot run --disable-fail-fast |
PR_Github #16287 [ run ] triggered by Bot |
PR_Github #16287 [ run ] completed with state |
/bot run --disable-fail-fast |
PR_Github #16296 [ run ] triggered by Bot |
PR_Github #16296 [ run ] completed with state |
Change include:
Summary by CodeRabbit
New Features
Refactor
Tests
Chores
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]
to print this help message.See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in thestage-list
parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.For guidance on mapping tests to stage names, see
docs/source/reference/ci-overview.md
and the
scripts/test_to_stage_mapping.py
helper.kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.