[TRTLLM-6743][feat] Optimize and refactor alltoall in WideEP by dongxuy04 · Pull Request #6973 · NVIDIA/TensorRT-LLM · GitHub

Conversation

@dongxuy04
Collaborator

@dongxuy04 dongxuy04 commented Aug 18, 2025

Changes include:

  • Refactor alltoall kernel to support alltoall fusion of multiple fields
  • Fuse alltoall kernel and meta alltoall (@WeiHaocheng)
  • Optimize count and expert statistics (@WeiHaocheng)
  • Some code cleanup (@WeiHaocheng)

Summary by CodeRabbit

  • New Features

    • Unified fused MOE all‑to‑all with multi‑tensor inputs, arbitrary top_k, explicit per‑rank workspace initialization, and expert‑ID memset; public kernel launchers and workspace APIs exposed.
  • Refactor

    • Replaced legacy per‑tensor prepare/indices flow with a fused field‑info/workspace prepare→dispatch model; consolidated MNNVL all‑to‑all paths.
  • Tests

    • Added extensive CUDA unit tests covering G2S/S2G/loopback/local‑FIFO and expanded Python multi‑tensor/multi‑rank/FP8 tests.
  • Chores

    • Removed legacy prepare/local‑gather ops and updated Torch bindings and helper utilities.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
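
For example, a typical invocation that runs only a single test stage with fail-fast disabled (stage name taken from the examples above) might be:

/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"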

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@dongxuy04 dongxuy04 requested review from a team as code owners August 18, 2025 01:42
@dongxuy04 dongxuy04 requested review from QiJune and yizhang-nv August 18, 2025 01:42
@coderabbitai
Contributor

coderabbitai bot commented Aug 18, 2025

📝 Walkthrough


Adds a new fused MoE communication implementation (kernels, headers, host APIs, workspace management, and tests), removes the legacy moeComm kernels, refactors prepare kernels to per-slot statics, updates Torch bindings and Python utilities to a fused multi-tensor workflow, and adds comprehensive CUDA unit tests.

Changes

Cohort / File(s) Summary
Fused MoE kernels & headers (new)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu, cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h, cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h
New fused-MoE implementation: device helpers (pointer translation, fences, barrier primitives), Lamport/Ll128 proto path and proto sizing helpers (FusedMoeProto = Ll128Proto), per-field pack/unpack/memmove, SingleChannelCommunicator FIFO logic, templated moeAllToAllKernel and host moeAllToAll launcher, workspace sizing/construct/initialize APIs, world/expert structs and constants, and kernel-level test helpers.
Legacy MoE kernels removed
cpp/tensorrt_llm/kernels/moeCommKernels.cu, cpp/tensorrt_llm/kernels/moeCommKernels.h
Removes the previous AllToAllChannelCommunicator and all associated device kernels, host launchers, index/prepare kernels, workspace types, constants, and public declarations.
Prepare kernels refactor
cpp/tensorrt_llm/kernels/moePrepareKernels.cu, cpp/tensorrt_llm/kernels/moePrepareKernels.h
Replaces StepCommunicator/PacketPipeline with per-slot counters/statics; MoeCommFifoConnInfo now has per-slot values[]; computeCount/computeCountAndIndice signatures accept expertStatics/gatheredExpertStatics and slotCount; adds memsetExpertIds; workspace sizing simplified to per-rank FIFO info.
Torch C++ ops / bindings
cpp/tensorrt_llm/thop/moeCommOp.cpp, cpp/tensorrt_llm/thop/moeLoadBalanceOp.cpp
Migrates ops to fused path: moe_comm now accepts lists of tensors and builds FusedMoeCommKernelParam/FusedMoeWorkspace; adds moe_initialize_workspace, memset_expert_ids, updates workspace-size binding to fused API; removes legacy prepare/local_gather ops and includes.
Python utils & op registry
tensorrt_llm/_mnnvl_utils.py, tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
get_moe_workspaces now calls moe_initialize_workspace; prepare API drops scales and returns statics; mnnvl_moe_alltoallv gains multi-tensor support and handles None entries; removed legacy fake-ops for prepare/local_gather; added memset_expert_ids and moe_initialize_workspace fake-ops.
Fused MoE Python modules
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py, tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
Consolidates alltoall dispatch to unified multi-tensor path, removes top_k%4 restriction and padding hacks, introduces two-stage alltoall_prepare / alltoall_dispatch, calls memset_expert_ids after dispatch, and propagates alltoall_info.
C++ unit tests (new)
cpp/tests/unit_tests/kernels/CMakeLists.txt, cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
Adds fusedMoeCommKernelTest to CMake and a comprehensive GoogleTest suite exercising G2S, S2G, loopback, and local-FIFO paths across many parameterizations, with setup/teardown, alloc/init helpers, and verification routines.
Python tests update
tests/unittest/_torch/thop/test_moe_alltoall.py
Tests updated for list-based moe_comm usage, per-rank workspace init via moe_initialize_workspace, multi-tensor orchestration, multi-rank single-GPU flows, and added FP8-focused tests.
Minor model changes & waives
tensorrt_llm/_torch/models/*, tests/integration/test_lists/waives.txt
Small model import/behavior tweaks (speculative/MTP load-balancer call adjustments) and removal of two test waives entries.

Sequence Diagram(s)

sequenceDiagram
  participant Py as Python
  participant TH as Torch C++ binding
  participant FM as FusedMoE C++ APIs
  participant K as CUDA Kernels

  Py->>TH: moe_initialize_workspace(all_workspaces, ep_rank, ep_size)
  TH->>FM: initializeFusedMoeLocalWorkspace(...)
  FM->>K: workspace init kernel(s)
  K-->>FM: done
  FM-->>TH: return
  TH-->>Py: return

  Py->>TH: moe_comm([inputs], send_cumsum, send_idx, recv_cumsum, recv_idx, all_workspaces, output_alloc_count, ep_rank, ep_size)
  TH->>FM: moeAllToAll(params, workspace, stream)
  FM->>K: moeAllToAllKernel<<<...>>>(...)
  K-->>FM: transfers complete
  FM-->>TH: return
  TH-->>Py: return

  Py->>TH: memset_expert_ids(expert_ids, recv_cumsum, max_tokens, top_k, slot_count, ep_size)
  TH->>K: memsetExpertIds kernel
  K-->>TH: done
  TH-->>Py: return
sequenceDiagram
  participant Mod as WideEPMoE
  participant Util as _mnnvl_utils
  participant TH as Torch C++ binding
  participant FM as FusedMoE C++ APIs
  participant K as CUDA Kernels

  Mod->>Util: mnnvl_moe_alltoallv_prepare_without_allgather(...)
  Util-->>Mod: alltoall_info, gathered_statics
  Mod->>Util: mnnvl_moe_alltoallv([x, x_sf, slots, scales], alltoall_info, workspace, ep_rank, ep_size)
  Util->>TH: moe_comm([...], ..., all_workspaces, output_alloc_count, ep_rank, ep_size)
  TH->>FM: moeAllToAll(...)
  FM->>K: moeAllToAllKernel<<<...>>>(...)
  K-->>FM: done
  FM-->>TH: return
  TH-->>Util: return
  Util-->>Mod: [x', x_sf', slots', scales']
  Mod->>TH: memset_expert_ids(...)
  TH-->>Mod: return
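
A minimal single-rank Python sketch of the first flow above (workspace init → fused multi-tensor moe_comm → memset_expert_ids). This is a rough illustration, not the exact API contract: tensor names, shapes, and the assumption that the workspace-size op returns bytes are placeholders; the argument order follows the unit tests added in this PR.

import torch

ep_rank, ep_size = 0, 1
max_tokens, top_k, slot_count, hidden = 16, 4, 8, 128

# Per-rank communication workspace; size assumed to be in bytes, hence // 8 for uint64 elements.
workspace_size = torch.ops.trtllm.get_moe_commworkspace_size_per_rank(ep_size)
all_workspaces = torch.zeros(ep_size, workspace_size // 8, dtype=torch.uint64, device='cuda')
torch.ops.trtllm.moe_initialize_workspace(all_workspaces, ep_rank, ep_size)

# Multi-tensor payload: all fields share a single set of send/recv indices (loopback on one rank).
x = torch.randn(max_tokens, hidden, dtype=torch.float16, device='cuda')
slots = torch.randint(0, slot_count, (max_tokens, top_k), dtype=torch.int32, device='cuda')
scales = torch.rand(max_tokens, top_k, dtype=torch.float32, device='cuda')
inputs = [x, slots, scales]
outputs = [torch.zeros_like(t) for t in inputs]

send_cumsum = torch.tensor([max_tokens], dtype=torch.int32, device='cuda')
recv_cumsum = torch.tensor([max_tokens], dtype=torch.int32, device='cuda')
send_indices = torch.arange(max_tokens, dtype=torch.int32, device='cuda')
recv_indices = torch.arange(max_tokens, dtype=torch.int32, device='cuda')

torch.ops.trtllm.moe_comm(inputs, send_cumsum, send_indices,
                          outputs, recv_cumsum, recv_indices,
                          all_workspaces, ep_rank, ep_size)

# Fill expert ids for slots that received no tokens (arguments per the diagram above).
torch.ops.trtllm.memset_expert_ids(outputs[1], recv_cumsum, max_tokens, top_k, slot_count, ep_size)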

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Suggested labels

SW Architecture


@dongxuy04 dongxuy04 requested a review from WeiHaocheng August 18, 2025 01:42
@dongxuy04
Collaborator Author

/bot run

@dongxuy04 dongxuy04 changed the title Optimize and refactor alltoall in WideEP [None][Optimization]Optimize and refactor alltoall in WideEP Aug 18, 2025
@tensorrt-cicd
Collaborator

PR_Github #15561 [ run ] triggered by Bot

@dongxuy04 dongxuy04 changed the title [None][Optimization]Optimize and refactor alltoall in WideEP [None][Feat]Optimize and refactor alltoall in WideEP Aug 18, 2025
@dongxuy04 dongxuy04 changed the title [None][Feat]Optimize and refactor alltoall in WideEP [None][feat]Optimize and refactor alltoall in WideEP Aug 18, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 11

🔭 Outside diff range comments (2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)

195-198: Inconsistent top_k guard removal.

The comment on line 194 indicates "alltoall without allgather only supports top_k % 4 == 0", but the condition was not fully removed. This creates inconsistency between the comment and actual implementation. The enable_alltoall property now checks if experts_per_token % 4 == 0 rather than supporting generic top_k values as the broader changes suggest.

Either update the comment to reflect the actual constraint or fully remove the % 4 == 0 check:

-                and self.routing_method.experts_per_token % 4 ==
-                0  # alltoall without allgather only supports top_k % 4 == 0
+                # Generic top_k values now supported
cpp/tensorrt_llm/thop/moeCommOp.cpp (1)

141-209: Missing validation for expert count divisibility.

Lines 150-151 check that expertCount + 1 <= 512, but this check seems incomplete. The error message is also not descriptive enough about why this constraint exists.

     CHECK_INPUT(expertsIds, torch::kInt32);
     TORCH_CHECK(expertCount % 4 == 0, "expertCount must be divisible by 4");
     TORCH_CHECK(slotCount % 4 == 0, "slotCount must be divisible by 4");
-    TORCH_CHECK(expertCount + 1 <= 512, 
-        "expertCount + 1 is larger than 512");
+    TORCH_CHECK(expertCount + 1 <= 512, 
+        "expertCount + 1 (" + std::to_string(expertCount + 1) + 
+        ") exceeds maximum supported value of 512");
🧹 Nitpick comments (9)
cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h (2)

18-18: Prefer C++ headers: use <cstdint> instead of <stdint.h>.

Use C++ headers to align with coding standards and avoid pulling symbols into the global namespace unexpectedly.

-#include <stdint.h>
+#include <cstdint>

31-33: Potential duplication/confusion of WARP constants across the codebase.

WARP_SIZE and WARP_MASK are also defined/used elsewhere (e.g., FusedMoeCommunicator, kernels .cu). Consider centralizing these definitions in this common header and consuming them via this header only to avoid divergence.

cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)

1322-1326: Magic number for 48 KiB dynamic shared memory threshold.

Prefer a named constexpr (e.g., kDEFAULT_SMEM_LIMIT_PER_BLOCK) instead of 48 * 1024 to document intent. It also eases future adjustments.

cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (2)

109-121: Variable shadowing inside getFieldUncompactSize; use a distinct name or the constant directly.

alignedUnitBytes is re-declared as a constexpr within the if block, shadowing the outer variable. This is confusing and error-prone. Use a different name or the constant directly.

-    if (alignedUnitBytes != 16)
+    if (alignedUnitBytes != 16)
     {
-        constexpr int alignedUnitBytes = BYTES_PER_16B_BLOCK;
-        currentFieldSize = currentFieldSize / alignedUnitBytes * alignedUnitBytes;
-        currentFieldSize += alignedUnitBytes * 2;
+        currentFieldSize = (currentFieldSize / BYTES_PER_16B_BLOCK) * BYTES_PER_16B_BLOCK;
+        currentFieldSize += BYTES_PER_16B_BLOCK * 2;
     }

260-318: Duplicate WARP_SIZE definition vs common header.

FusedMoeCommunicator defines its own WARP_SIZE = 32, while moeCommKernelsCommon.h also defines WARP_SIZE and WARP_MASK. Prefer a single source of truth (e.g., the common header) to reduce risk of divergence.

cpp/tensorrt_llm/thop/moeCommOp.cpp (2)

31-36: Consider adding alignment checks for field info.

The function accepts a 2D tensor and extracts element size and stride information. Consider validating that the tensor's memory layout meets any alignment requirements that the fused MOE kernels might expect.

 void setMoeCommFieldInfo(tensorrt_llm::kernels::MoeCommFieldInfo& fieldInfo, torch::Tensor const& tensor)
 {
     TORCH_CHECK(tensor.dim() == 2, "tensor must be a 2D tensor");
     int eltSize = tensor.dtype().itemsize();
+    // Validate that the tensor is contiguous or meets alignment requirements
+    TORCH_CHECK(tensor.is_contiguous(), "tensor must be contiguous for optimal MOE communication");
     fieldInfo.fillFieldInfo(static_cast<uint8_t*>(tensor.data_ptr()), eltSize, tensor.size(1), tensor.stride(0));
 }

231-231: Add namespace closing comment.

According to the coding guidelines, namespaces should be closed with a comment naming the namespace.

-} // namespace torch_ext
+} // namespace torch_ext
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (2)

314-314: Use EXPECT_FLOAT_EQ for float comparisons.

For floating-point comparisons, prefer EXPECT_FLOAT_EQ over EXPECT_NEAR with a fixed epsilon when comparing exact values.

-                        EXPECT_NEAR(expected, actual, 1e-6f) << "Scale mismatch at warp=" << tokenId << ", k=" << k;
+                        EXPECT_FLOAT_EQ(expected, actual) << "Scale mismatch at warp=" << tokenId << ", k=" << k;

1280-1289: Consistent cleanup pattern needed.

The cleanup section has inconsistent patterns for checking null pointers before deletion. Array deletion is safe with null pointers in C++.

         // Cleanup temporary arrays
-        if (resultRecvTokenSlots)
-            delete[] resultRecvTokenSlots;
-        if (resultRecvScales)
-            delete[] resultRecvScales;
+        delete[] resultRecvTokenSlots;  // Safe even if null
+        delete[] resultRecvScales;      // Safe even if null
         for (int i = 0; i < fieldCount; i++)
         {
-            if (resultRecvFields[i])
-                delete[] resultRecvFields[i];
+            delete[] resultRecvFields[i];  // Safe even if null
         }
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between ce0b13e and 02bf032.

📒 Files selected for processing (16)
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1 hunks)
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (1 hunks)
  • cpp/tensorrt_llm/kernels/moeCommKernels.cu (0 hunks)
  • cpp/tensorrt_llm/kernels/moeCommKernels.h (0 hunks)
  • cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h (1 hunks)
  • cpp/tensorrt_llm/kernels/moePrepareKernels.cu (7 hunks)
  • cpp/tensorrt_llm/kernels/moePrepareKernels.h (3 hunks)
  • cpp/tensorrt_llm/thop/moeCommOp.cpp (6 hunks)
  • cpp/tensorrt_llm/thop/moeLoadBalanceOp.cpp (0 hunks)
  • cpp/tests/unit_tests/kernels/CMakeLists.txt (1 hunks)
  • cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (1 hunks)
  • tensorrt_llm/_mnnvl_utils.py (5 hunks)
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (0 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (5 hunks)
  • tests/unittest/_torch/thop/test_moe_alltoall.py (8 hunks)
💤 Files with no reviewable changes (4)
  • cpp/tensorrt_llm/thop/moeLoadBalanceOp.cpp
  • cpp/tensorrt_llm/kernels/moeCommKernels.cu
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • cpp/tensorrt_llm/kernels/moeCommKernels.h
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
  • cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
  • cpp/tensorrt_llm/kernels/moePrepareKernels.h
  • cpp/tensorrt_llm/thop/moeCommOp.cpp
  • cpp/tensorrt_llm/kernels/moePrepareKernels.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
  • cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
  • cpp/tensorrt_llm/kernels/moePrepareKernels.h
  • cpp/tensorrt_llm/thop/moeCommOp.cpp
  • cpp/tensorrt_llm/kernels/moePrepareKernels.cu
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<NAME>_H, without leading/trailing underscores (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
  • cpp/tensorrt_llm/kernels/moePrepareKernels.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
  • cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
  • tests/unittest/_torch/thop/test_moe_alltoall.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • cpp/tensorrt_llm/kernels/moePrepareKernels.h
  • cpp/tensorrt_llm/thop/moeCommOp.cpp
  • tensorrt_llm/_mnnvl_utils.py
  • cpp/tensorrt_llm/kernels/moePrepareKernels.cu
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
  • cpp/tensorrt_llm/thop/moeCommOp.cpp
  • cpp/tensorrt_llm/kernels/moePrepareKernels.cu
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tests/unittest/_torch/thop/test_moe_alltoall.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_mnnvl_utils.py
🧠 Learnings (2)
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
🪛 Ruff (0.12.2)
tests/unittest/_torch/thop/test_moe_alltoall.py

919-919: Redefinition of unused TestMoeAlltoAllFP8SingleGPU from line 319

(F811)

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py

465-465: Line too long (124 > 120)

(E501)

🔇 Additional comments (22)
tensorrt_llm/_mnnvl_utils.py (3)

369-373: Workspace initialization + barrier looks correct.

Initializing the workspace via torch.ops.trtllm.moe_initialize_workspace(...) and then hitting a communicator barrier ensures ranks see a consistent FIFO state before first use.


541-571: Dimension validation and preallocation path looks solid.

  • Validates 2D tensors and consistent first dimension for non-None entries.
  • Pre-allocates outputs with correct sizes and dtypes/devices.
  • Executes a single fused moe_comm call across the valid subset.

616-625: Updated moe_comm invocation in combine() matches fused list-based API.

Wrapping inputs/outputs in 1-element lists is the right adaptation for the fused path here.

cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (1)

50-67: Host-side helpers inside SendRecvIndices conditionally compiled only under __CUDACC__.

If these helpers are intended for host code in non-CUDA unit tests or CPU stubs, consider moving __host__ versions outside the #ifdef __CUDACC__ guard. If device-only by design, this is fine.

cpp/tests/unit_tests/kernels/CMakeLists.txt (1)

45-46: Nice addition: fused MoE comm kernel test registered.

Registering fusedMoeCommKernelTest improves coverage for the new fused path. Ensure the target links the correct CUDA archs (SM80/SM90) depending on CI matrix so the inline PTX guards are exercised.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (3)

372-374: Good addition of expert ID population.

The explicit call to memset_expert_ids ensures proper initialization of expert IDs using the alltoall_info metadata. This is a crucial step that was previously handled implicitly.


376-379: FP4 scaling factor shape handling looks correct.

The code properly reshapes the scaling factor tensor after the alltoall dispatch and applies swizzle_sf transformation when needed. The conditional check ensures this only happens when x_sf exists.


357-360: No cleanup needed: API still requires expert_statics argument
The signature of mnnvl_moe_alltoallv_prepare_without_allgather in _mnnvl_utils.py still declares the second parameter as

expert_statics: Optional[torch.Tensor]

and several call sites (e.g. in fused_moe_wide_ep.py and the unit tests) pass a real tensor for this argument. Removing the None in fused_moe_cutlass.py would break the API consistency. Leave the None so that the wrapper continues to satisfy the current signature.

Likely an incorrect or invalid review comment.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)

299-300: Simplified control flow with MNNVL alltoall.

The can_use_alltoall method now unconditionally returns True for MNNVL, enforcing alltoall usage. This simplification is consistent with the two-stage prepare/dispatch flow but removes flexibility for runtime decision-making.


867-893: Well-structured two-stage alltoall API.

The new alltoall_prepare and alltoall_dispatch methods provide a clean separation of concerns. The prepare phase handles metadata computation while dispatch performs the actual data transfer. This design allows for potential optimizations between the two phases.


465-474: Ensure correct handling of local statistics across repeated calls

Please verify that when repeat_count > 1, the local‐statistic tensor is only gathered on the final invocation and that earlier calls correctly pass None into alltoall_prepare. In particular:

  • Confirm that self.layer_load_balancer.get_local_statistic_tensor() is invoked only when
    1. is_last_call is True
    2. self.layer_load_balancer exists and reports non‐static routing.
  • Verify that for all preceding calls (is_last_call == False), loadbalancer_local_statistic_info is indeed None, and that gathered_loadbalancer_local_statistic_info produced by alltoall_prepare is handled correctly (i.e., remains None or merges properly).
  • Add or extend unit/integration tests to simulate repeat_count > 1 and inspect the sequence of tensors through these branches.

Affected block (tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py, ~lines 465–474):

-   if is_last_call and self.layer_load_balancer is not None and not self.layer_load_balancer.is_static_routing():
-       loadbalancer_local_statistic_info = self.layer_load_balancer.get_local_statistic_tensor()
-   else:
-       loadbalancer_local_statistic_info = None
+   # Only gather local stats on the final call when dynamic routing is enabled
+   if (is_last_call
+           and self.layer_load_balancer is not None
+           and not self.layer_load_balancer.is_static_routing()):
+       loadbalancer_local_statistic_info = (
+           self.layer_load_balancer.get_local_statistic_tensor()
+       )
+   else:
+       loadbalancer_local_statistic_info = None
tests/unittest/_torch/thop/test_moe_alltoall.py (3)

81-84: Workspace initialization update looks correct.

The workspace size is now divided by 8 (uint64 elements) and initialized using the new moe_initialize_workspace API. This aligns with the broader workspace management changes in the PR.


160-180: Good multi-tensor test coverage.

The test now properly handles multiple tensor dimensions by creating separate input/output tensors for each dimension in vector_dims. The per-tensor orchestration and verification logic is well-structured.


394-399: Inconsistent workspace size handling in FP8 test.

The FP8 test uses workspace_size directly as the second dimension of the workspace tensor, while other tests divide it by 8. This inconsistency could cause issues.

Apply the same workspace size handling:

         workspace_size = torch.ops.trtllm.get_moe_commworkspace_size_per_rank(1)
         all_workspaces = torch.zeros(1,
-                                     workspace_size,
+                                     workspace_size // 8,
                                      dtype=torch.uint64,
                                      device='cuda')
+        torch.ops.trtllm.moe_initialize_workspace(all_workspaces, 0, 1)
⛔ Skipped due to learnings
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
cpp/tensorrt_llm/kernels/moePrepareKernels.cu (5)

56-78: Per-slot counter implementation looks correct.

The refactored CounterCommunicator now operates on a per-slot values array with proper synchronization using acquire/release semantics. The +1/-1 offset ensures 0 is never used as a valid value, avoiding busy-wait on uninitialized memory.
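
A toy host-side analogue of the +1/-1 offset pattern described here (plain Python, not the CUDA implementation; names and the single-slot protocol are illustrative only):

import numpy as np

# The shared buffer starts zeroed, so 0 means "not written yet". A producer stores value + 1;
# the consumer spins until it sees a non-zero entry, subtracts 1, and resets the slot.
values = np.zeros(8, dtype=np.int64)   # stands in for the per-slot values[] array

def release(slot: int, value: int) -> None:
    values[slot] = value + 1           # never store a raw 0

def acquire(slot: int) -> int:
    while values[slot] == 0:           # on the GPU this is a spin with acquire semantics
        pass
    v = int(values[slot]) - 1
    values[slot] = 0                   # reset for the next round
    return v

release(0, 42)
assert acquire(0) == 42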


85-88: Function signature properly extended for statistics.

The computeCountAndSendStatics function now accepts expertStatics, slotCount, and expertCount parameters to support the new per-slot statistics-based workflow. The naming accurately reflects the extended functionality.


126-137: Statistics communication logic is well-structured.

The code correctly handles both the count (i==0) and expert statistics (i>0) in a unified loop. The conditional logic ensures backward compatibility when expertStatics is nullptr.


328-330: Simplified workspace size calculation.

The workspace size is now simply sizeof(MoeCommFifoConnInfo) * epSize, which is much cleaner than the previous FIFO-based formula. This aligns with the per-slot counter approach.


318-325: memsetExpertIds wrapper is correct

The host function passes the expertIds device pointer directly to the CUDA kernel, and the kernel’s implementation writes into expertIds exactly as intended—there is no stray typo or missing offset in the wrapper.

– The kernel signature (int* expertIds, …) matches the argument passed by the host.
– Inside memsetExpertIdsDevice, writes to expertIds occur at the computed indices.

No changes needed.

cpp/tensorrt_llm/kernels/moePrepareKernels.h (3)

43-48: FIFO structure properly adapted for multi-value storage.

The MoeCommFifoConnInfo structure now uses a 512-element values array instead of a single count, supporting the per-slot statistics workflow. The volatile qualifier ensures proper memory visibility across threads.


55-66: Simplified pointer arithmetic in getFifoConnInfo.

The method now uses workspacePtr directly as the base pointer, simplifying the memory layout. The calculation correctly handles sender/receiver perspective and peer ranking.


71-73: Updated function signature aligns with implementation.

The computeCountAndIndice declaration properly includes the new expertStatics, gatheredExpertStatics, slotCount, and expertCount parameters, matching the implementation in the .cu file.

@tensorrt-cicd
Collaborator

PR_Github #15561 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #11720 completed with status: 'FAILURE'

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (2)
tests/unittest/_torch/thop/test_moe_alltoall.py (1)

319-435: Duplicate FP8 test class — remove the second definition

TestMoeAlltoAllFP8SingleGPU is defined twice. This will shadow the first definition and can lead to confusion and flakiness in test discovery. Keep only one copy.

Apply this diff to remove the duplicate block starting at Line 919:

- class TestMoeAlltoAllFP8SingleGPU(unittest.TestCase):
-
-     def setUp(self):
-         torch.manual_seed(0x1234)
-         tllm.logger.set_level('error')
-
-     def test_moe_alltoall_fp8_with_indices(self):
-         """Test fp8 alltoall with properly constructed indices"""
-         torch.cuda.set_device(0)
- 
-         # Match dimensions from the error
-         input_entry_count = 16384
-         output_entry_count = 16384
-         vector_dim = 2944
-         sf_vector_dim = 92  # Scaling factor dimension from error
-         send_recv_count = 1000  # Number of entries to send/receive
- 
-         # Create input tensors - first as float16, then convert
-         input_tensor_fp16 = torch.randn(input_entry_count,
-                                         vector_dim,
-                                         dtype=torch.float16,
-                                         device='cuda')
-         input_tensor_fp8 = input_tensor_fp16.to(torch.float8_e4m3fn)
- 
-         # Scaling factor tensor
-         input_sf_tensor = torch.randint(1,
-                                         255, (input_entry_count, sf_vector_dim),
-                                         dtype=torch.uint8,
-                                         device='cuda')
- 
-         # Expert selection tensors
-         input_experts = torch.randint(0,
-                                       64, (input_entry_count, 4),
-                                       dtype=torch.int32,
-                                       device='cuda')
-         input_scales = torch.rand(input_entry_count,
-                                   4,
-                                   dtype=torch.float32,
-                                   device='cuda')
- 
-         # Output tensors
-         output_tensor_fp8 = torch.zeros(output_entry_count,
-                                         vector_dim,
-                                         dtype=torch.float8_e4m3fn,
-                                         device='cuda')
-         output_sf_tensor = torch.zeros(output_entry_count,
-                                        sf_vector_dim,
-                                        dtype=torch.uint8,
-                                        device='cuda')
-         output_experts = torch.zeros(output_entry_count,
-                                      4,
-                                      dtype=torch.int32,
-                                      device='cuda')
-         output_scales = torch.zeros(output_entry_count,
-                                     4,
-                                     dtype=torch.float32,
-                                     device='cuda')
- 
-         # Construct send/recv indices
-         send_cumsum = torch.tensor([send_recv_count],
-                                    dtype=torch.int32,
-                                    device='cuda')
-         recv_cumsum = torch.tensor([send_recv_count],
-                                    dtype=torch.int32,
-                                    device='cuda')
- 
-         # Random indices for sending
-         send_indices = torch.randperm(input_entry_count,
-                                       dtype=torch.int32,
-                                       device='cuda')[:send_recv_count]
-         recv_indices = torch.randperm(output_entry_count,
-                                       dtype=torch.int32,
-                                       device='cuda')[:send_recv_count]
- 
-         # Create workspace
-         workspace_size = torch.ops.trtllm.get_moe_commworkspace_size_per_rank(1)
-         all_workspaces = torch.zeros(1,
-                                      workspace_size,
-                                      dtype=torch.uint64,
-                                      device='cuda')
- 
-         print(f"Test configuration:")
-         print(f"  Input entries: {input_entry_count}")
-         print(f"  Vector dim: {vector_dim}")
-         print(f"  SF vector dim: {sf_vector_dim}")
-         print(f"  Send/recv count: {send_recv_count}")
-         print(f"  FP8 tensor shape: {input_tensor_fp8.shape}")
-         print(f"  SF tensor shape: {input_sf_tensor.shape}")
- 
-         try:
-             # Test with all 4 tensors
-             torch.ops.trtllm.moe_comm([
-                 input_tensor_fp8, input_sf_tensor, input_experts, input_scales
-             ], send_cumsum, send_indices, [
-                 output_tensor_fp8, output_sf_tensor, output_experts,
-                 output_scales
-             ], recv_cumsum, recv_indices, all_workspaces, 0, 1)
- 
-             torch.cuda.synchronize()
-             print("FP8 alltoall test PASSED!")
- 
-             # Verify outputs
-             print(f"\nOutput verification:")
-             print(f"  Output FP8 shape: {output_tensor_fp8.shape}")
-             print(f"  Output SF shape: {output_sf_tensor.shape}")
-             print(
-                 f"  Non-zero FP8 elements: {(output_tensor_fp8 != 0).sum().item()}"
-             )
-             print(
-                 f"  Non-zero SF elements: {(output_sf_tensor != 0).sum().item()}"
-             )
- 
-         except Exception as e:
-             print(f"FP8 alltoall test FAILED: {e}")
-             print(f"Error type: {type(e)}")
-             raise

Also applies to: 919-1035

cpp/tensorrt_llm/thop/moeCommOp.cpp (1)

70-85: Bound field count to compile-time maximum to avoid OOB

Before filling fieldsInfo, enforce fieldCount <= MOE_COMM_FIELD_MAX_COUNT to avoid writing past the fixed-size array.

 int fieldCount = inputs.size();
+tensorrt_llm::kernels::FusedMoeFieldInfo sendFieldInfo, recvFieldInfo;
+TORCH_CHECK(
+    fieldCount <= tensorrt_llm::kernels::MOE_COMM_FIELD_MAX_COUNT,
+    "Number of fields (", fieldCount, ") exceeds maximum allowed (",
+    tensorrt_llm::kernels::MOE_COMM_FIELD_MAX_COUNT, ")");
-sendFieldInfo.isBasicInterleaved = false;
+sendFieldInfo.isBasicInterleaved = false;

Note: include or ensure fusedMoeCommKernels.h exposes MOE_COMM_FIELD_MAX_COUNT (already included at Line 19).

🧹 Nitpick comments (5)
tests/unittest/_torch/thop/test_moe_alltoall.py (2)

117-126: Warmup path uses an uninitialized workspace

do_warmup allocates a fresh all_workspaces but does not call moe_initialize_workspace before moe_comm. Initialize it to mirror the main test path.

 all_workspaces = torch.zeros(1,
                              workspace_size // 8,
                              dtype=torch.uint64,
                              device=torch.device('cuda'))
+torch.ops.trtllm.moe_initialize_workspace(all_workspaces, 0, 1)
 torch.ops.trtllm.moe_comm(input_tensors, send_cumsum, send_indices,
                           output_tensors, recv_cumsum, recv_indices,
                           all_workspaces, 0, 1)

251-255: Reference assembly may be fragile across ranks; verify offset usage

output_indice_offset is accumulated across nested loops over other_rank. This is correct if each rank’s recv counts are appended contiguously per source rank, but it couples the reference construction to the same concatenation order you use later. If you change concatenation order, this will silently miscompare.

If you expect to change concatenation order in future, consider computing per-rank-per-source slices and then concatenating both actual and reference with the same order variable to avoid hidden coupling.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (2)

190-198: Top-k divisibility guard contradicts fused path capability

enable_alltoall still enforces experts_per_token % 4 == 0, but the fused multi-field alltoall path is intended to support arbitrary top_k. This condition will prevent the new path from engaging for non-multiples-of-4.

If the fused kernels support arbitrary top_k, drop this guard:

 @cached_property
 def enable_alltoall(self):
-    return (self.mapping.moe_ep_size > self.routing_method.experts_per_token
-            and self.routing_method.experts_per_token % 4 == 0  # alltoall without allgather only supports top_k % 4 == 0
+    return (self.mapping.moe_ep_size > self.routing_method.experts_per_token
             and self.mapping.enable_attention_dp
             and self.mapping.tp_size > 1
             and os.environ.get("TRTLLM_MOE_DISABLE_ALLTOALLV", "0") != "1"
             and MnnvlMemory.supports_mnnvl())

If there are known exceptions (e.g., legacy/non-fused kernels), gate them by kernel selection rather than a global condition here.


245-249: No-op assignment for output_dtype

output_dtype = output_dtype in the isinstance(x, Fp4QuantizedTensor) branch is redundant.

 if isinstance(x, Fp4QuantizedTensor):
-    assert output_dtype is not None
-    output_dtype = output_dtype
+    assert output_dtype is not None
 else:
     output_dtype = x.dtype
cpp/tensorrt_llm/thop/moeCommOp.cpp (1)

141-207: moePrepareOp shape/flow is sound; minor cleanups

The new API returns the cumsums and indices plus optional gathered statics, which aligns with the kernel changes. Two small cleanups:

  • Remove unused preparedLocalExpertIds allocation.
  • Consider asserting slotCount % epSize == 0 to match expertCountPerRank = slotCount / epSize assumption in the kernel.
-torch::Tensor preparedLocalExpertIds
-    = torch::empty({maxTokenCountPerRank * epSize, topK}, expertsIds.options().dtype(torch::kInt32));
+// preparedLocalExpertIds no longer used in the new API; removed.

+TORCH_CHECK(slotCount % epSize == 0, "slotCount must be divisible by epSize");
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 02bf032 and c1a1e71.

📒 Files selected for processing (5)
  • cpp/tensorrt_llm/kernels/moePrepareKernels.cu (7 hunks)
  • cpp/tensorrt_llm/kernels/moePrepareKernels.h (3 hunks)
  • cpp/tensorrt_llm/thop/moeCommOp.cpp (6 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1 hunks)
  • tests/unittest/_torch/thop/test_moe_alltoall.py (8 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tests/unittest/_torch/thop/test_moe_alltoall.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tests/unittest/_torch/thop/test_moe_alltoall.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • cpp/tensorrt_llm/kernels/moePrepareKernels.cu
  • cpp/tensorrt_llm/kernels/moePrepareKernels.h
  • cpp/tensorrt_llm/thop/moeCommOp.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/kernels/moePrepareKernels.cu
  • cpp/tensorrt_llm/kernels/moePrepareKernels.h
  • cpp/tensorrt_llm/thop/moeCommOp.cpp
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/kernels/moePrepareKernels.cu
  • cpp/tensorrt_llm/thop/moeCommOp.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/kernels/moePrepareKernels.cu
  • cpp/tensorrt_llm/kernels/moePrepareKernels.h
  • cpp/tensorrt_llm/thop/moeCommOp.cpp
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<NAME>_H, without leading/trailing underscores (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/tensorrt_llm/kernels/moePrepareKernels.h
🧠 Learnings (2)
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • tests/unittest/_torch/thop/test_moe_alltoall.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • cpp/tensorrt_llm/kernels/moePrepareKernels.cu
  • cpp/tensorrt_llm/thop/moeCommOp.cpp
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
🪛 Ruff (0.12.2)
tests/unittest/_torch/thop/test_moe_alltoall.py

919-919: Redefinition of unused TestMoeAlltoAllFP8SingleGPU from line 319

(F811)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (8)
tests/unittest/_torch/thop/test_moe_alltoall.py (1)

79-85: Unify workspace_size units across all FP8 tests

The allocation for all_workspaces must consistently use either the raw workspace_size (if it already represents the number of uint64 entries) or workspace_size // 8 (if it returns bytes). In the current file, four blocks divide by 8, but the last block (lines 994–996) does not, risking under/over‐allocation.

Please confirm the return units of:

  • torch.ops.trtllm.get_moe_commworkspace_size_per_rank
  • torch.ops.trtllm.get_moe_prepare_workspace_size_per_rank

Then standardize allocations in:

  • tests/unittest/_torch/thop/test_moe_alltoall.py
    • Lines 79–81
    • Lines 119–121
    • Lines 276–278
    • Lines 394–396
    • Lines 994–996

Suggested diff for the final block (lines 994–996):

- all_workspaces = torch.zeros(1,
-                              workspace_size,
+ all_workspaces = torch.zeros(1,
+                              workspace_size // 8,
                               dtype=torch.uint64,
                               device='cuda')

Apply the same convention to the get_moe_prepare_workspace_size_per_rank tests once confirmed.
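
As a concrete illustration, here is the convention the tests could standardize on, shown as a minimal sketch that assumes the size op returns bytes per rank and accepts ep_size as its argument (drop the // 8 conversion if the op already returns uint64 element counts):

import torch

ep_size = 1
workspace_size = torch.ops.trtllm.get_moe_commworkspace_size_per_rank(ep_size)  # assumed: bytes per rank
all_workspaces = torch.zeros(ep_size,
                             workspace_size // 8,  # uint64 entries per rank
                             dtype=torch.uint64,
                             device='cuda')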

tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)

355-370: Alltoall prepare and fused dispatch look correct

The move to a single fused alltoallv for [x, x_sf, token_selected_experts, token_final_scales] with precomputed alltoall_info and post-comm memset of expert IDs matches the new API and reduces redundant traffic.
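
For readers less familiar with the new API, a rough sketch of the fused call shape described above (variable names are illustrative; the argument order follows the mnnvl_moe_alltoallv signature discussed later in this review, and this is not a copy of the module code):

outputs = MnnvlMoe.mnnvl_moe_alltoallv(
    [x, x_sf, token_selected_experts, token_final_scales],  # x_sf may be None
    alltoall_info,  # produced by the prepare step
    workspace,
    ep_rank,
    ep_size,
)
x, x_sf, token_selected_experts, token_final_scales = outputs
# Expert IDs of padded/unreceived rows are then overwritten with the
# out-of-range sentinel via the memset_expert_ids op.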

cpp/tensorrt_llm/kernels/moePrepareKernels.h (2)

43-48: Per-slot counter array size ties to runtime; check is enforced upstream

The change to int volatile values[512] matches the upstream TORCH_CHECK (expertCount + 1 <= 512) in moeCommOp.cpp. This is good to prevent OOB indexing.


55-66: Workspace base pointer arithmetic flattened to workspacePtr; OK but relies on correct stride units

getFifoConnInfo now derives the FIFO base using workspacePtr and rankStrideInU64. This assumes the Python side provides the stride in uint64 units. Ensure all callers allocate the second dimension of allWorkspaces in uint64 entries per rank, not bytes.

If any Python tests are dividing by 8 (bytes→u64), standardize all to the same convention and document it in the op’s Python docstring.

cpp/tensorrt_llm/kernels/moePrepareKernels.cu (3)

126-137: Sender-side stat emission is clear and bounded

The loop emits the send count (index 0) and, if provided, expert statics (indices 1..expertCount). This aligns with the 512-slot buffer and the upstream check.


160-174: Receiver-side stat acquisition: relies on proper acquire/release semantics

With the proposed CounterCommunicator fix, the recvCountAndStatics loop is robust. Without it, this spinning loop may not observe producer writes.

Please adopt the release/acquire diff above to prevent rare hangs under load.


332-335: Prepare workspace size returns bytes; ensure Python allocates in uint64 units

getMoePrepareWorkspaceSize returns sizeof(MoeCommFifoConnInfo) * epSize (bytes). Python currently passes allWorkspaces.stride(0) in units of uint64s. Allocate the second dimension as workspace_size / sizeof(uint64_t) to keep stride consistent.

I can provide a repo-wide patch to normalize prepare/comm workspace sizing once you confirm intended units for both APIs.

cpp/tensorrt_llm/thop/moeCommOp.cpp (1)

126-139: Workspace initialization path LGTM

moe_initialize_workspace builds FusedMoeWorkspace and initializes it with world info. This aligns with the fused comm path expectations.
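
A minimal usage sketch of this path from the Python side, assuming the op mirrors the C++ binding initializeMoeWorkspace(allWorkspaces, epRank, epSize) and that workspace_size is reported in bytes (the op namespace and units here are assumptions to confirm):

all_workspaces = torch.zeros(ep_size,
                             workspace_size // 8,  # uint64 entries per rank
                             dtype=torch.uint64,
                             device='cuda')
torch.ops.trtllm.moe_initialize_workspace(all_workspaces, ep_rank, ep_size)
# In the multi-rank path, a barrier should follow so every rank finishes
# initialization before the first fused alltoall.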

@bobboli bobboli self-requested a review August 18, 2025 02:22
@dongxuy04
Collaborator Author

/bot run

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
cpp/tensorrt_llm/kernels/moePrepareKernels.cu (1)

60-77: Fix cross-CTA visibility and type mismatch in CounterCommunicator (use s32 acquire/release).

Plain volatile stores/loads across CTAs are not sufficient; readers can spin on stale zeros. Also, value is uint64_t while values[] is typically int (s32). Use release/acquire semantics on s32 and keep types consistent.

Apply this diff:

-    __forceinline__ __device__ void releaseValue(uint64_t value, int index)
+    __forceinline__ __device__ void releaseValue(int value, int index)
     {
         // Avoid block on 0
-        fifoConnInfo->values[index] = value + 1;
+        int v = value + 1;
+        // Release store to make writes visible across CTAs/SMs
+        asm volatile("st.release.sys.global.s32 [%0], %1;" :: "l"(&fifoConnInfo->values[index]), "r"(v) : "memory");
     }
 
-    __forceinline__ __device__ uint64_t acquireValue(int index)
+    __forceinline__ __device__ int acquireValue(int index)
     {
-        uint64_t localValue = 0;
+        int localValue = 0;
         do
         {
-            localValue = fifoConnInfo->values[index];
+            // Acquire load to observe prior release
+            asm volatile("ld.acquire.sys.global.s32 %0, [%1];" : "=r"(localValue) : "l"(&fifoConnInfo->values[index]) : "memory");
         } while (localValue == 0);
 
         fifoConnInfo->values[index] = 0; // reset the value
 
         return localValue - 1;
     }

Note: Call sites already operate on ints; this change aligns types and ordering.

cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)

46-51: Tighten architecture guards: require CUDA_ARCH presence and use parentheses.

You changed from the prior incorrect || to && (good), but the recommended guard is #if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= X) rather than defined(__CUDACC__) && __CUDA_ARCH__ >= X. The latter is redundant: __CUDACC__ is set for both the host and device passes, and __CUDA_ARCH__ is undefined during the host pass, so the guard ends up evaluating an undefined macro instead of gating cleanly on the target architecture.

Apply representative diffs (repeat for all guarded blocks):

-#if defined(__CUDACC__) && __CUDA_ARCH__ >= 800
+#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800)
   asm("mbarrier.init.shared.b64 [%0], %1;" : : "r"(__as_ptr_smem(addr)), "r"(count) : "memory");
 #endif
@@
-#if defined(__CUDACC__) && __CUDA_ARCH__ >= 900
+#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
   asm("mbarrier.expect_tx.relaxed.cta.shared::cta.b64 [%0], %1;" : : "r"(__as_ptr_smem(addr)), "r"(txCount) : "memory");
 #endif
@@
-#if defined(__CUDACC__) && __CUDA_ARCH__ >= 800
+#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800)
   uint64_t state;
   asm("mbarrier.arrive.shared.b64 %0, [%1];" : "=l"(state) : "r"(__as_ptr_smem(addr)) : "memory");
   return state;
 #else
   return 0;
 #endif
@@
-#if defined(__CUDACC__) && __CUDA_ARCH__ >= 900
+#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
   uint64_t state;
   asm("mbarrier.arrive.expect_tx.release.cta.shared::cta.b64 %0, [%1], %2;" : "=l"(state) : "r"(__as_ptr_smem(addr)), "r"(txCount) : "memory");
   return state;
 #else
   return 0;
 #endif
@@
-#if defined(__CUDACC__) && __CUDA_ARCH__ >= 900
+#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
   uint32_t waitComplete;
   asm("{\n\t .reg .pred P_OUT; \n\t"
       "mbarrier.try_wait.parity.shared::cta.b64  P_OUT, [%1], %2;\n\t"
       "selp.b32 %0, 1, 0, P_OUT; \n"
       "}" : "=r"(waitComplete) : "r"(__as_ptr_smem(addr)), "r"(phaseParity) : "memory");
   return static_cast<bool>(waitComplete);
 #else
   return false;
 #endif
@@
-#if defined(__CUDACC__) && __CUDA_ARCH__ >= 800
+#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800)
   asm volatile(
       "{\n"
       "  .reg .pred p;\n"
       "  setp.ne.b32 p, %0, 0;\n"
       "  @p cp.async.ca.shared.global [%1], [%2], %3;\n"
       "}\n" ::"r"((int) predGuard), "r"(__as_ptr_smem(dstShm)), "l"(__as_ptr_gmem(srcMem)), "n"(COPY_SIZE));
 #endif
@@
-#if defined(__CUDACC__) && __CUDA_ARCH__ >= 800
+#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800)
   asm volatile("cp.async.commit_group;" : : :);
 #endif
@@
-#if defined(__CUDACC__) && __CUDA_ARCH__ >= 800
+#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800)
   asm volatile("cp.async.wait_group %0;" : : "n"(N) : "memory");
 #endif
@@
-#if defined(__CUDACC__) && __CUDA_ARCH__ >= 900
+#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
   asm("cp.async.bulk.shared::cta.global.mbarrier::complete_tx::bytes [%0], [%1], %2, [%3];"
       : : "r"(__as_ptr_smem(dstMem)), "l"(__as_ptr_gmem(srcMem)), "r"(copySize), "r"( __as_ptr_smem(smemBar)) : "memory");
 #endif
@@
-#if defined(__CUDACC__) && __CUDA_ARCH__ >= 900
+#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
   asm("cp.async.bulk.global.shared::cta.bulk_group [%0], [%1], %2;" : : "l"(__as_ptr_gmem(dstMem)), "r"(__as_ptr_smem(srcMem)), "r"(copySize) : "memory");
 #endif
@@
-#if defined(__CUDACC__) && __CUDA_ARCH__ >= 900
+#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
   asm volatile("cp.async.bulk.commit_group;" : : :);
 #endif
@@
-#if defined(__CUDACC__) && __CUDA_ARCH__ >= 900
+#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
   asm volatile("cp.async.bulk.wait_group %0;" : : "n"(N) : "memory");
 #endif
@@
-#if defined(__CUDACC__) && __CUDA_ARCH__ >= 900
+#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
   asm volatile("cp.async.bulk.wait_group.read %0;" : : "n"(N) : "memory");
 #endif

Also applies to: 53-61, 63-73, 74-86, 88-103, 105-117, 119-124, 126-132, 134-142, 144-152, 154-159, 161-167, 169-175

🧹 Nitpick comments (3)
cpp/tensorrt_llm/kernels/moePrepareKernels.cu (1)

93-94: Confirm slotCount divisibility by epSize.

expertCountPerRank = slotCount / epSize assumes even partitioning; any remainder will corrupt rank mapping when doing experts[...] / expertCountPerRank. Validate this at the host before launching the kernel.

Add a host-side precondition in computeCountAndIndice:

 void computeCountAndIndice( /* args... */ )
 {
+    TLLM_CHECK_WITH_INFO(slotCount % rankCount == 0, "slotCount (%d) must be divisible by epSize (%d).",
+                         slotCount, rankCount);
     // existing code ...
 }
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (2)

1298-1303: Static caching of max dynamic SMEM may be device-dependent.

static int maxDynamicShmSize = computeMoeAlltoallMaxDynamicSharedMemorySize(); will cache the first device’s value. If the active device can change, recompute per call or per device.

Apply this diff:

-    static int maxDynamicShmSize = fused_moe_impl::computeMoeAlltoallMaxDynamicSharedMemorySize();
-    int groupCountPerCta = std::min(maxGroupCountPerCta, maxDynamicShmSize / warpShmSize);
+    int maxDynamicShmSize = fused_moe_impl::computeMoeAlltoallMaxDynamicSharedMemorySize();
+    int groupCountPerCta = std::min(maxGroupCountPerCta, maxDynamicShmSize / warpShmSize);

1322-1326: Optional: avoid magic number 48*1024.

Use a named constexpr (e.g., kDEFAULT_SMEM_PER_BLOCK = 48 * 1024) or query device attribute to decide when to opt-in.

-    if (groupCountPerCta * warpShmSize > 48 * 1024)
+    constexpr int kDEFAULT_SMEM_PER_BLOCK = 48 * 1024;
+    if (groupCountPerCta * warpShmSize > kDEFAULT_SMEM_PER_BLOCK)
     {
         TLLM_CUDA_CHECK(cudaFuncSetAttribute(
             kernelFn, cudaFuncAttributeMaxDynamicSharedMemorySize, groupCountPerCta * warpShmSize));
     }
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between c1a1e71 and b47166c.

📒 Files selected for processing (2)
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1 hunks)
  • cpp/tensorrt_llm/kernels/moePrepareKernels.cu (7 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
  • cpp/tensorrt_llm/kernels/moePrepareKernels.cu
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
  • cpp/tensorrt_llm/kernels/moePrepareKernels.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
  • cpp/tensorrt_llm/kernels/moePrepareKernels.cu
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
  • cpp/tensorrt_llm/kernels/moePrepareKernels.cu
🧠 Learnings (2)
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
📚 Learning: 2025-08-14T15:36:37.576Z
Learnt from: MatthiasKohl
PR: NVIDIA/TensorRT-LLM#6904
File: cpp/tensorrt_llm/kernels/mlaKernels.cu:436-439
Timestamp: 2025-08-14T15:36:37.576Z
Learning: CUDA kernels prioritize performance and should avoid runtime bounds checking or conditional operations that cause branching/warp divergence. Input validation should be done at the host level before kernel launch, not per-thread in the kernel.

Applied to files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (8)
cpp/tensorrt_llm/kernels/moePrepareKernels.cu (5)

126-137: Use consistent per-slot send-count semantics and avoid data races.

This producer writes both the per-target sendCounts[targetRankId] and publishes values through the FIFO. Once CounterCommunicator uses release semantics (previous comment), the consumer will see consistent values. Without those semantics, the consumer may read zero or stale values. Ensure the acquire/release fix is applied before relying on these values.


140-143: Consumer needs acquire semantics and explicit reset ordering.

acquireValue currently does plain loads and resets the slot. Once changed to acquire-load (previous comment), the recvValue path is fine. Keep the reset as a relaxed store after consumption as you have it. No additional barriers needed here.

Also applies to: 160-174


267-271: Kernel specialization and launch selection look good.

TopK tiling and function pointer selection is clear and bounded by 1 <= topK <= 32. Grid sizing for senders vs receivers is coherent.

Also applies to: 300-302


322-329: memsetExpertIds now writes to expertIds (not recvExperts) — verify downstream assumptions.

Switching the destination to expertIds looks intentional for padding with slotCount. Ensure downstream kernels and host code treat slotCount as an out-of-range sentinel and not a valid slot.
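
To make the sentinel contract concrete, a hedged sketch of what consumers could do (names are hypothetical, not taken from the code under review):

# expert_ids: int32 [num_tokens, top_k]; padded rows hold slot_count
valid_mask = expert_ids < slot_count
num_valid_entries = int(valid_mask.sum().item())  # sentinel entries are excluded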


331-334: Verify MoeCommFifoConnInfo layout before sizing workspace

File: cpp/tensorrt_llm/kernels/moePrepareKernels.cu
Lines: 331-334

size_t getMoePrepareWorkspaceSize(int epSize)
{
    return sizeof(MoeCommFifoConnInfo) * epSize;
}

Manual validation required:
Sizing the workspace as sizeof(MoeCommFifoConnInfo) * epSize only covers the struct’s header. It assumes that:

  • MoeCommFifoConnInfo::values is an in-place, fixed-size array large enough for communicationCount = expertCount + 1 slots, and
  • All calls to getFifoConnInfo(..., slotIndex, slotCount) use slotIndex within that array’s bounds.

I wasn’t able to locate a local definition of MoeCommFifoConnInfo—please confirm that both conditions hold. If values is instead a flexible array member or externally allocated, update this function to include the full per-slot storage, for example:

sizeof(MoeCommFifoConnInfo) + sizeof(ValueType) * (expertCount + 1)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (3)

352-361: Verify LL128 finish-sentinel logic.

finish = (value == (step & (1ULL << 63ULL))) compares against either 0 or 0x8000... depending on step’s MSB, but I don’t see a write path that stores such a sentinel. If finish packets are intended, clarify the encoding (e.g., value == (step | (1ULL << 63)) or a dedicated tag) and add the matching write in protoPack or the sender path.

Would you like me to draft a consistent finish-sentinel scheme for LL128 and wire it through pack/unpack?


892-901: Receiver updates both sender and receiver tails — confirm intended ownership.

updateReadEntry() writes mReceiverSideFifoInfo->tail and mSenderSideFifoInfo->tail. If the sender-side tail is supposed to be updated only by the sender, this can cause contention or invariant violations in multi-SM/multi-CTA settings. If both infos are colocated in the same address space (loopback/local), this may be deliberate. Please confirm.


1616-1693: Local FIFO test harness is clear and useful.

The local send/recv harness parameterization (HAS_BASIC_FIELD template, dynamic shared memory sizing, index mappings) looks solid and should aid regression testing.

@tensorrt-cicd
Collaborator

PR_Github #15571 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15571 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11726 completed with status: 'FAILURE'

@dongxuy04 dongxuy04 force-pushed the user/dongxuy/a2a_opt branch from b47166c to c9360c9 on August 18, 2025 07:14
@dongxuy04
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15609 [ run ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🔭 Outside diff range comments (1)
cpp/tensorrt_llm/thop/moeCommOp.cpp (1)

174-183: Validate expertsStatics when provided

If expertsStatics is present, ensure dtype and shape are as expected before passing raw pointers into kernels.

-    if (expertsStatics.has_value())
+    if (expertsStatics.has_value())
     {
+        CHECK_INPUT(expertsStatics.value(), torch::kInt32);
+        TORCH_CHECK(
+            expertsStatics.value().dim() == 1 && expertsStatics.value().size(0) == expertCount,
+            "experts_statics must be a 1D int32 tensor of length expert_count (", expertCount, ")");
         localExpertStaticsPtr = expertsStatics.value().data_ptr<int>();
         gatheredExpertStatics = torch::empty({epSize, expertCount}, expertsIds.options().dtype(torch::kInt32));
         gatheredExpertStaticsPtr = gatheredExpertStatics.value().data_ptr<int>();
     }
♻️ Duplicate comments (9)
tests/unittest/_torch/thop/test_moe_alltoall.py (1)

319-435: Remove duplicate test class definition.

The TestMoeAlltoAllFP8SingleGPU class is defined twice (lines 319 and 919), which will cause a redefinition error. The second definition should be removed.

Remove the duplicate class definition starting at line 919:

-class TestMoeAlltoAllFP8SingleGPU(unittest.TestCase):
-
-    def setUp(self):
-        torch.manual_seed(0x1234)
-        tllm.logger.set_level('error')
-
-    def test_moe_alltoall_fp8_with_indices(self):
-        """Test fp8 alltoall with properly constructed indices"""
-        torch.cuda.set_device(0)
-        
-        # Match dimensions from the error
-        input_entry_count = 16384
-        output_entry_count = 16384
-        vector_dim = 2944
-        sf_vector_dim = 92  # Scaling factor dimension from error
-        send_recv_count = 1000  # Number of entries to send/receive
-        
-        # ... rest of the duplicate implementation ...

Also applies to: 919-1035

tensorrt_llm/_mnnvl_utils.py (1)

529-599: Fix return type inconsistency.

When x is a single tensor input, the function converts it to a list for unified handling. However, at line 597, when returning for a single tensor case, it returns [result[0]] which is still a list containing one element. According to the function signature -> Union[torch.Tensor, List[torch.Tensor]], it should return a single tensor when the input was a single tensor.

Apply this diff to fix the return type:

-        # If input was a single tensor, return a single tensor
+        # If input was a single tensor, return a single tensor
         if is_single_tensor:
-            result = [result[0]]
+            result = result[0]
 
         return result

Additionally, update the type hints to reflect support for None entries:

-def mnnvl_moe_alltoallv(
-        x: Union[torch.Tensor, List[torch.Tensor]],
+def mnnvl_moe_alltoallv(
+        x: Union[torch.Tensor, List[Optional[torch.Tensor]]],
         alltoall_info: MoEAlltoallInfo,
         workspace: torch.Tensor,
         ep_rank: int,
         ep_size: int,
-    ) -> Union[torch.Tensor, List[torch.Tensor]]:
+    ) -> Union[torch.Tensor, List[Optional[torch.Tensor]]]:
cpp/tensorrt_llm/kernels/moePrepareKernels.cu (1)

60-77: Memory ordering concerns with volatile operations.

The CounterCommunicator class uses plain volatile stores/loads for inter-CTA communication. Beyond the mismatch between the uint64_t value interface and the int storage in values[], the lack of proper memory ordering could lead to visibility issues where CTAs spin on stale values.

Consider using proper release/acquire semantics:

 __forceinline__ __device__ void releaseValue(uint64_t value, int index)
 {
     // Avoid block on 0
-    fifoConnInfo->values[index] = value + 1;
+    int v = static_cast<int>(value + 1);
+    // Release store to make writes visible across CTAs
+    asm volatile("st.release.sys.global.s32 [%0], %1;" :: "l"(&fifoConnInfo->values[index]), "r"(v) : "memory");
 }

 __forceinline__ __device__ uint64_t acquireValue(int index)
 {
-    uint64_t localValue = 0;
+    int localValue = 0;
     do
     {
-        localValue = fifoConnInfo->values[index];
+        // Acquire load to observe prior release
+        asm volatile("ld.acquire.sys.global.s32 %0, [%1];" : "=r"(localValue) : "l"(&fifoConnInfo->values[index]));
     } while (localValue == 0);

     fifoConnInfo->values[index] = 0; // reset the value
-    return localValue - 1;
+    return static_cast<uint64_t>(localValue - 1);
 }
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)

46-175: Fix architecture guards to prevent unsupported PTX on lower compute capabilities.

The current preprocessor guards use #if defined(__CUDACC__) && __CUDA_ARCH__ >= XXX, but this is incorrect. defined(__CUDACC__) only indicates that nvcc is compiling the file (it is set for both host and device passes) and says nothing about the target architecture, while __CUDA_ARCH__ is undefined during the host pass. Guarding this way could allow unsupported PTX instructions to be compiled for lower compute capabilities.

Apply these fixes to ensure proper architecture guards:

 __device__ __forceinline__ void mbarrier_init(uint64_t* addr, uint32_t const& count)
 {
-#if defined(__CUDACC__) && __CUDA_ARCH__ >= 800
+#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800)
     asm("mbarrier.init.shared.b64 [%0], %1;" : : "r"(__as_ptr_smem(addr)), "r"(count) : "memory");
 #endif
 }

 __device__ __forceinline__ void mbarrier_expect_tx(uint64_t* addr, const uint32_t txCount)
 {
-#if defined(__CUDACC__) && __CUDA_ARCH__ >= 900
+#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
     asm("mbarrier.expect_tx.relaxed.cta.shared::cta.b64 [%0], %1;"
         :
         : "r"(__as_ptr_smem(addr)), "r"(txCount)
         : "memory");
 #endif
 }

Apply similar fixes to all other architecture-gated functions (mbarrier_arrive, mbarrier_arrive_expect_tx, mbarrier_try_wait_parity, ldgsts, cp_async_commit_group, cp_async_wait_group, cp_async_bulk_g2s, cp_async_bulk_s2g, cp_async_bulk_commit_group, cp_async_bulk_wait_group, cp_async_bulk_wait_group_read).

cpp/tensorrt_llm/thop/moeCommOp.cpp (3)

209-227: Fix misleading error text and strengthen guards in memsetExpertIds

The dim check says “1D” while enforcing 2D. Also add basic param guards.

 void memsetExpertIds(torch::Tensor expertsIds, torch::Tensor recvRankCountCumSum, int64_t maxTokenCountPerRank,
     int64_t topK, int64_t slotCount, int64_t epSize)
 {
     CHECK_INPUT(expertsIds, torch::kInt32);
-    TORCH_CHECK(expertsIds.dim() == 2, "expertsIds must be a 1D tensor");
+    TORCH_CHECK(expertsIds.dim() == 2, "expertsIds must be a 2D tensor [max_token_count_per_rank*ep_size, top_k]");
     TORCH_CHECK(
         expertsIds.size(0) == maxTokenCountPerRank * epSize, "expertsIds must have maxTokenCountPerRank * epSize rows");
     TORCH_CHECK(expertsIds.size(1) == topK, "expertsIds must have topK columns");
 
     CHECK_INPUT(recvRankCountCumSum, torch::kInt32);
     TORCH_CHECK(recvRankCountCumSum.dim() == 1, "recvRankCountCumSum must be a 1D tensor");
     TORCH_CHECK(recvRankCountCumSum.size(0) == epSize, "recvRankCountCumSum must have epSize elements");
+    TORCH_CHECK(maxTokenCountPerRank > 0 && topK > 0 && slotCount > 0 && epSize > 0,
+        "max_token_count_per_rank, top_k, slot_count, ep_size must be positive");
 
     auto stream = at::cuda::getCurrentCUDAStream();

70-76: Enforce fieldCount ≤ MOE_COMM_FIELD_MAX_COUNT to avoid out-of-bounds on fieldsInfo[]

fieldsInfo has a compile-time max (8). Guard fieldCount to prevent OOB writes.

-    int fieldCount = inputs.size();
+    int fieldCount = inputs.size();
+    TORCH_CHECK(
+        fieldCount <= tensorrt_llm::kernels::MOE_COMM_FIELD_MAX_COUNT,
+        "Number of fields (", fieldCount, ") exceeds maximum allowed (",
+        tensorrt_llm::kernels::MOE_COMM_FIELD_MAX_COUNT, ")");

95-99: topK is not initialized; pass 0 explicitly (no basic fields) to fillMetaInfo

params.expertParallelInfo.topK is never set. Passing its default to fillMetaInfo is fragile; for fused comm here we don’t use basic fields. Pass 0 explicitly and set the struct for clarity.

     params.recvFieldInfo = recvFieldInfo;
     // Do not need expertParallelInfo for fused moe comm now
+    params.expertParallelInfo.topK = 0;
 
-    params.sendFieldInfo.fillMetaInfo(&(params.sendCommMeta), params.expertParallelInfo.topK, false, false);
-    params.recvFieldInfo.fillMetaInfo(&(params.recvCommMeta), params.expertParallelInfo.topK, false, false);
+    params.sendFieldInfo.fillMetaInfo(&(params.sendCommMeta), /*topK=*/0, false, false);
+    params.recvFieldInfo.fillMetaInfo(&(params.recvCommMeta), /*topK=*/0, false, false);
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (2)

1-15: Switch to SPDX header format (consistent with project headers)

Replace the legacy Apache header with SPDX lines as used elsewhere.

-/*
- * Copyright (c) 2022-2025, NVIDIA CORPORATION.  All rights reserved.
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */

148-151: random_shuffle removed in C++17; use std::shuffle with a real RNG

This will not compile under C++17. Replace with std::shuffle and drop the extra std::srand.

         // Shuffle unused indices for random assignment
-        std::srand(42); // Fixed seed for reproducible tests
-        std::random_shuffle(unusedIndices.begin(), unusedIndices.end());
+        std::mt19937 gen(42); // Fixed seed for reproducible tests
+        std::shuffle(unusedIndices.begin(), unusedIndices.end(), gen);
🧹 Nitpick comments (10)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)

465-475: Clarify statistics gathering on the last call

The implementation accumulates local expert statistics on every chunk (and repeat) via update_local_statistic (when updates_enabled), storing counts in local_statistic_tensor, and only calls get_local_statistic_tensor when is_last_call to retrieve the fully aggregated tensor for the subsequent AlltoAll. This avoids redundant communications while ensuring a complete view across all repeats. Static routing (is_static_routing == True) correctly skips both updates and gathering.

• File tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py, around line 465:
Add a comment above the if is_last_call and … block explaining that

  1. local_statistic_tensor has accumulated stats from all prior calls, and
  2. gathering is deferred to the final call by design to minimize communication overhead.

No changes to the logic itself are required.
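
One possible wording for that comment, sketched with placeholder names (the guard condition and accessor are paraphrased from this review, not copied from the module):

if is_last_call and not is_static_routing:
    # local_statistic_tensor already holds per-expert counts accumulated across
    # every previous chunk/repeat via update_local_statistic(); gathering is
    # deferred to this final call by design so the statistics AlltoAll runs once.
    local_statistics = load_balancer.get_local_statistic_tensor()  # placeholder receiver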

cpp/tensorrt_llm/thop/moeCommOp.cpp (3)

81-88: Optional: Validate each input/output tensor is CUDA and contiguous

setMoeCommFieldInfo asserts 2D shape, but neither CUDA placement nor contiguity are asserted. A bad placement/stride can miscompute element sizes and lead to memory faults at kernel time.

     for (int i = 0; i < fieldCount; i++)
     {
-        setMoeCommFieldInfo(sendFieldInfo.fieldsInfo[i], inputs[i]);
-        setMoeCommFieldInfo(recvFieldInfo.fieldsInfo[i], outputs[i]);
+        TORCH_CHECK(inputs[i].is_cuda() && outputs[i].is_cuda(), "inputs/outputs must be CUDA tensors");
+        TORCH_CHECK(inputs[i].is_contiguous() && outputs[i].is_contiguous(), "inputs/outputs must be contiguous");
+        setMoeCommFieldInfo(sendFieldInfo.fieldsInfo[i], inputs[i]);
+        setMoeCommFieldInfo(recvFieldInfo.fieldsInfo[i], outputs[i]);
     }

153-155: Remove unused preparedLocalExpertIds allocation

This tensor is allocated but never used. Drop it to save memory and avoid confusion.

-    torch::Tensor preparedLocalExpertIds
-        = torch::empty({maxTokenCountPerRank * epSize, topK}, expertsIds.options().dtype(torch::kInt32));
+    // preparedLocalExpertIds removed: not used in fused prepare workflow

157-158: Nit: Use lowerCamelCase for local variable names

RecvRankCountCumSum starts with an uppercase letter, while other locals use lowerCamelCase. Consider recvRankCountCumSum for consistency.

cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (6)

49-58: Avoid C RNG in tests; don’t seed global state

Use a local RNG to avoid global state and improve reproducibility. Remove std::srand here.

     void SetUp() override
     {
         if (shouldSkip())
         {
             skipped = true;
             GTEST_SKIP() << "Skipping due to no/unsupported GPU";
         }
         TLLM_CUDA_CHECK(cudaStreamCreate(&stream));
-        std::srand(42); // Initialize random seed
     }

68-70: Add a per-fixture RNG

Use a member engine to power all random data generation paths.

     bool skipped = false;
     cudaStream_t stream = nullptr;
+    std::mt19937 rng{42}; // Fixed seed for reproducible tests

88-104: Replace rand() with distributions

rand() is non-deterministic across platforms and discouraged. Use the fixture RNG with type-appropriate distributions.

         else
         {
             // Default initialization with random values
-            for (size_t i = 0; i < count; i++)
-            {
-                if constexpr (std::is_same_v<T, float>)
-                {
-                    (*hostPtr)[i] = static_cast<float>(rand()) / RAND_MAX * 10.0f;
-                }
-                else if constexpr (std::is_same_v<T, int>)
-                {
-                    (*hostPtr)[i] = rand() % 1000;
-                }
-                else
-                {
-                    (*hostPtr)[i] = static_cast<T>(rand() % 100);
-                }
-            }
+            std::uniform_real_distribution<float> floatDist(0.0f, 10.0f);
+            std::uniform_int_distribution<int> intDist(0, 999);
+            std::uniform_int_distribution<int> byteDist(0, 99);
+            for (size_t i = 0; i < count; i++)
+            {
+                if constexpr (std::is_same_v<T, float>)
+                {
+                    (*hostPtr)[i] = floatDist(rng);
+                }
+                else if constexpr (std::is_same_v<T, int>)
+                {
+                    (*hostPtr)[i] = intDist(rng);
+                }
+                else
+                {
+                    (*hostPtr)[i] = static_cast<T>(byteDist(rng));
+                }
+            }
         }

153-163: Handle exhausted unusedIndices and fix signed/unsigned mismatch

Prevent leaving -1 in fullMapping if unusedIndices runs out; also use size_t for the index to avoid signed/unsigned warnings.

-        int unusedIdx = 0;
+        size_t unusedIdx = 0;
         for (int i = 0; i < totalSize; i++)
         {
             if (i < providedSize && fullMapping[i] == -1)
             {
                 // Fix invalid mapping
-                if (unusedIdx < unusedIndices.size())
+                if (unusedIdx < unusedIndices.size())
                 {
                     fullMapping[i] = unusedIndices[unusedIdx++];
                 }
+                else
+                {
+                    // No more unused indices available, use identity as fallback
+                    fullMapping[i] = i;
+                }
             }
             else if (i >= providedSize)
             {
                 // Extend mapping
                 if (unusedIdx < unusedIndices.size())
                 {
                     fullMapping[i] = unusedIndices[unusedIdx++];
                 }
                 else
                 {
                     // Fallback: identity mapping for remaining
                     fullMapping[i] = i;
                 }
             }

473-476: Avoid strict-aliasing UB when packing float into int

Use memcpy instead of pointer punning.

             else if (hasScales && offsetInWarp < topK * 2)
             {
                 // Scales area
-                float scale
-                    = 1.0f + static_cast<float>(warpIdx) * 0.1f + static_cast<float>(offsetInWarp - topK) * 0.01f;
-                return *reinterpret_cast<int*>(&scale);
+                float scale = 1.0f + static_cast<float>(warpIdx) * 0.1f
+                    + static_cast<float>(offsetInWarp - topK) * 0.01f;
+                int packed;
+                std::memcpy(&packed, &scale, sizeof(packed));
+                return packed;
             }

757-767: Optional: Prefer RAII for host buffers to simplify cleanup

Using std::vector or smart pointers prevents leaks on early returns and is clearer.

-        if (resultRecvTokenSlots)
-            delete[] resultRecvTokenSlots;
-        if (resultRecvScales)
-            delete[] resultRecvScales;
-        for (int i = 0; i < fieldCount; i++)
-        {
-            if (resultRecvFields[i])
-                delete[] resultRecvFields[i];
-        }
+        std::unique_ptr<int[]> resultRecvTokenSlotsPtr(resultRecvTokenSlots);
+        std::unique_ptr<float[]> resultRecvScalesPtr(resultRecvScales);
+        for (int i = 0; i < fieldCount; i++)
+        {
+            std::unique_ptr<uint8_t[]> fieldGuard(resultRecvFields[i]);
+        }
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between b47166c and c9360c9.

📒 Files selected for processing (16)
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1 hunks)
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (1 hunks)
  • cpp/tensorrt_llm/kernels/moeCommKernels.cu (0 hunks)
  • cpp/tensorrt_llm/kernels/moeCommKernels.h (0 hunks)
  • cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h (1 hunks)
  • cpp/tensorrt_llm/kernels/moePrepareKernels.cu (7 hunks)
  • cpp/tensorrt_llm/kernels/moePrepareKernels.h (3 hunks)
  • cpp/tensorrt_llm/thop/moeCommOp.cpp (6 hunks)
  • cpp/tensorrt_llm/thop/moeLoadBalanceOp.cpp (0 hunks)
  • cpp/tests/unit_tests/kernels/CMakeLists.txt (1 hunks)
  • cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (1 hunks)
  • tensorrt_llm/_mnnvl_utils.py (5 hunks)
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (0 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (5 hunks)
  • tests/unittest/_torch/thop/test_moe_alltoall.py (8 hunks)
💤 Files with no reviewable changes (4)
  • cpp/tensorrt_llm/thop/moeLoadBalanceOp.cpp
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • cpp/tensorrt_llm/kernels/moeCommKernels.cu
  • cpp/tensorrt_llm/kernels/moeCommKernels.h
🚧 Files skipped from review as they are similar to previous changes (5)
  • cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h
  • cpp/tests/unit_tests/kernels/CMakeLists.txt
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • cpp/tensorrt_llm/kernels/moePrepareKernels.h
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tests/unittest/_torch/thop/test_moe_alltoall.py
  • tensorrt_llm/_mnnvl_utils.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tests/unittest/_torch/thop/test_moe_alltoall.py
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
  • cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
  • cpp/tensorrt_llm/thop/moeCommOp.cpp
  • tensorrt_llm/_mnnvl_utils.py
  • cpp/tensorrt_llm/kernels/moePrepareKernels.cu
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
  • cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
  • cpp/tensorrt_llm/thop/moeCommOp.cpp
  • cpp/tensorrt_llm/kernels/moePrepareKernels.cu
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
  • cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
  • cpp/tensorrt_llm/thop/moeCommOp.cpp
  • cpp/tensorrt_llm/kernels/moePrepareKernels.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
  • cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
  • cpp/tensorrt_llm/thop/moeCommOp.cpp
  • cpp/tensorrt_llm/kernels/moePrepareKernels.cu
🧠 Learnings (4)
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • tests/unittest/_torch/thop/test_moe_alltoall.py
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
  • cpp/tensorrt_llm/thop/moeCommOp.cpp
📚 Learning: 2025-08-14T15:36:37.576Z
Learnt from: MatthiasKohl
PR: NVIDIA/TensorRT-LLM#6904
File: cpp/tensorrt_llm/kernels/mlaKernels.cu:436-439
Timestamp: 2025-08-14T15:36:37.576Z
Learning: CUDA kernels prioritize performance and should avoid runtime bounds checking or conditional operations that cause branching/warp divergence. Input validation should be done at the host level before kernel launch, not per-thread in the kernel.

Applied to files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
📚 Learning: 2025-08-12T10:28:57.320Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-12T10:28:57.320Z
Learning: Applies to **/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py} : Prepend NVIDIA copyright header (current year) to all source files

Applied to files:

  • cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • cpp/tensorrt_llm/thop/moeCommOp.cpp
🧬 Code Graph Analysis (6)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (10)
  • uint64_t (494-499)
  • int (364-384)
  • BYTES_PER_128B_BLOCK (69-107)
  • SenderSideFifoInfo (30-34)
  • ReceiverSideFifoInfo (36-40)
  • getLaunchBlockDim (315-318)
  • getLaunchGridDim (320-329)
  • setMaxUsableSmCount (265-277)
  • getMoeCommChannelCount (302-313)
  • computeWorkspaceSizePreRank (520-526)
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (3)
cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h (1)
  • kernels (22-46)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (2)
  • kernels (27-207)
  • computeWorkspaceSizePreRank (520-526)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (12)
  • tokenCount (919-1003)
  • tokenCount (919-919)
  • tokenCount (1015-1126)
  • tokenCount (1015-1015)
  • launchSingleG2S (1439-1454)
  • launchSingleG2S (1439-1440)
  • launchSingleS2G (1492-1507)
  • launchSingleS2G (1492-1493)
  • launchLoopback (1599-1620)
  • launchLoopback (1599-1601)
  • launchLocalFifoSendRecv (1661-1693)
  • launchLocalFifoSendRecv (1661-1664)
cpp/tensorrt_llm/thop/moeCommOp.cpp (2)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (8)
  • constructWorkspace (1369-1374)
  • constructWorkspace (1369-1369)
  • moeAllToAll (1290-1352)
  • moeAllToAll (1290-1290)
  • getFusedMoeCommWorkspaceSize (1362-1367)
  • getFusedMoeCommWorkspaceSize (1362-1362)
  • initializeFusedMoeLocalWorkspace (1376-1379)
  • initializeFusedMoeLocalWorkspace (1376-1376)
cpp/tensorrt_llm/kernels/moePrepareKernels.cu (2)
  • memsetExpertIds (322-329)
  • memsetExpertIds (322-323)
tensorrt_llm/_mnnvl_utils.py (2)
tensorrt_llm/_torch/distributed/communicator.py (1)
  • tp_size (46-47)
tensorrt_llm/python_plugin.py (4)
  • workspace (331-333)
  • workspace (336-338)
  • shape (246-247)
  • shape (250-255)
cpp/tensorrt_llm/kernels/moePrepareKernels.cu (3)
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (8)
  • topK (187-286)
  • topK (187-189)
  • topK (334-455)
  • topK (334-336)
  • topK (533-658)
  • topK (533-535)
  • topK (987-1139)
  • topK (987-990)
cpp/include/tensorrt_llm/common/cudaUtils.h (1)
  • getMultiProcessorCount (393-400)
cpp/tensorrt_llm/kernels/moePrepareKernels.h (1)
  • MoeCommFifoConnInfo (43-48)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
tensorrt_llm/_mnnvl_utils.py (8)
  • MnnvlMemory (53-336)
  • initialize (91-100)
  • MnnvlMoe (350-628)
  • get_moe_workspaces (358-373)
  • get_moe_prepare_workspace (376-387)
  • mnnvl_moe_alltoallv_prepare_without_allgather (399-443)
  • MoEAlltoallInfo (340-347)
  • mnnvl_moe_alltoallv (528-599)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (3)
  • is_static_routing (350-351)
  • is_static_routing (759-761)
  • get_local_statistic_tensor (563-577)
🪛 Ruff (0.12.2)
tests/unittest/_torch/thop/test_moe_alltoall.py

919-919: Redefinition of unused TestMoeAlltoAllFP8SingleGPU from line 319

(F811)

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py

465-465: Line too long (124 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (25)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (5)

196-201: LGTM! Workspace initialization logic is well-structured.

The workspace initialization for MNNVL mode now unconditionally retrieves both the main workspace and prepare workspace, which aligns with the two-stage alltoall flow introduced in this PR.


298-301: LGTM! Simplified MNNVL alltoall enablement.

The method now always returns True for MNNVL mode, which is consistent with the new requirement that MNNVL always uses the alltoall path.


580-585: LGTM! Clean two-stage alltoall dispatch implementation.

The dispatch phase correctly passes the required parameters including the alltoall_info from the prepare stage, maintaining the two-stage flow architecture.


867-878: LGTM! Well-structured alltoall_prepare method.

The method correctly delegates to the MNNVL prepare function and returns the required tuple of (token_selected_slots, gathered_local_statistic_tensor, alltoall_info) for the two-stage flow.


879-895: LGTM! Complete alltoall_dispatch implementation.

The dispatch method properly handles the alltoall operation and updates expert IDs using the new memset_expert_ids operation, completing the two-stage flow.

tests/unittest/_torch/thop/test_moe_alltoall.py (5)

81-84: Consistent workspace sizing across test cases.

The workspace initialization now uses workspace_size // 8 for uint64 allocation, which is consistent with the new multi-tensor support. The subsequent call to moe_initialize_workspace properly initializes the workspace for rank 0.


86-88: LGTM! Correct multi-tensor API usage.

The test correctly wraps input and output tensors in lists to match the new multi-tensor API, ensuring backward compatibility for single-tensor cases.


117-126: Consistent multi-tensor handling in warmup.

The warmup function correctly uses the list-based API for tensors, maintaining consistency with the main test logic.


160-180: Well-structured multi-tensor test setup.

The test properly creates separate tensors for each dimension in vector_dims, preparing input, output, and reference tensors for each, which thoroughly exercises the multi-tensor path.


282-284: Proper per-rank workspace initialization.

Each rank's workspace is initialized with moe_initialize_workspace, ensuring proper setup for multi-rank communication.

tensorrt_llm/_mnnvl_utils.py (2)

369-372: LGTM! Proper workspace initialization sequence.

The workspace initialization correctly calls moe_initialize_workspace with the appropriate rank and size parameters, followed by a barrier to ensure all ranks complete initialization before proceeding.


616-616: LGTM! Consistent list-based API for combine operation.

The moe_comm calls now correctly wrap tensors in lists, aligning with the multi-tensor API pattern introduced throughout the codebase.

Also applies to: 619-619

cpp/tensorrt_llm/kernels/moePrepareKernels.cu (4)

84-138: LGTM! Clean per-slot statistics implementation.

The refactored computeCountAndSendStatics function properly handles per-slot statistics distribution, with clear separation between count and expert statistics communication.


140-182: LGTM! Consistent receiver-side statistics handling.

The recvCountAndStatics function correctly mirrors the sender-side implementation, properly handling both count and expert statistics reception.


322-329: LGTM! Simplified memset operation.

The memsetExpertIds function correctly uses the expert IDs directly as the destination, aligning with the simplified per-slot workflow.


333-334: LGTM! Clean workspace size calculation.

The workspace size calculation is now simplified to sizeof(MoeCommFifoConnInfo) * epSize, which is cleaner and more maintainable than the previous META_SIZE/FIFO-based formula.

cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (6)

219-327: LGTM! Well-designed Lamport protocol implementation.

The LamportProto class provides a clean implementation of the Lamport protocol with proper invalid value handling, pack/unpack operations, and rearm functionality. The in-place nature of the protocol (no additional overhead) is efficiently implemented.


329-477: LGTM! Efficient LL128 protocol implementation.

The Ll128Proto class provides an efficient 128-byte aligned protocol with proper tail handling. The 15:16 packing ratio is well-implemented with clear logic for flag placement and tail value management.
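
For readers new to LL128-style protocols, the 15:16 ratio works out as follows (a back-of-envelope interpretation, assuming one 8-byte word per 128-byte line is reserved for the flag/tail word; not code from the kernel):

line_bytes = 128
word_bytes = 8
words_per_line = line_bytes // word_bytes   # 16
payload_words = words_per_line - 1          # 15 -> the 15:16 packing ratio
payload_bytes = payload_words * word_bytes  # 120 of every 128 bytes carry data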


807-1156: LGTM! Comprehensive SingleChannelCommunicator implementation.

The SingleChannelCommunicator template class provides a complete implementation for managing per-channel FIFO transfers with proper synchronization, entry management, and protocol handling. The separation of sender and receiver logic is clean and maintainable.


1158-1223: LGTM! Well-structured kernel with proper field count specialization.

The moeAllToAllKernel properly handles variable field counts through template specialization and correctly manages sender/receiver differentiation.


1276-1288: LGTM! Proper workspace initialization.

The initializeLocalWorkspace method correctly initializes the FIFO with the protocol's initialized value and zeros out the info structures.


1290-1352: LGTM! Robust kernel launch configuration.

The moeAllToAll function properly determines the optimal group count per CTA based on available shared memory and occupancy constraints, with appropriate fallback logic.

cpp/tensorrt_llm/thop/moeCommOp.cpp (3)

100-103: Confirm rankStrideInU64 is in 64-bit element units

constructWorkspace expects rankStrideInU64 measured in uint64 units. With the dtype check added earlier, allWorkspaces.stride(0) now correctly represents u64 elements. No change needed beyond the dtype check.
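A quick standalone check of why stride(0) on a uint64 tensor is already an element (u64) count rather than a byte count:

import torch

all_workspaces = torch.zeros(4, 1024, dtype=torch.uint64)
stride_in_u64 = all_workspaces.stride(0)                         # 1024 elements
stride_in_bytes = stride_in_u64 * all_workspaces.element_size()  # 8192 bytes
print(stride_in_u64, stride_in_bytes)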


190-195: Good parameter threading into computeCountAndIndice

Arguments passed match kernel signature and earlier allocations. With the added shape/dtype checks, this call site looks solid.


205-207: Tuple return order is clear and aligns with the new fused workflow

The new order foregrounds send/recv counts and gathered indices. LGTM.

@tensorrt-cicd
Collaborator

PR_Github #15609 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11749 completed with status: 'FAILURE'

@dongxuy04 dongxuy04 changed the title [None][feat]Optimize and refactor alltoall in WideEP [TRTLLM-7269][TRTLLM-6743][feat]Optimize and refactor alltoall in WideEP Aug 19, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (6)
cpp/tensorrt_llm/thop/moeCommOp.cpp (6)

38-46: Validate dtype for allWorkspaces before data_ptr<uint64_t>().

We dereference allWorkspaces as 64-bit elements later. Add a dtype check to prevent UB if callers pass a different dtype.

Apply this diff:

     CHECK_INPUT(sendRankCumSum, torch::kInt32);
     CHECK_INPUT(sendIndiceTensor, torch::kInt32);
     CHECK_INPUT(recvRankCumSum, torch::kInt32);
     CHECK_INPUT(recvIndiceTensor, torch::kInt32);
+    CHECK_INPUT(allWorkspaces, torch::kInt64);

126-139: Initialize workspace: validate allWorkspaces dtype.

Same precondition as moeCommOp; ensure allWorkspaces holds 64-bit words before taking data_ptr<uint64_t>().

Apply this diff:

 void initializeMoeWorkspace(torch::Tensor allWorkspaces, int64_t epRank, int64_t epSize)
 {
     TORCH_CHECK(allWorkspaces.dim() == 2, "allWorkspaces must be a 2D tensor");
     TORCH_CHECK(epRank >= 0 && epRank < epSize, "epRank must be in the range [0, epSize)");
+    CHECK_INPUT(allWorkspaces, torch::kInt64);

141-151: Add shape/dtype guards for expertsIds and allWorkspaces.

computeCountAndIndice expects [tokenCount, topK] expertsIds and 64-bit workspaces.

Apply this diff:

     CHECK_INPUT(expertsIds, torch::kInt32);
+    TORCH_CHECK(expertsIds.dim() == 2, "experts_ids must be a 2D tensor [token_count, top_k]");
+    TORCH_CHECK(
+        expertsIds.size(1) == topK,
+        "experts_ids second dimension must equal top_k");
+    CHECK_INPUT(allWorkspaces, torch::kInt64);

209-221: Fix misleading tensor-dim error message.

The check enforces 2D but the message says 1D. Correct the message to avoid confusion.

Apply this diff:

-    TORCH_CHECK(expertsIds.dim() == 2, "expertsIds must be a 1D tensor");
+    TORCH_CHECK(expertsIds.dim() == 2, "expertsIds must be a 2D tensor [maxTokenCountPerRank*epSize, topK]");

70-76: Guard against overflowing the fieldsInfo array.

fieldCount must not exceed the compile-time capacity of FusedMoeFieldInfo::fieldsInfo. Add a check against MOE_COMM_FIELD_MAX_COUNT.

Apply this diff:

-    int fieldCount = inputs.size();
+    int fieldCount = static_cast<int>(inputs.size());
+    TORCH_CHECK(
+        fieldCount <= tensorrt_llm::kernels::MOE_COMM_FIELD_MAX_COUNT,
+        "Number of fields exceeds MOE_COMM_FIELD_MAX_COUNT");

97-99: params.expertParallelInfo.topK is uninitialized before use.

You’re passing an indeterminate value into fillMetaInfo. Either set it or pass a known constant (0 if unused in fused comm meta).

Apply this diff:

-    params.sendFieldInfo.fillMetaInfo(&(params.sendCommMeta), params.expertParallelInfo.topK, false, false);
-    params.recvFieldInfo.fillMetaInfo(&(params.recvCommMeta), params.expertParallelInfo.topK, false, false);
+    params.sendFieldInfo.fillMetaInfo(&(params.sendCommMeta), /*topK=*/0, false, false);
+    params.recvFieldInfo.fillMetaInfo(&(params.recvCommMeta), /*topK=*/0, false, false);

Optionally also initialize the struct field explicitly before the calls:

+    params.expertParallelInfo.topK = 0;
🧹 Nitpick comments (5)
cpp/tensorrt_llm/thop/moeCommOp.cpp (5)

2-2: Update copyright year to include 2025.

Repo guideline says to prepend the current year; header still shows 1993-2024.

Apply this diff:

- * SPDX-FileCopyrightText: Copyright (c) 1993-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

86-87: Clarify magic arguments with inline comments.

Use inline C comments to document the meaning of positional booleans/integers per style guide.

Apply this diff:

-    sendFieldInfo.fillFieldPlacementInfo(0, false);
-    recvFieldInfo.fillFieldPlacementInfo(0, false);
+    sendFieldInfo.fillFieldPlacementInfo(/*groupOffset=*/0, /*isBroadcast=*/false);
+    recvFieldInfo.fillFieldPlacementInfo(/*groupOffset=*/0, /*isBroadcast=*/false);

157-205: Naming consistency: rename RecvRankCountCumSum to lowerCamelCase.

Local variables should be lowerCamelCase. Rename and update the uses in kernel calls and return tuple.

Apply this diff:

-    torch::Tensor RecvRankCountCumSum = torch::empty({epSize}, expertsIds.options().dtype(torch::kInt32));
+    torch::Tensor recvRankCountCumSum = torch::empty({epSize}, expertsIds.options().dtype(torch::kInt32));
@@
-    tensorrt_llm::kernels::moe_prepare::computeCountAndIndice(expertsIds.data_ptr<int>(),
-        sendRankCountCumSum.data_ptr<int>(), RecvRankCountCumSum.data_ptr<int>(), sendRankIndices.data_ptr<int>(),
+    tensorrt_llm::kernels::moe_prepare::computeCountAndIndice(expertsIds.data_ptr<int>(),
+        sendRankCountCumSum.data_ptr<int>(), recvRankCountCumSum.data_ptr<int>(), sendRankIndices.data_ptr<int>(),
@@
-    tensorrt_llm::kernels::moe_prepare::computeCumsum(
-        sendRankCountCumSum.data_ptr<int>(), RecvRankCountCumSum.data_ptr<int>(), epRank, epSize, stream);
+    tensorrt_llm::kernels::moe_prepare::computeCumsum(
+        sendRankCountCumSum.data_ptr<int>(), recvRankCountCumSum.data_ptr<int>(), epRank, epSize, stream);
@@
-    tensorrt_llm::kernels::moe_prepare::moveIndice(sendRankCountCumSum.data_ptr<int>(),
-        RecvRankCountCumSum.data_ptr<int>(), sendRankIndices.data_ptr<int>(), gatherSendRankIndices.data_ptr<int>(),
+    tensorrt_llm::kernels::moe_prepare::moveIndice(sendRankCountCumSum.data_ptr<int>(),
+        recvRankCountCumSum.data_ptr<int>(), sendRankIndices.data_ptr<int>(), gatherSendRankIndices.data_ptr<int>(),
@@
-    return std::make_tuple(sendRankCountCumSum, gatherSendRankIndices, RecvRankCountCumSum, gatherRecvRankIndices,
+    return std::make_tuple(sendRankCountCumSum, gatherSendRankIndices, recvRankCountCumSum, gatherRecvRankIndices,
         gatherBackwardRecvRankIndices, gatheredExpertStatics);

276-279: Unify parameter naming style in schema (allWorkspace → all_workspaces).

Elsewhere you use snake_case; keep it consistent across op schemas.

Apply this diff:

-        "mnnvl_moe_alltoallv_prepare_without_allgather(Tensor experts_ids, Tensor? experts_statics, "
-        "Tensor allWorkspace, int max_token_count_per_rank, int ep_rank, int ep_size, int expert_count, int "
+        "mnnvl_moe_alltoallv_prepare_without_allgather(Tensor experts_ids, Tensor? experts_statics, "
+        "Tensor all_workspaces, int max_token_count_per_rank, int ep_rank, int ep_size, int expert_count, int "

146-149: Avoid magic numbers for expert/slot divisibility and max bound.

Replace 4 and 512 with named constexpr constants to document intent and ease maintenance.

You can add near the top of this file (namespace scope):

namespace {
constexpr int kEXPERT_SLOT_ALIGNMENT = 4;
constexpr int kMAX_EXPERT_PLUS_ONE = 512;
} // anonymous namespace

Then update:

-    TORCH_CHECK(expertCount % 4 == 0, "expertCount must be divisible by 4");
-    TORCH_CHECK(slotCount % 4 == 0, "slotCount must be divisible by 4");
-    TORCH_CHECK(expertCount + 1 <= 512, "expertCount + 1 is larger than 512");
+    TORCH_CHECK(expertCount % kEXPERT_SLOT_ALIGNMENT == 0, "expertCount must be divisible by 4");
+    TORCH_CHECK(slotCount % kEXPERT_SLOT_ALIGNMENT == 0, "slotCount must be divisible by 4");
+    TORCH_CHECK(expertCount + 1 <= kMAX_EXPERT_PLUS_ONE, "expertCount + 1 is larger than 512");
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between c9360c9 and 403a4a9.

📒 Files selected for processing (1)
  • cpp/tensorrt_llm/thop/moeCommOp.cpp (6 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/thop/moeCommOp.cpp
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/thop/moeCommOp.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/thop/moeCommOp.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/thop/moeCommOp.cpp
🧠 Learnings (2)
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • cpp/tensorrt_llm/thop/moeCommOp.cpp
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • cpp/tensorrt_llm/thop/moeCommOp.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@dongxuy04 dongxuy04 changed the title [TRTLLM-7269][TRTLLM-6743][feat]Optimize and refactor alltoall in WideEP [TRTLLM-6743][feat]Optimize and refactor alltoall in WideEP Aug 19, 2025
@dongxuy04 dongxuy04 force-pushed the user/dongxuy/a2a_opt branch from 403a4a9 to e197aa5 on August 19, 2025 09:53
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
@dongxuy04 dongxuy04 force-pushed the user/dongxuy/a2a_opt branch from 8d7c3a5 to b507f72 on August 23, 2025 12:54
@dongxuy04
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #16268 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16257 [ run ] completed with state ABORTED

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (10)
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (9)

49-58: Stop using global C RNG; use a local C++ RNG.

Remove std::srand(42) and add a deterministic engine as a member. This also supports switching rand() to distributions below.

     void SetUp() override
     {
         if (shouldSkip())
         {
             skipped = true;
             GTEST_SKIP() << "Skipping due to no/unsupported GPU";
         }
         TLLM_CUDA_CHECK(cudaStreamCreate(&stream));
-        std::srand(42); // Initialize random seed
     }
@@
-    bool skipped = false;
-    cudaStream_t stream = nullptr;
+    bool skipped = false;
+    cudaStream_t stream = nullptr;
+    std::mt19937 rng{42}; // deterministic engine for tests

Also applies to: 68-70


71-107: Replace rand() with distributions (deterministic, guideline-compliant).

Prefer C++ RNGs over C APIs and avoid implicit RAND_MAX scaling.

         else
         {
             // Default initialization with random values
-            for (size_t i = 0; i < count; i++)
-            {
-                if constexpr (std::is_same_v<T, float>)
-                {
-                    (*hostPtr)[i] = static_cast<float>(rand()) / RAND_MAX * 10.0f;
-                }
-                else if constexpr (std::is_same_v<T, int>)
-                {
-                    (*hostPtr)[i] = rand() % 1000;
-                }
-                else
-                {
-                    (*hostPtr)[i] = static_cast<T>(rand() % 100);
-                }
-            }
+            std::uniform_real_distribution<float> floatDist(0.0f, 10.0f);
+            std::uniform_int_distribution<int> intDist(0, 999);
+            std::uniform_int_distribution<int> byteDist(0, 99);
+            for (size_t i = 0; i < count; i++)
+            {
+                if constexpr (std::is_same_v<T, float>)
+                {
+                    (*hostPtr)[i] = floatDist(rng);
+                }
+                else if constexpr (std::is_same_v<T, int>)
+                {
+                    (*hostPtr)[i] = intDist(rng);
+                }
+                else
+                {
+                    (*hostPtr)[i] = static_cast<T>(byteDist(rng));
+                }
+            }
         }

758-767: Prefer RAII for temporary buffers to simplify cleanup.

Use std::vector or std::unique_ptr for resultRecv* arrays to avoid manual delete[] and reduce error surface.

Example with std::vector:

-        int* resultRecvTokenSlots = nullptr;
+        std::vector<int> resultRecvTokenSlots;
...
-            resultRecvTokenSlots = new int[tokenCount * topK];
-            TLLM_CUDA_CHECK(cudaMemcpy(
-                resultRecvTokenSlots, deviceRecvTokenSlots, tokenCount * topK * sizeof(int), cudaMemcpyDeviceToHost));
+            resultRecvTokenSlots.resize(tokenCount * topK);
+            TLLM_CUDA_CHECK(cudaMemcpy(
+                resultRecvTokenSlots.data(), deviceRecvTokenSlots, tokenCount * topK * sizeof(int), cudaMemcpyDeviceToHost));

Also applies to: 1285-1294


1-15: Switch to SPDX header (project standard).

Replace the legacy Apache header block with the SPDX-style header used elsewhere in this repo.

-/*
- * Copyright (c) 2022-2025, NVIDIA CORPORATION.  All rights reserved.
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
+// SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// SPDX-License-Identifier: Apache-2.0

17-25: Missing headers for used APIs (compile issues + UB fix prerequisites).

  • You call std::random_shuffle; its std::shuffle replacement needs <algorithm>.
  • You use std::memcpy in the recommended fixes; add <cstring>.
  • You use std::is_same_v but <type_traits> is not included (this is a hard compile error).

Add these includes.

 #include <atomic>
 #include <chrono>
 #include <functional>
 #include <gtest/gtest.h>
 #include <memory>
 #include <random>
+#include <algorithm>
+#include <cstring>
+#include <type_traits>
 #include <thread>
 #include <vector>

109-113: UB: deleting T[] via delete[] through a char*; make cleanup type-aware and fix call sites.

Current delete[] static_cast<char*>(hostPtr) is undefined behavior for non-char arrays. Provide a templated deleter and use the correct element type everywhere.

-    void cleanup(void* hostPtr, void* devicePtr)
-    {
-        delete[] static_cast<char*>(hostPtr);
-        TLLM_CUDA_CHECK(cudaFree(devicePtr));
-    }
+    template <typename T>
+    void cleanupArray(T* hostPtr, void* devicePtr)
+    {
+        delete[] hostPtr;
+        TLLM_CUDA_CHECK(cudaFree(devicePtr));
+    }

Apply across call sites (examples):

- cleanup(hostTokenSlots, deviceTokenSlots);
+ cleanupArray<int>(hostTokenSlots, deviceTokenSlots);

- cleanup(hostScales, deviceScales);
+ cleanupArray<float>(hostScales, deviceScales);

- cleanup(hostFieldPtrs[i], deviceFieldPtrs[i]);
+ cleanupArray<uint8_t>(static_cast<uint8_t*>(hostFieldPtrs[i]), deviceFieldPtrs[i]);

- cleanup(hostShmDump, deviceShmDump);
+ cleanupArray<int>(hostShmDump, deviceShmDump);

Run this to list remaining old calls that need updating:

#!/bin/bash
rg -nP 'cleanup\(' cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp

148-151: std::random_shuffle was removed in C++17; use std::shuffle with a seeded engine.

This is a compile error with -std=c++17. Replace with std::shuffle.

-        // Shuffle unused indices for random assignment
-        std::srand(42); // Fixed seed for reproducible tests
-        std::random_shuffle(unusedIndices.begin(), unusedIndices.end());
+        // Shuffle unused indices for random assignment
+        std::mt19937 gen(42); // Fixed seed for reproducible tests
+        std::shuffle(unusedIndices.begin(), unusedIndices.end(), gen);

156-176: Fill invalid mappings when unusedIndices are exhausted to avoid -1 entries.

Prevent leaving -1 in fullMapping, which propagates invalid indices downstream.

             if (i < providedSize && fullMapping[i] == -1)
             {
                 // Fix invalid mapping
-                if (unusedIdx < unusedIndices.size())
+                if (unusedIdx < static_cast<int>(unusedIndices.size()))
                 {
                     fullMapping[i] = unusedIndices[unusedIdx++];
                 }
+                else
+                {
+                    // Fallback: identity mapping
+                    fullMapping[i] = i;
+                }
             }

472-476: Fix strict-aliasing UB when bit-packing float into int.

Replace reinterpret_cast-based type punning with std::memcpy.

-                float scale
-                    = 1.0f + static_cast<float>(warpIdx) * 0.1f + static_cast<float>(offsetInWarp - topK) * 0.01f;
-                return *reinterpret_cast<int*>(&scale);
+                float scale = 1.0f + static_cast<float>(warpIdx) * 0.1f
+                             + static_cast<float>(offsetInWarp - topK) * 0.01f;
+                int bits;
+                static_assert(sizeof(bits) == sizeof(scale), "size mismatch");
+                std::memcpy(&bits, &scale, sizeof(bits));
+                return bits;
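For intuition, the same bit-exact float-to-int copy expressed in Python (illustrative only; the fix above uses std::memcpy on the C++ side):

import struct

scale = 1.23
bits = struct.unpack("<i", struct.pack("<f", scale))[0]   # reinterpret float32 bits as int32
back = struct.unpack("<f", struct.pack("<i", bits))[0]    # recovers the same float32 value
print(hex(bits & 0xFFFFFFFF), back)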
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (1)

16-16: Use project include guard instead of #pragma once.

Per coding guidelines, replace #pragma once with TRTLLM_FUSEDMOECOMMKERNELS_H include guard.

-#pragma once
+#ifndef TRTLLM_FUSEDMOECOMMKERNELS_H
+#define TRTLLM_FUSEDMOECOMMKERNELS_H
@@
-} // namespace kernels
-} // namespace tensorrt_llm
-
+#endif // TRTLLM_FUSEDMOECOMMKERNELS_H

Also applies to: 561-563

🧹 Nitpick comments (19)
tests/unittest/_torch/thop/test_moe_alltoall.py (4)

75-81: Avoid silent truncation when sizing workspaces; compute u64 words with ceiling.

get_moe_commworkspace_size_per_rank() returns bytes. Using // 8 truncates if the size isn’t 8-byte aligned (future changes could break this). Compute words with a ceiling division in all three places and use a clearer variable name.

Apply this diff in the three blocks:

-        workspace_size = torch.ops.trtllm.get_moe_commworkspace_size_per_rank(1)
-        all_workspaces = torch.zeros(1,
-                                     workspace_size // 8,
+        workspace_size_bytes = torch.ops.trtllm.get_moe_commworkspace_size_per_rank(1)
+        workspace_words = (workspace_size_bytes + 7) // 8
+        all_workspaces = torch.zeros(1,
+                                     workspace_words,
                                      dtype=torch.uint64,
                                      device=torch.device('cuda'))

And similarly in do_warmup() and the multi-rank section:

-        workspace_size = torch.ops.trtllm.get_moe_commworkspace_size_per_rank(
-            world_size)
-        all_workspaces = torch.zeros(world_size,
-                                     workspace_size // 8,
+        workspace_size_bytes = torch.ops.trtllm.get_moe_commworkspace_size_per_rank(world_size)
+        workspace_words = (workspace_size_bytes + 7) // 8
+        all_workspaces = torch.zeros(world_size,
+                                     workspace_words,
                                      dtype=torch.uint64,
                                      device=torch.device('cuda'))

Also applies to: 113-119, 254-260


82-89: Clarify/standardize the optional boolean list arg to moe_comm.

Here you pass [True], while the multi-rank calls below omit the flag. If the last argument toggles “combine” or “gathered” mode, keep usage consistent in tests (or add a short comment on its semantics) to reduce reader confusion.


95-113: Warmup defeats stream isolation due to global synchronizes.

do_warmup() calls torch.cuda.synchronize() at entry/exit, so wrapping it in per-rank streams won’t avoid first-launch syncs. Either drop the leading sync or allow an optional stream arg and run the ops on that stream to preserve isolation.

Minimal tweak:

-    def do_warmup(self):
-        torch.cuda.synchronize()
+    def do_warmup(self):
         input_tensor = torch.randn(1,
                                    8,

Also applies to: 119-122


526-526: Wrap overlong lines flagged by Ruff (E501).

A few lines exceed 120 chars. Break them across lines (parentheses) to satisfy style without disabling linting.

Also applies to: 528-528, 606-607

cpp/tensorrt_llm/kernels/moePrepareKernels.h (4)

43-48: Avoid magic-numbered FIFO array size; introduce a named constant.

values[512] is opaque and hard to change. Prefer a constexpr and reuse it for the member.

 namespace moe_prepare
 {
 
-#define UNIT_PER_PIPELINE 128
+#define UNIT_PER_PIPELINE 128
 #define PIPELINE_PER_CTA 4
 #define CUMSUM_THREADS_PER_BLOCK 128
 
 static constexpr int THREADS_PER_PIPELINE = UNIT_PER_PIPELINE;
+static constexpr int kFIFO_VALUES_SIZE = 512;
@@
 struct ALIGN_256 MoeCommFifoConnInfo
 {
     volatile uint64_t head;   // write position
     volatile uint64_t tail;   // read position
-    int volatile values[512]; // for values
+    int volatile values[kFIFO_VALUES_SIZE]; // for values
 };

59-66: Document/validate alignment of workspacePtr before casting to MoeCommFifoConnInfo*.

Given ALIGN_256 on MoeCommFifoConnInfo, consider adding an assertion or a brief comment guaranteeing workspacePtr’s base and rankStrideInU64 keep each instance 256B-aligned to avoid misaligned accesses on device.
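A lightweight way to make that guarantee explicit on the Python side (a sketch of the kind of assertion meant, not existing code):

def check_workspace_alignment(all_workspaces, align=256):
    # MoeCommFifoConnInfo is declared ALIGN_256, so both the base pointer and
    # the per-rank stride (in bytes) must stay 256-byte aligned.
    assert all_workspaces.data_ptr() % align == 0, "workspace base not aligned"
    stride_bytes = all_workspaces.stride(0) * all_workspaces.element_size()
    assert stride_bytes % align == 0, "per-rank stride not aligned"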


1-15: License years may need updating to include 2025.

Repo-wide policy often bumps to the current year on touched files. If that applies here, extend 2022-2024 to 2022-2025.


71-75: Align memsetExpertIds parameter names

The declaration in moePrepareKernels.h uses epSize, while the definition in moePrepareKernels.cu uses rankCount. The other kernels (computeCumsum, moveIndice) already match parameter names between header and implementation.

To resolve, pick one of the following:

  • Rename the header parameter to rankCount
  • Rename the .cu parameter to epSize

Example header change (rename to rankCount):

--- a/cpp/tensorrt_llm/kernels/moePrepareKernels.h
+++ b/cpp/tensorrt_llm/kernels/moePrepareKernels.h
@@ -80,7 +80,7 @@ void memsetExpertIds(int* expertIds, int* recvCountsCumsum, int maxTokenCountPe
-    int epSize, cudaStream_t stream);
+    int rankCount, cudaStream_t stream);

Files to update:

  • cpp/tensorrt_llm/kernels/moePrepareKernels.h (declaration)
  • cpp/tensorrt_llm/kernels/moePrepareKernels.cu (definition)

Verification script (already run) confirms only memsetExpertIds is mismatched:

rg -nP 'void\s+memsetExpertIds\s*\(' -C2 cpp/tensorrt_llm/kernels
rg -nP 'void\s+computeCumsum\s*\(' -C2 cpp/tensorrt_llm/kernels
rg -nP 'void\s+moveIndice\s*\(' -C2 cpp/tensorrt_llm/kernels
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (2)

353-358: Use self.num_slots for slot_count for clarity (even if equal to num_experts).

In this module num_slots == num_experts, but conceptually they differ in other backends. Passing self.num_slots improves consistency with WideEP and reduces future diff churn.

-                max_num_token, self.ep_rank, self.ep_size, self.num_experts,
-                self.num_experts, top_k)
+                max_num_token, self.ep_rank, self.ep_size, self.num_experts,
+                self.num_slots, top_k)

373-376: Minor: maintain x_row only once.

You set x_row earlier and reassign from x_sf.shape[0] here. Both should match; consider asserting equality or dropping the reassignment to avoid hidden skew if upstream changes.

tensorrt_llm/_mnnvl_utils.py (1)

339-347: Type annotation mismatch: local_gather_indices can be None.

You assign None but the dataclass types it as torch.Tensor. Make it Optional to reflect reality and help static analyzers.

-from dataclasses import dataclass
-from typing import List, Optional, Union
+from dataclasses import dataclass
+from typing import List, Optional, Union
@@
-@dataclass
-class MoEAlltoallInfo:
-    local_gather_indices: torch.Tensor
+@dataclass
+class MoEAlltoallInfo:
+    local_gather_indices: Optional[torch.Tensor]

Also applies to: 433-444

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)

465-476: Two-stage prepare/dispatch API looks clean; minor naming nit.

alltoall_prepare returns token_selected_slots unchanged; consider returning only the new artifacts (gathered_*, alltoall_info) to reduce redundancy, or rename to prepare_slots_and_info if the intent is to emphasize slot-based routing.

Also applies to: 867-876


879-895: Dispatch + memset_expert_ids is correct; ensure scales can be None.

Downstream fused_moe accepts token_final_scales=None for the “apply-on-input” path. Your dispatch preserves that — good. Consider a brief comment here to clarify that token_final_scales may be None by design.


465-465: Ruff E501: one overlong line; wrap to meet 120 cols.

Wrap line 465 to satisfy linting.

cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (1)

296-321: Optional: avoid reinterpret_cast<float const*> on int storage when reading back scales.

Reading floats through an int* via reinterpret_cast violates aliasing rules; consider memcpy per element for safety in both verifyG2SResults and verifyS2GResults.

Also applies to: 518-526

cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (4)

38-40: Clarify or remove “do we use this?” in public header.

Question-style comments in public headers age poorly. Either justify the member or remove the aside.


109-120: Avoid shadowing ‘alignedUnitBytes’ inside getFieldUncompactSize.

Shadowing the name with a constexpr of the same identifier harms readability. Use a distinct constant.

-        if (alignedUnitBytes != 16)
-        {
-            constexpr int alignedUnitBytes = BYTES_PER_16B_BLOCK;
-            currentFieldSize = currentFieldSize / alignedUnitBytes * alignedUnitBytes;
-            currentFieldSize += alignedUnitBytes * 2;
-        }
+        if (alignedUnitBytes != BYTES_PER_16B_BLOCK)
+        {
+            constexpr int kAlignedBytes = BYTES_PER_16B_BLOCK;
+            currentFieldSize = currentFieldSize / kAlignedBytes * kAlignedBytes;
+            currentFieldSize += kAlignedBytes * 2;
+        }

485-491: Typo: mappedMemoryrank → mappedMemoryRank.

Minor readability polish in pointer arithmetic helper.

-        int mappedMemoryrank = isSenderSideBuffer ? pairInfo.senderRank : pairInfo.receiverRank;
+        int mappedMemoryRank = isSenderSideBuffer ? pairInfo.senderRank : pairInfo.receiverRank;
-        auto* mappedMemory = reinterpret_cast<uint8_t*>(workspacePtr + mappedMemoryrank * rankStrideInU64);
+        auto* mappedMemory = reinterpret_cast<uint8_t*>(workspacePtr + mappedMemoryRank * rankStrideInU64);

292-300: Fix typos in diagnostics and variable name.

“at lease” → “at least”; “perferredChannel” → “preferredChannel”.

-        TLLM_CHECK_WITH_INFO(
-            blockCountPerChannel <= smCount, "GPU should support at lease one channel, usableSmCount=%d", smCount);
-        int perferredChannel = smCount / 2 / blockCountPerChannel; // use half SMs for communication
-        int channelCount = std::max(perferredChannel, 1);          // at lease one channel
+        TLLM_CHECK_WITH_INFO(
+            blockCountPerChannel <= smCount, "GPU should support at least one channel, usableSmCount=%d", smCount);
+        int preferredChannel = smCount / 2 / blockCountPerChannel; // use half SMs for communication
+        int channelCount = std::max(preferredChannel, 1);          // at least one channel
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 8d7c3a5 and b507f72.

📒 Files selected for processing (19)
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1 hunks)
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (1 hunks)
  • cpp/tensorrt_llm/kernels/moeCommKernels.cu (0 hunks)
  • cpp/tensorrt_llm/kernels/moeCommKernels.h (0 hunks)
  • cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h (1 hunks)
  • cpp/tensorrt_llm/kernels/moePrepareKernels.cu (7 hunks)
  • cpp/tensorrt_llm/kernels/moePrepareKernels.h (3 hunks)
  • cpp/tensorrt_llm/thop/moeCommOp.cpp (6 hunks)
  • cpp/tensorrt_llm/thop/moeLoadBalanceOp.cpp (0 hunks)
  • cpp/tests/unit_tests/kernels/CMakeLists.txt (1 hunks)
  • cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (1 hunks)
  • tensorrt_llm/_mnnvl_utils.py (5 hunks)
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (2 hunks)
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py (2 hunks)
  • tensorrt_llm/_torch/models/modeling_speculative.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (5 hunks)
  • tests/integration/test_lists/waives.txt (0 hunks)
  • tests/unittest/_torch/thop/test_moe_alltoall.py (9 hunks)
💤 Files with no reviewable changes (4)
  • cpp/tensorrt_llm/thop/moeLoadBalanceOp.cpp
  • tests/integration/test_lists/waives.txt
  • cpp/tensorrt_llm/kernels/moeCommKernels.h
  • cpp/tensorrt_llm/kernels/moeCommKernels.cu
🚧 Files skipped from review as they are similar to previous changes (8)
  • tensorrt_llm/_torch/models/modeling_speculative.py
  • cpp/tests/unit_tests/kernels/CMakeLists.txt
  • cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • cpp/tensorrt_llm/kernels/moePrepareKernels.cu
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • cpp/tensorrt_llm/thop/moeCommOp.cpp
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • tensorrt_llm/_mnnvl_utils.py
  • tests/unittest/_torch/thop/test_moe_alltoall.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • tensorrt_llm/_mnnvl_utils.py
  • tests/unittest/_torch/thop/test_moe_alltoall.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • cpp/tensorrt_llm/kernels/moePrepareKernels.h
  • cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/kernels/moePrepareKernels.h
  • cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/kernels/moePrepareKernels.h
  • cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<FILENAME>_H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/tensorrt_llm/kernels/moePrepareKernels.h
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
🧠 Learnings (6)
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • tests/unittest/_torch/thop/test_moe_alltoall.py
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.

Applied to files:

  • cpp/tensorrt_llm/kernels/moePrepareKernels.h
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
📚 Learning: 2025-08-12T10:28:57.320Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-12T10:28:57.320Z
Learning: Applies to **/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py} : Prepend NVIDIA copyright header (current year) to all source files

Applied to files:

  • cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
📚 Learning: 2025-08-12T10:28:57.320Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-12T10:28:57.320Z
Learning: Applies to **/*.{h,hpp,hxx,hh,cuh} : Header files must use include guards named TRTLLM_<FILENAME>_H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)

Applied to files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
📚 Learning: 2025-08-08T05:06:31.596Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.596Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.

Applied to files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
🧬 Code graph analysis (7)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (3)
tensorrt_llm/_torch/utils.py (2)
  • _ (190-196)
  • shape (103-104)
tensorrt_llm/_mnnvl_utils.py (3)
  • MnnvlMoe (350-617)
  • mnnvl_moe_alltoallv_prepare_without_allgather (399-443)
  • mnnvl_moe_alltoallv (528-589)
tensorrt_llm/quantization/utils/fp8_utils.py (1)
  • ceil_div (10-21)
tensorrt_llm/_mnnvl_utils.py (1)
tensorrt_llm/python_plugin.py (2)
  • workspace (331-333)
  • workspace (336-338)
tests/unittest/_torch/thop/test_moe_alltoall.py (2)
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (16)
  • _ (13-56)
  • _ (60-62)
  • _ (65-68)
  • _ (71-76)
  • _ (79-84)
  • _ (87-99)
  • _ (102-107)
  • _ (110-115)
  • _ (118-123)
  • _ (126-138)
  • _ (141-147)
  • _ (150-151)
  • _ (154-157)
  • _ (161-162)
  • _ (165-166)
  • _ (169-180)
tensorrt_llm/_mnnvl_utils.py (1)
  • mnnvl_moe_alltoallv_prepare_without_allgather (399-443)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
tensorrt_llm/_mnnvl_utils.py (8)
  • MnnvlMemory (53-336)
  • initialize (91-100)
  • MnnvlMoe (350-617)
  • get_moe_workspaces (358-373)
  • get_moe_prepare_workspace (376-387)
  • mnnvl_moe_alltoallv_prepare_without_allgather (399-443)
  • MoEAlltoallInfo (340-347)
  • mnnvl_moe_alltoallv (528-589)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (3)
  • is_static_routing (350-351)
  • is_static_routing (759-761)
  • get_local_statistic_tensor (563-577)
cpp/tensorrt_llm/kernels/moePrepareKernels.h (2)
cpp/tensorrt_llm/kernels/moePrepareKernels.cu (6)
  • computeCumsum (304-310)
  • computeCumsum (304-304)
  • moveIndice (312-320)
  • moveIndice (312-314)
  • memsetExpertIds (322-329)
  • memsetExpertIds (322-323)
cpp/tensorrt_llm/thop/moeCommOp.cpp (2)
  • memsetExpertIds (225-243)
  • memsetExpertIds (225-226)
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (3)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (2)
  • kernels (27-207)
  • computeWorkspaceSizePreRank (520-526)
cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h (1)
  • kernels (22-46)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (12)
  • tokenCount (736-776)
  • tokenCount (736-736)
  • tokenCount (788-839)
  • tokenCount (788-788)
  • launchSingleG2S (1137-1152)
  • launchSingleG2S (1137-1138)
  • launchSingleS2G (1190-1205)
  • launchSingleS2G (1190-1191)
  • launchLoopback (1273-1294)
  • launchLoopback (1273-1275)
  • launchLocalFifoSendRecv (1335-1367)
  • launchLocalFifoSendRecv (1335-1338)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (3)
cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h (2)
  • tensorrt_llm (20-47)
  • kernels (22-46)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (24)
  • void (177-215)
  • void (872-936)
  • void (1097-1135)
  • void (1154-1188)
  • void (1207-1270)
  • void (1297-1333)
  • fillFieldInfo (177-177)
  • setMaxUsableSmCount (1070-1073)
  • setMaxUsableSmCount (1070-1070)
  • int (344-350)
  • getFusedMoeCommWorkspaceSize (1075-1080)
  • getFusedMoeCommWorkspaceSize (1075-1075)
  • fillMetaInfo (952-959)
  • fillMetaInfo (952-953)
  • fillFieldPlacementInfo (961-987)
  • fillFieldPlacementInfo (961-961)
  • initializeLocalWorkspace (989-1001)
  • initializeLocalWorkspace (989-989)
  • moeAllToAll (1003-1065)
  • moeAllToAll (1003-1003)
  • constructWorkspace (1082-1087)
  • constructWorkspace (1082-1082)
  • initializeFusedMoeLocalWorkspace (1089-1092)
  • initializeFusedMoeLocalWorkspace (1089-1089)
cpp/include/tensorrt_llm/common/cudaUtils.h (1)
  • getMultiProcessorCount (393-400)
🪛 Ruff (0.12.2)
tests/unittest/_torch/thop/test_moe_alltoall.py

526-526: Line too long (236 > 120)

(E501)


528-528: Line too long (248 > 120)

(E501)


606-606: Line too long (186 > 120)

(E501)


607-607: Line too long (136 > 120)

(E501)

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py

465-465: Line too long (124 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (12)
tests/unittest/_torch/thop/test_moe_alltoall.py (3)

125-139: Good parameter coverage for multi-rank smoke/large tests.


236-239: Reference reconstruction logic is correct but fragile to future shape changes.

The global output_indice_offset assumes each token contributes exactly once across ranks. If future tests exercise multi-lane routes (top-k > 1), this needs adjustment to accumulate per-rank-per-tensor counts.

Would you like me to extend the test to a top-k>1 variant and validate the reconstruction order?


271-286: Nice parallel a2a invocation and reconstruction with per-rank streams.

Also applies to: 290-306

cpp/tensorrt_llm/kernels/moePrepareKernels.h (1)

35-36: Nice: replaced macro with a constexpr for threads-per-pipeline.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)

363-371: Good: unified multi-tensor alltoallv dispatch and post-population of expert IDs.

This matches the fused path and removes the padding hacks.

tensorrt_llm/_mnnvl_utils.py (3)

370-373: Nice: explicit workspace init + barrier avoids first-use races.


528-590: Multi-tensor a2a API: solid handling of None entries and single-tensor round-trip.

The validation and restoration logic look correct.


602-615: Combine path correctly switches to list-based moe_comm and unpacks.

The reshape-and-sum pattern matches the expected top-k combine semantics.
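For readers unfamiliar with that pattern, a standalone sketch of the top-k combine it refers to (the flattened layout with top_k rows contiguous per token is an assumption for illustration):

import torch

tokens, top_k, hidden = 8, 4, 64
partials = torch.randn(tokens * top_k, hidden)              # received partial outputs
combined = partials.view(tokens, top_k, hidden).sum(dim=1)  # accumulate over top_k
print(combined.shape)                                       # torch.Size([8, 64])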

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)

196-202: Good: allocate both a2a and prepare workspaces under MNNVL.

This mirrors the fused path expectations.


299-301: Always-on alltoall gating for MNNVL is reasonable.

This simplifies control flow for the fused path.

cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (2)

315-329: Confirm launch dimension policy.

The 2 in grid.z encodes both directions; ensure kernels assume grid.z==2 and no divergence when hasBasicFields=false. If not guaranteed, guard by template parameter or assert.

Would you like a quick audit script to grep kernel launch sites and ensure consistent interpretation of grid.z across kernels?


1-15: Adopt SPDX header.

Align with repository convention: replace the block header with SPDX lines.

-/*
- * Copyright (c) 2019-2025, NVIDIA CORPORATION.  All rights reserved.
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
+// SPDX-FileCopyrightText: Copyright (c) 2019-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// SPDX-License-Identifier: Apache-2.0
⛔ Skipped due to learnings
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-12T10:28:57.320Z
Learning: Applies to **/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py} : Prepend NVIDIA copyright header (current year) to all source files

@dongxuy04
Collaborator Author

/bot run --disable-fail-fast --reuse-test

@tensorrt-cicd
Collaborator

PR_Github #16282 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16268 [ run ] completed with state ABORTED

@tensorrt-cicd
Collaborator

PR_Github #16282 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12240 completed with status: 'FAILURE'

@kaiyux
Member

kaiyux commented Aug 24, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #16287 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16287 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12244 completed with status: 'FAILURE'

@dongxuy04
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #16296 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16296 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12251 completed with status: 'SUCCESS'

@dongxuy04 dongxuy04 merged commit 19a0ea3 into NVIDIA:main Aug 24, 2025
5 checks passed