KEMBAR78
[TRTLLM-7319][perf] Fuse slicing into MoE. by bobboli · Pull Request #6728 · NVIDIA/TensorRT-LLM · GitHub
Skip to content

Conversation

@bobboli
Copy link
Collaborator

@bobboli bobboli commented Aug 8, 2025

Summary by CodeRabbit

  • New Features

    • Added optional unpadded (original) hidden/column-size parameter (unpadded_hidden_size / unpadded_cols) across MoE C++ APIs, fused operators, Torch bindings, fake ops, and profiler; added shape_override for fused epilogue/scatter.
  • Bug Fixes

    • Finalization and finalize-scale kernels are padding-aware: reads/writes use padded vs unpadded dims with bounds checks to prevent out-of-bounds writes.
  • Refactor

    • Propagated padding-aware dimensions through routing, GEMM/fused operator paths, launchers, profiler, runtime call sites; added cstdint include.
  • Tests

    • Unit and profiler tests updated to validate unpadded hidden-size, expanded GEMM shapes, and new scenarios.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@bobboli bobboli requested review from a team as code owners August 8, 2025 03:40
@bobboli bobboli requested review from liji-nv and mikeiovine August 8, 2025 03:40
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Aug 8, 2025

📝 Walkthrough

Walkthrough

Threaded padded vs unpadded hidden/column sizes through MOE codepaths: added <cstdint> and a new int64_t unpadded_hidden_size / unpadded_cols parameters across headers, CUDA kernels, launchers, profiler, C++/Python bindings, plugin calls, fusion epilogue structs, tests and benchmarks; added shape_override threading and bounds checks for padding-aware finalization.

Changes

Cohort / File(s) Change Summary
Cutlass MOE Kernel APIs & Headers
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h, cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h
Added #include <cstdint>; inserted int64_t unpadded_hidden_size (and reordered params) across public/virtual/static signatures (runMoe, gemm1/gemm2, BlockScaleFC2, computeStridesTmaWarpSpecializedDispatch, setupTmaWarpSpecializedInputs, etc.) and propagated overrides.
Finalize Routing Kernels & Launcher
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu, cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
Replaced single cols with padded_cols + unpadded_cols; updated kernel and launcher signatures; added alignment asserts, bounds checks, padding-aware indexing and write guards.
GEMM Paths / Profiler Backend
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu, cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
Added int64_t mExpertUnpaddedHiddenSize to GemmProfilerBackend; init and internal profiler/GEMM2 calls accept and forward unpadded hidden-size.
Cutlass MOE CUDA Implementations & Launchers
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/*.cu, .../launchers/*.inl
Propagated unpadded/orig hidden-size through finalize/no-filling finalize, gemm2, BlockScaleFC2, loraFC1/FC2, runMoe, setupTma..., and updated TMA/Fusion launcher arg lists to include shape_override and unpadded sizes.
MixtureOfExperts Plugin (TensorRT path)
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
Under OSS CUTLASS branch, duplicated hidden-size argument (pass padded==unpadded with comment) into enqueue and GemmProfiler init calls; non-OSS branch unchanged.
Fused MoE C++ Op (THOP)
cpp/tensorrt_llm/thop/moeOp.cpp
Added optional unpadded_hidden_size to runMoe/runMoeMinLantency/runGemmProfile; compute effective unpadded value, adjust output shapes, and pass it into kernel/profiler calls.
MoE Finalize Scale Op (C++ / Torch Bindings)
cpp/tensorrt_llm/thop/moeUtilOp.cpp, tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
Extended finalize-scale host/launcher and Torch fake op to accept unpadded_hidden_size; final_output allocation uses unpadded dim; TORCH_LIBRARY fragment and bindings updated.
Python Custom Ops / API
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py, .../cpp_custom_ops.py
Added optional unpadded_hidden_size to fused MoE / moe_finalize_scale_op fake/real op signatures; MoERunner stores/forwards this value (defaults to hidden_size when absent).
Fused MoE Python Modules
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py, .../fused_moe_deepgemm.py, .../fused_moe_cute_dsl.py
Introduced self.unpadded_hidden_size, pass it into fused operator / finalize calls; removed some manual post-op cropping in Cutlass path; adjusted swizzle/scaling-factor flags.
Fusion / Scatter Epilogue Adjustments
cpp/tensorrt_llm/cutlass_extensions/include/.../sm90_visitor_scatter.hpp, cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h, .../moe_gemm_tma_warp_specialized_input.cu, .../moe_gemm_tma_ws_launcher.inl
Added MajorMode-based StrideIndex; added shape_override to Arguments/Params and FusedFinalizeEpilogue; threaded shape_override through argument packing, predication, and epilogue launch args.
Microbenchmarks / Tests
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h, cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
Updated benchmark/test call sites and profiler shape params to pass padded+unpadded hidden sizes; added mUnpaddedHiddenSize test member, adjusted validations/loops to use unpadded size, added new unpadded test case, widened profiler/GEMM shapes.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Py as Python API
    participant Op as Torch Op
    participant Cpp as C++ MoE Runner
    participant Prof as GemmProfilerBackend
    participant CUDA as CUDA Kernel

    Py->>Op: fused_moe(..., hidden_size(padded), unpadded_hidden_size)
    Op->>Cpp: runMoe(..., hidden_size(padded), unpadded_hidden_size, ...)
    alt profiling path
        Cpp->>Prof: init/runProfiler(..., expert_hidden_size=padded, expert_unpadded_hidden_size=unpadded, ...)
        Prof->>Cpp: profiler results
    end
    Cpp->>CUDA: finalizeMoeRoutingKernelLauncher(..., padded_cols, unpadded_cols, ...)
    CUDA-->>Cpp: final_output (shape uses unpadded_hidden_size)
    Cpp-->>Op: final_output
    Op-->>Py: final_output
Loading
sequenceDiagram
    autonumber
    participant Runner as CutlassMoeFCRunner
    participant Kernel as GEMM2 Kernel

    Runner->>Kernel: gemm2(..., hidden_size=padded, orig_hidden_size=unpadded, ...)
    Kernel-->>Runner: computed outputs (padding-aware reads/writes)
    Runner-->>Caller: merged/blocked outputs (finalization uses unpadded cols/hidden)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

SW Architecture

Suggested reviewers

  • hlu1
  • litaotju
  • yizhang-nv

Tip

🔌 Remote MCP (Model Context Protocol) integration is now available!

Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats.

✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai or @coderabbitai title anywhere in the PR title to generate the title automatically.

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@bobboli bobboli requested review from StudyingShao, djns99, hlu1 and jinyangyuan-nvidia and removed request for liji-nv and mikeiovine August 8, 2025 03:40
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🔭 Outside diff range comments (1)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (1)

2-2: Update copyright year to include 2025

According to the coding guidelines, all TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. Please update the copyright notice to include 2025.

- * Copyright (c) 2020-2023, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2020-2025, NVIDIA CORPORATION.  All rights reserved.
♻️ Duplicate comments (1)
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (1)

979-979: Same TODO comment issue as line 966.

This line has the same duplicated parameter and TODO comment as line 966. Ensure both code paths receive the same fix when implementing proper original hidden size handling.

🧹 Nitpick comments (2)
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h (1)

18-18: Use preprocessor guard instead of pragma once.

The coding guidelines specify using a preprocessor guard with the format TRTLLM_<FILENAME> instead of #pragma once.

Replace with:

-#pragma once
+#ifndef TRTLLM_MOE_KERNELS_H
+#define TRTLLM_MOE_KERNELS_H

And add the closing directive at the end of the file:

#endif // TRTLLM_MOE_KERNELS_H
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (1)

1777-1848: Well-implemented padded/original column size separation

The implementation correctly distinguishes between padded and original hidden sizes throughout the kernel. The assertions ensure validity, and the boundary check prevents out-of-bounds writes.

Consider adding a comment explaining why padded columns are used for strides while original columns determine write boundaries, as this distinction is critical for memory safety.

 __global__ void finalizeMoeRoutingKernel(GemmOutputType const* expanded_permuted_rows,
     OutputType* reduced_unpermuted_output, ScaleBiasType const* bias, float const* scales,
     int const* unpermuted_row_to_permuted_row, int const* token_selected_experts, int64_t const padded_cols,
     int64_t const orig_cols, int64_t const experts_per_token, int const num_experts_per_node, int const start_expert_id)
 {
+    // padded_cols is used for memory strides to ensure alignment
+    // orig_cols is the actual data size to prevent out-of-bounds writes
     assert(padded_cols % 4 == 0);
     assert(orig_cols % 4 == 0);
     assert(orig_cols <= padded_cols);
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2f2f5cc and 7e1b005.

📒 Files selected for processing (12)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (9 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (21 hunks)
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h (2 hunks)
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (2 hunks)
  • cpp/tensorrt_llm/thop/moeOp.cpp (10 hunks)
  • cpp/tensorrt_llm/thop/moeUtilOp.cpp (6 hunks)
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (2 hunks)
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (3 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/thop/moeUtilOp.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/tensorrt_llm/thop/moeOp.cpp
**/*.{cpp,h,hpp,cc,cxx}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...

Files:

  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h
  • cpp/tensorrt_llm/thop/moeUtilOp.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/tensorrt_llm/thop/moeOp.cpp
**/*.{h,hpp}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
🧠 Learnings (3)
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx,cu,py} : All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Applied to files:

  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{h,hpp} : Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.

Applied to files:

  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
🔇 Additional comments (36)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py (1)

238-239: LGTM! Proper implementation of original hidden size parameter.

The addition of self.original_hidden_size alongside the existing x.shape[1] parameter correctly implements the interface update to distinguish between padded and original hidden sizes. The comments clearly indicate the purpose of each parameter.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (1)

474-475: LGTM! Consistent implementation across MoE backends.

This change mirrors the implementation in the CuteDSL backend, correctly adding self.original_hidden_size as an additional parameter to distinguish from the padded hidden size. The consistency across different MoE backends is excellent for maintainability.

tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)

148-148: LGTM! Clean parameter addition for original hidden size support.

The new original_hidden_size parameter is properly added to both the function signature and fake registration with appropriate defaults, maintaining backward compatibility. The parameter is correctly passed through to the underlying MoE runner.

Also applies to: 229-229, 262-262

tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (2)

89-91: LGTM! Proper storage of original hidden size before padding.

The constructor correctly stores the original hidden size before any potential padding adjustments, which enables the fused MoE operation to handle output sizing correctly without manual slicing.


424-424: LGTM! Correct parameter passing to eliminate manual slicing.

The stored original_hidden_size is properly passed to the fused MoE custom operator, enabling the kernel to handle output tensor sizing correctly and eliminating the need for manual slicing operations.

tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1)

466-466: LGTM! Correct fake implementation update for original hidden size.

The fake registration properly adds the orig_hidden_size parameter and uses it for output tensor sizing instead of the padded hidden_size. This ensures consistency with the actual kernel implementation that now handles padded vs original sizes explicitly.

Also applies to: 475-477

cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h (1)

68-70: All Calls Updated to New Signature

I’ve verified that every invocation of finalizeMoeRoutingKernelLauncher now supplies both padded_cols and orig_cols:

  • In cpp/tensorrt_llm/thop/moeUtilOp.cpp (around line 241), the call passes hidden_size and orig_hidden_size.
  • In cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (lines 2988–2995 and 3306–3314), each call includes hidden_size (as padded_cols) and orig_hidden_size (as orig_cols).

No calls remain using the old single-cols parameter. This change is fully propagated—no further updates are needed here.

cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h (3)

1-16: LGTM! Copyright header is compliant.

The NVIDIA copyright header includes the current year (2025) as required by coding guidelines.


25-25: LGTM! Proper use of C++ standard header.

Adding <cstdint> follows the coding guideline to use C++ standard headers instead of C headers like <stdint.h>. This is likely needed for the int64_t type used in the new parameter.


797-799: Approved: orig_hidden_size Addition is Consistent Across MOE Code

Verified that the new orig_hidden_size parameter appears in all relevant MOE interfaces and matches the ordering in:

  • cpp/tensorrt_llm/thop/moeUtilOp.cpp
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h

No other BlockScaleFC2 overloads or call sites were found, so this single signature update aligns with the broader architectural change and requires no further modifications.

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (4)

1860-1955: Consistent implementation in NoFilling variant

The changes to finalizeMoeRoutingNoFillingKernel correctly mirror those in the regular kernel, maintaining consistency across both implementations.


1966-2014: Correct parameter propagation in launcher

The launcher function properly handles the new padded_cols and orig_cols parameters, passing them consistently to both kernel variants.


2021-2024: Template instantiation correctly updated

The macro instantiation properly includes the new column size parameters in the function signature.


2236-2239: Comprehensive and consistent function call updates

All function calls throughout the file have been correctly updated to pass both hidden_size (padded) and orig_hidden_size (original) parameters. This ensures memory operations use appropriate strides while respecting actual data boundaries.

Also applies to: 2989-2993, 3215-3218, 3308-3311, 3695-3700, 3800-3805, 4768-4771

cpp/tensorrt_llm/thop/moeUtilOp.cpp (7)

237-237: LGTM: Parameter addition follows consistent pattern.

The orig_hidden_size parameter is correctly positioned in the function signature, following the established pattern of placing it after hidden_size and before experts_per_token.


244-245: LGTM: Kernel launcher call correctly updated.

The finalizeMoeRoutingKernelLauncher call is properly updated to include the orig_hidden_size parameter in the correct position.


252-254: LGTM: Function signature correctly updated.

The function signature is properly updated to include the orig_hidden_size_param parameter, maintaining consistency with the Torch binding pattern.


258-258: LGTM: Parameter extraction follows established pattern.

The orig_hidden_size extraction using guard_int is consistent with how other SymInt parameters are handled in the function.


282-282: LGTM: Output tensor shape correctly updated.

The output tensor allocation now properly uses orig_hidden_size for the second dimension, which is the key improvement allowing the operator to produce correctly sized outputs without manual slicing.


296-297: LGTM: All function calls consistently updated.

All calls to runMoEFinalizeScaleOp across different data types (float32, bfloat16, half) are correctly updated to include the orig_hidden_size parameter.

Also applies to: 308-309, 319-320


343-345: LGTM: Torch library binding correctly updated.

The library binding signature properly includes the orig_hidden_size parameter in the correct position, ensuring the new parameter is accessible from Python code.

cpp/tensorrt_llm/thop/moeOp.cpp (8)

242-243: LGTM: Optional parameter correctly added.

The original_hidden_size parameter is properly added as an optional parameter, maintaining backward compatibility while enabling the new functionality.


316-316: LGTM: Original hidden size calculation is correct.

The calculation properly handles the optional parameter, defaulting to hidden_size when not provided, ensuring backward compatibility.


370-370: LGTM: Output tensor shape correctly updated.

The output tensor shape now uses orig_hidden_size for the second dimension, which is the key improvement allowing proper output sizing without manual slicing.


391-391: LGTM: Kernel runner calls consistently updated.

Both conditional branches (OSS and non-OSS Cutlass) properly pass the orig_hidden_size parameter to the kernel runner, ensuring consistent behavior across implementations.

Also applies to: 405-405


423-424: LGTM: Optional parameter correctly added to min latency method.

The original_hidden_size parameter is consistently added to the runMoeMinLantency method, maintaining the same pattern as the regular runMoe method.


483-483: LGTM: Consistent original hidden size calculation.

The calculation follows the same pattern as in runMoe, properly handling the optional parameter with a sensible default.


520-520: LGTM: Output tensor shape correctly calculated for min latency mode.

The output tensor shape correctly uses orig_hidden_size while maintaining the proper multiplier for min latency mode (num_rows * num_experts_on_rank), which is different from the regular mode but appropriate for this method's semantics.


550-550: LGTM: Kernel runner calls consistently updated in min latency method.

Both conditional branches properly pass the orig_hidden_size parameter to the kernel runner, maintaining consistency across OSS and non-OSS implementations.

Also applies to: 564-564

cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (7)

24-24: LGTM! Proper use of C++ standard header.

The addition of <cstdint> follows the coding guidelines by using the C++ standard header instead of the C header <stdint.h>.


455-459: Interface extension properly handles original vs padded hidden size.

The addition of orig_hidden_size parameter to the virtual runMoe method is well-placed and follows the existing parameter conventions. This change aligns with the PR objective of distinguishing between padded and original hidden sizes.


484-489: Consistent parameter addition to gemm2 interface.

The orig_hidden_size parameter addition to gemm2 maintains consistency with the runMoe method signature change.


611-615: Correct implementation of interface override.

The runMoe override properly implements the updated interface with the new orig_hidden_size parameter.


646-650: Consistent parameter propagation in gemm2 methods.

Both the static gemm2 method and its override wrapper correctly propagate the orig_hidden_size parameter, maintaining consistency throughout the call chain.

Also applies to: 683-699


836-838: BlockScaleFC2 signature properly updated.

The static BlockScaleFC2 method for FP8 block scaling correctly includes the orig_hidden_size parameter in the same position as other methods.


959-959: Verify initialization of mExpertOrigHiddenSize in GemmProfilerBackend::init()

The new member mExpertOrigHiddenSize is named and zero-initialized correctly according to our C++ coding guidelines. However, I wasn’t able to locate any assignment to it in the implementation of GemmProfilerBackend::init().

Please manually verify in your C++ sources that GemmProfilerBackend::init() includes something like:

mExpertOrigHiddenSize = /* appropriate hidden size value */;

– Search for GemmProfilerBackend::init() across your .cpp/.cc files.
– Confirm mExpertOrigHiddenSize is initialized alongside mExpertHiddenSize.

Copy link
Collaborator

@djns99 djns99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, a couple of suggestions and comments.
It might also be valuable to consider adding an option to also fuse the padding with expandInputRowsKernel, but this is not really required for this MR

Would it be possible to update the C++ tests to add coverage for this case?

@bobboli bobboli force-pushed the cutlass_fuse_slicing branch from 7e1b005 to fc320a6 Compare August 8, 2025 06:13
@bobboli bobboli force-pushed the cutlass_fuse_slicing branch from 998d64b to b12be8d Compare August 13, 2025 13:17
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🔭 Outside diff range comments (8)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (5)

167-183: Add runtime validation for unpadded_hidden_size.

Guard against invalid values early to surface clear errors and avoid mismatches downstream.

     else:
         assert tuner_num_tokens is None
         assert tuner_top_k is None
         tuner_input = input
         tuner_top_k = token_selected_experts.size(1)
 
+    # Validate unpadded_hidden_size if provided
+    if unpadded_hidden_size is not None:
+        if not isinstance(unpadded_hidden_size, int):
+            raise TypeError("unpadded_hidden_size must be an int when provided")
+        max_hidden = int(fc2_expert_weights.shape[1])
+        if not (0 < unpadded_hidden_size <= max_hidden):
+            raise ValueError(
+                f"unpadded_hidden_size ({unpadded_hidden_size}) must be in (0, {max_hidden}]"
+            )
+
     # allocate workspace for profiling
     moe_runner = MoERunner(

237-266: Fake schema is missing tuner_num_tokens and tuner_top_k; must match custom op exactly.

The fake registration omits tuner_num_tokens and tuner_top_k, which exist in the real op. Torch fake must match the op schema positionally and in defaults. This will break torch.compile and fake mode.

Apply this diff to align the schema and prepare for using unpadded_hidden_size in shape inference:

 @torch.library.register_fake("trtllm::fused_moe")
 def _(
     input: torch.Tensor,
     token_selected_experts: torch.Tensor,
     token_final_scales: torch.Tensor,
     fc1_expert_weights: torch.Tensor,
     fc1_expert_biases: Optional[torch.Tensor],
     fc2_expert_weights: torch.Tensor,
     fc2_expert_biases: Optional[torch.Tensor],
     output_dtype: torch.dtype,
     quant_scales: List[torch.Tensor],
     input_sf: Optional[torch.Tensor] = None,
     swizzled_input_sf: bool = True,
     swiglu_alpha: Optional[torch.Tensor] = None,
     swiglu_beta: Optional[torch.Tensor] = None,
     swiglu_limit: Optional[torch.Tensor] = None,
     tp_size: int = 1,
     tp_rank: int = 0,
     ep_size: int = 1,
     ep_rank: int = 0,
     cluster_size: int = 1,
     cluster_rank: int = 0,
     enable_alltoall: bool = False,
     use_deepseek_fp8_block_scale: bool = False,
     use_w4_group_scaling: bool = False,
     use_mxfp8_act_scaling: bool = False,
     min_latency_mode: bool = False,
     tune_max_num_tokens: int = 8192,
+    tuner_num_tokens: Optional[int] = None,
+    tuner_top_k: Optional[int] = None,
     unpadded_hidden_size: Optional[int] = None,
 ):

267-283: Use unpadded_hidden_size in fake tensor shape inference.

When slicing is fused, the fake should reflect the unpadded output hidden size to keep Compile/FX shapes consistent. Otherwise, graph shape will be padded while runtime returns sliced tensors.

Apply this diff to make shapes consistent:

 def _(
@@
 ):
-    seq_len = input.shape[0]
-    hidden_size = fc2_expert_weights.shape[1]
+    seq_len = int(input.shape[0])
+    padded_hidden_size = int(fc2_expert_weights.shape[1])
+    if unpadded_hidden_size is not None:
+        if not (0 < int(unpadded_hidden_size) <= padded_hidden_size):
+            raise ValueError(
+                f"unpadded_hidden_size ({unpadded_hidden_size}) must be in (0, {padded_hidden_size}]"
+            )
+    out_hidden_size = int(unpadded_hidden_size) if unpadded_hidden_size is not None else padded_hidden_size
 
     if min_latency_mode:
         num_experts_on_rank = fc2_expert_weights.shape[0]
-        output_shape = [seq_len * num_experts_on_rank, hidden_size]
+        output_shape = [seq_len * num_experts_on_rank, out_hidden_size]
         experts_to_token_score_shape = [num_experts_on_rank, seq_len]
         active_expert_global_ids_shape = [num_experts_on_rank]
         return [
             input.new_empty(output_shape, dtype=output_dtype),
             input.new_empty([1], dtype=torch.int32),
             input.new_empty(experts_to_token_score_shape, dtype=torch.float32),
             input.new_empty(active_expert_global_ids_shape, dtype=torch.int32),
         ]
     else:
-        return [input.new_empty([seq_len, hidden_size], dtype=output_dtype)]
+        return [input.new_empty([seq_len, out_hidden_size], dtype=output_dtype)]

692-702: Bug: register_fake references undefined variable act_fp8.

The fake for w4a8_mxfp4_fp8_gemm defines act_fp4 but returns act_fp8.new_empty(...), which will raise NameError when invoked.

Apply this diff to align the parameter name with the custom op and fix the reference:

 @w4a8_mxfp4_fp8_gemm.register_fake
 def _(
-    act_fp4: torch.Tensor,
+    act_fp8: torch.Tensor,
     weight: torch.Tensor,
     act_sf: torch.Tensor,
     weight_scale: torch.Tensor,
     alpha: torch.Tensor,
     output_dtype: torch.dtype,
     to_userbuffers: bool = False,
 ) -> torch.Tensor:
-    return act_fp8.new_empty((act_fp8.size(0), weight.size(0)),
-                             dtype=output_dtype)
+    return act_fp8.new_empty((act_fp8.size(0), weight.size(0)),
+                             dtype=output_dtype)

1-1: Add NVIDIA SPDX copyright header to Python source

The file tensorrt_llm/_torch/custom_ops/torch_custom_ops.py is missing the required NVIDIA header. Insert the canonical two‐line SPDX header above all code (i.e. before the first from):

File: tensorrt_llm/_torch/custom_ops/torch_custom_ops.py

+ # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
  from functools import lru_cache
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (3)

1791-1810: Finalize kernel: alignment assertions are too weak; tie them to the vectorization width.

padded/orig are asserted to be multiples of 4, but the kernel computes FINALIZE_ELEM_PER_THREAD as 128 / min(sizeof(OutputType), sizeof(GemmOutputType)). For half/bf16 it’s 8, for float it’s 4. If padded/orig are multiples of 4 but not 8 (half path), vectorized loads/stores will silently truncate columns due to integer division.

Strengthen the assertions to match FINALIZE_ELEM_PER_THREAD (and place them after it’s computed).

Apply:

-    assert(padded_cols % 4 == 0);
-    assert(orig_cols % 4 == 0);
-    assert(orig_cols <= padded_cols);
+    // After FINALIZE_ELEM_PER_THREAD is defined below, assert col counts align with the vector width

Then immediately after FINALIZE_ELEM_PER_THREAD is defined:

     constexpr int64_t FINALIZE_ELEM_PER_THREAD
         = 128 / std::min(sizeof_bits<OutputType>::value, sizeof_bits<GemmOutputType>::value);
+    assert(padded_cols % FINALIZE_ELEM_PER_THREAD == 0);
+    assert(orig_cols % FINALIZE_ELEM_PER_THREAD == 0);
+    assert(orig_cols <= padded_cols);

1870-1880: No-Filling finalize kernel: align the assertions with vectorization width.

Same concern as the filling kernel: replace the hard-coded “%4==0” checks with % FINALIZE_ELEM_PER_THREAD == 0, placed after FINALIZE_ELEM_PER_THREAD is computed.

Apply:

-    assert(padded_cols % 4 == 0);
-    assert(orig_cols % 4 == 0);
-    assert(orig_cols <= padded_cols);
+    // see below after FINALIZE_ELEM_PER_THREAD for alignment checks

And after FINALIZE_ELEM_PER_THREAD is defined:

+    assert(padded_cols % FINALIZE_ELEM_PER_THREAD == 0);
+    assert(orig_cols % FINALIZE_ELEM_PER_THREAD == 0);
+    assert(orig_cols <= padded_cols);

3216-3331: gemm2: add runtime validation for hidden sizes; and mirror the stronger col-alignment checks.

Before launching finalize kernels, guard that orig_hidden_size <= hidden_size. Optionally, when not using TMA fused finalize, assert 4-element alignment for both (mirrors the launcher checks).

Apply:

 void CutlassMoeFCRunner<...>::gemm2(...,
-    int64_t const expanded_num_rows, int64_t const hidden_size, int64_t const orig_hidden_size,
+    int64_t const expanded_num_rows, int64_t const hidden_size, int64_t const orig_hidden_size,
     int64_t const inter_size, ...)
 {
+    TLLM_CHECK_WITH_INFO(orig_hidden_size <= hidden_size,
+        "orig_hidden_size (%ld) must be <= hidden_size (%ld)", (long) orig_hidden_size, (long) hidden_size);
+    // Common alignment sanity (exact vector width is type-dependent and enforced in kernel)
+    TLLM_CHECK_WITH_INFO((hidden_size % 4) == 0 && (orig_hidden_size % 4) == 0,
+        "Finalize expects 4-element alignment for hidden sizes");
🧹 Nitpick comments (6)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (3)

74-81: Return type inconsistency: List[int] vs range.

get_valid_tactics is annotated to return List[int] but returns a range. Minor, but can confuse type checkers and callers.

-        return range(self.fused_moe_runner.get_tactic_num())
+        return list(range(self.fused_moe_runner.get_tactic_num()))

632-644: Default mismatch between op and fake (out_dtype).

fp8_batched_gemm_trtllmgen op defaults out_dtype to torch.half, but the fake defaults to None. Keep defaults identical to avoid tracing/fake inconsistencies.

-    out_dtype: Optional[torch.dtype] = None
+    out_dtype: Optional[torch.dtype] = torch.half

119-151: Reminder: router scales must be non-null when finalize fusion is enabled (CUTLASS epilogue contract).

This op requires token_final_scales (router scales). Ensure callers never pass None here when the finalize fusion is active, since downstream epilogue expects a valid pointer.

If you want, I can scan the Python call sites to ensure token_final_scales is always provided when using the fused finalize path.

cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (1)

1-15: Nit: Copyright year.

This header still says 2020-2023. Other files are already at 2020-2025. Consider updating for consistency.

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (2)

1976-2010: Launcher: host-side guards would improve failure modes.

You already assert inside kernels; adding host-side checks makes errors fail fast (before launch) and with clearer messages.

Apply:

 void finalizeMoeRoutingKernelLauncher(..., int64_t const num_rows,
     int64_t const padded_cols, int64_t const orig_cols, int64_t const experts_per_token,
     int64_t const num_experts_per_node, ...)
 {
+    TLLM_CHECK_WITH_INFO(orig_cols <= padded_cols, "orig_cols (%ld) must be <= padded_cols (%ld)",
+        (long) orig_cols, (long) padded_cols);
+    // The kernel vectorization width depends on OutputType/GemmOutputType; we can't compute it here.
+    // Still, cheaply assert common alignments.
+    TLLM_CHECK_WITH_INFO((padded_cols % 4) == 0 && (orig_cols % 4) == 0,
+        "Expect 4-element alignment for finalize paths");

2032-2034: Nit: unify parameter naming (actual_cols vs orig_cols).

The explicit instantiation uses “actual_cols” while the function uses “orig_cols”. Consider renaming for consistency.

-        int64_t const actual_cols, int64_t const experts_per_token, int64_t const num_experts_per_node,
+        int64_t const orig_cols, int64_t const experts_per_token, int64_t const num_experts_per_node,
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 998d64b and b12be8d.

📒 Files selected for processing (12)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (9 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (21 hunks)
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h (2 hunks)
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (2 hunks)
  • cpp/tensorrt_llm/thop/moeOp.cpp (10 hunks)
  • cpp/tensorrt_llm/thop/moeUtilOp.cpp (6 hunks)
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (2 hunks)
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (3 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
  • cpp/tensorrt_llm/thop/moeUtilOp.cpp
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • cpp/tensorrt_llm/thop/moeOp.cpp
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
🧠 Learnings (5)
📓 Common learnings
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.685Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
📚 Learning: 2025-08-09T20:57:04.067Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.067Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-08T22:03:40.685Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.685Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-08T05:10:38.893Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.893Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-08T05:06:31.537Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.537Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
🔇 Additional comments (9)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)

231-233: Native runner ABI matches updated Python calls

Verified that both FusedMoeRunner::runMoe (lines 233–243) and FusedMoeRunner::runMoeMinLantency (lines 414–424) in cpp/tensorrt_llm/thop/moeOp.cpp include the trailing unpadded_hidden_size parameter. The Python calls in torch_custom_ops.py now align with the C++ ABI. Approved.

cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h (1)

69-71: All finalizeMoeRoutingKernelLauncher calls now include both padded_cols and orig_cols in the correct order

I’ve checked every invocation in:

  • cpp/tensorrt_llm/thop/moeUtilOp.cpp (runMoEFinalizeScaleOp)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (device‐side and fused‐epilogue paths)

Each passes
… , num_rows, hidden_size (padded_cols), orig_hidden_size (orig_cols), experts_per_token, num_experts_per_node, …

No stale or mis-ordered calls remain.

cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (3)

24-24: Good addition of include.

Needed for the newly threaded int64_t params across interfaces.


455-460: runMoe interface: new orig_hidden_size placement is consistent.

Placing orig_hidden_size immediately after hidden_size is clear and minimizes churn downstream.


836-839: BlockScaleFC2 signature sync looks good.

Passing both hidden_size and orig_hidden_size downstream to finalization is necessary for padding-aware writes.

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (4)

1842-1849: Bias stride assumes bias is padded; verify allocation/layout.

bias_ptr = bias_v + expert_id * num_elems_in_padded_col; strides by padded_cols, not orig_cols. This is correct only if fc2 biases are materialized per-expert at the padded width. If biases remain unpadded (orig_cols), this will read OOB.

Confirm bias tensors are allocated/laid out with padded_cols. If not, stride using num_elems_in_orig_col and gate accesses accordingly.

Would you like me to scan plugin/runner allocation paths in the repo to confirm the fc2 bias buffers are padded to hidden_size?


1927-1931: LGTM: bounds guard prevents writes beyond orig_cols.

Looping over padded elements and early-continue for indices >= orig-cols is a safe pattern for the no-filling path.


2985-3003: BlockScaleFC2 → finalize: threading (hidden_size, orig_hidden_size) is correct.

This correctly passes padded and original widths to the launcher. No issues spotted.


3999-4011: Fused finalize epilogue: router scales are correctly non-null; validate output shape parameters.

Good: setFinalizeFusionParams receives permuted_token_final_scales_ (non-null when fusion enabled), which satisfies CUTLASS fusion’s implicit contract.

Minor: please re-check the ordering of (hidden_size, num_rows). Given the transpose in the fused epilogue, rows map to hidden_size and columns to num_output_tokens; num_rows here should be num_output_tokens.

If helpful, I can cross-check the TMA setFinalizeFusionParams signature in the CUTLASS extension and confirm the expected argument order.

@bobboli bobboli force-pushed the cutlass_fuse_slicing branch from b12be8d to f6d9323 Compare August 13, 2025 15:40
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (3)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (2)

959-963: Initialize mExpertOrigHiddenSize to prevent undefined behavior

The mExpertOrigHiddenSize member is declared but never initialized in the init() method, which will cause the profiler's gemm2 to receive 0 as orig_hidden_size, breaking finalization when no columns are written.

Apply this diff to initialize the member properly:

 void init(..., int64_t hidden_size, int64_t inter_size, int64_t group_size, ...)
 {
     ...
     mExpertHiddenSize = hidden_size;
+    mExpertOrigHiddenSize = hidden_size; // default: no padding unless caller overrides later
     ...
 }

Additionally, consider adding a public setter method if the original hidden size needs to be configured differently:

+void setOrigHiddenSize(int64_t orig_hidden_size)
+{
+    mExpertOrigHiddenSize = orig_hidden_size;
+}

484-490: Add runtime validation for orig_hidden_size

Consider adding a runtime check to ensure orig_hidden_size <= hidden_size to prevent potential buffer overrun issues when the original size exceeds the padded size.

Apply this diff to add validation in the implementation:

 virtual void gemm2(...,
         int64_t const hidden_size, int64_t const orig_hidden_size, int64_t const inter_size,
         int const num_experts_per_node, ...
 ) = 0;
+// In the implementation (moe_kernels.cu), add:
+TLLM_CHECK_WITH_INFO(orig_hidden_size <= hidden_size,
+    "orig_hidden_size (%ld) must be <= hidden_size (%ld)", (long) orig_hidden_size, (long) hidden_size);
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (1)

4800-4801: Profiler: ensure mExpertOrigHiddenSize is initialized before use

You now pass mExpertOrigHiddenSize to gemm2. Confirm it's set in GemmProfilerBackend::init (and/or constructor) alongside mExpertHiddenSize to prevent UB.

Search for initialization:

#!/bin/bash
rg -n "mExpertOrigHiddenSize"
rg -n "GemmProfilerBackend::init"
🧹 Nitpick comments (5)
cpp/tensorrt_llm/thop/moeOp.cpp (2)

246-248: API naming consistency: prefer orig_hidden_size over unpadded_hidden_size

Everywhere else in the stack (runners/kernels/profiler) refers to this concept as orig_hidden_size. Consider renaming the new argument from unpadded_hidden_size to orig_hidden_size for consistency across the public surface.

-        bool const enable_alltoall, bool min_latency_mode, torch::optional<c10::ArrayRef<int64_t>> const& profile_ids,
-        torch::optional<int64_t> const& unpadded_hidden_size)
+        bool const enable_alltoall, bool min_latency_mode, torch::optional<c10::ArrayRef<int64_t>> const& profile_ids,
+        torch::optional<int64_t> const& orig_hidden_size)

And similarly for runMoeMinLantency.

Also applies to: 427-429


2-2: Update copyright year

Coding guidelines require the current year. This file shows 2022-2024; please update to 2025.

- * Copyright (c) 2022-2024, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2022-2025, NVIDIA CORPORATION.  All rights reserved.
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (3)

1883-1889: NoFilling finalize: avoid per-iteration bounds check for better throughput

Minor micro-optimization: iterate only up to num_elems_in_orig_col to drop the inner branch.

-        for (int elem_index = start_offset; elem_index < num_elems_in_padded_col; elem_index += stride)
-        {
-            if (elem_index >= num_elems_in_orig_col)
-                continue; // Skip writing beyond original columns
+        int64_t const loop_end = num_elems_in_orig_col;
+        for (int elem_index = start_offset; elem_index < loop_end; elem_index += stride)
+        {
             ...
         }

Also applies to: 1929-1931, 1941-1944, 1962-1971


1886-1889: Mirror device asserts with host-side checks

Kernels assert padded_cols % 4 == 0 and orig_cols % 4 == 0. Add matching host checks where orig_hidden_size is computed (moeOp.cpp) to fail fast and aid debugging (see suggested patch in moeOp.cpp).

Also applies to: 1806-1809


50-55: Header year is up-to-date; keep consistent across files

This file already shows 2020-2025. Please align moeOp.cpp accordingly.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b12be8d and f6d9323.

📒 Files selected for processing (12)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (9 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (22 hunks)
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h (2 hunks)
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (2 hunks)
  • cpp/tensorrt_llm/thop/moeOp.cpp (10 hunks)
  • cpp/tensorrt_llm/thop/moeUtilOp.cpp (6 hunks)
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (2 hunks)
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (3 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (4 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py
  • cpp/tensorrt_llm/thop/moeUtilOp.cpp
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/thop/moeUtilOp.cpp
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/thop/moeUtilOp.cpp
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/thop/moeUtilOp.cpp
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
🧠 Learnings (6)
📓 Common learnings
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py
  • cpp/tensorrt_llm/thop/moeUtilOp.cpp
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py
  • cpp/tensorrt_llm/thop/moeUtilOp.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-08T04:10:19.038Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:19.038Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.

Applied to files:

  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
📚 Learning: 2025-08-08T05:10:38.906Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.906Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-08T05:06:31.596Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.596Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (20)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (4)

24-24: LGTM!

The addition of <cstdint> header is appropriate for using standard fixed-width integer types like int64_t.


455-455: LGTM!

The addition of orig_hidden_size parameter after hidden_size is consistent with the PR's objective of threading original dimensions through the MoE operator stack for fused slicing operations.


612-612: LGTM!

The propagation of orig_hidden_size through the runMoe and gemm2 method signatures in CutlassMoeFCRunner is consistent with the interface changes and maintains proper parameter threading.

Also applies to: 647-700


840-842: LGTM!

The addition of orig_hidden_size parameter to BlockScaleFC2 is properly integrated and aligns with the overall MoE operator changes.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (1)

576-577: LGTM!

The passing of both padded hidden_size and original unpadded_hidden_size to the finalize operation correctly enables proper output sizing within the kernel.

tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (2)

466-466: LGTM!

The addition of orig_hidden_size: torch.SymInt parameter to the fake operator signature is properly positioned and typed.


475-477: LGTM!

The output tensor allocation now correctly uses orig_hidden_size for the second dimension, ensuring the output matches the original unpadded dimensions rather than the potentially padded hidden size.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (3)

90-91: Consider using a more descriptive name for clarity

The variable name unpadded_hidden_size clearly indicates its purpose. This is better than original_hidden_size which could be ambiguous in the context of tensor parallelism where "original" might refer to the pre-TP-split size.


425-425: LGTM!

The addition of unpadded_hidden_size parameter to the fused_moe operator correctly propagates the original dimensions to the kernel for proper output sizing.


406-406: Confirm swizzled_input_sf behavior in fused_moe_cutlass

I traced the defaults and usage across layers:

• In torch_custom_ops.py, swizzled_input_sf: bool = True
• In fused_moe_wide_ep.py, default passed is False
• In fused_moe_cutlass.py, default passed is False
• In the CUTLASS kernel signature, the default is swizzled_input_sf = true, and the code path under if (swizzled_input_sf) applies a “swizzled” offset layout, while the else path treats input_sf as a flat, contiguous array.

By flipping to False in fused_moe_cutlass.py, you’ll hit the contiguous‐layout branch. Please verify:

  • That the input_sf tensor you pass in this path is laid out contiguously (per-expert scales in row-major order), not in the swizzled pattern.
  • That downstream code (and any existing quantization tests) covers this non-swizzled branch to avoid silent scaling errors.

Consider either unifying the default across the Python wrappers or adding a comment/parameter doc to explain when to use each mode, and include a unit test for the non-swizzled case.

cpp/tensorrt_llm/thop/moeUtilOp.cpp (4)

237-237: LGTM!

The addition of orig_hidden_size parameter to runMoEFinalizeScaleOp and its propagation through the kernel launcher is correctly implemented.

Also applies to: 244-245, 252-253, 258-258


282-282: LGTM!

The output tensor allocation now correctly uses orig_hidden_size for the second dimension, ensuring proper output sizing for unpadded data.


296-296: LGTM!

The propagation of orig_hidden_size through all three data type paths (float32, bfloat16, half) is consistent and correct.

Also applies to: 308-308, 319-319


343-344: LGTM!

The updated TORCH_LIBRARY_FRAGMENT registration correctly includes the new SymInt orig_hidden_size parameter in the function signature.

cpp/tensorrt_llm/thop/moeOp.cpp (2)

374-376: Output tensor shape switch to orig_hidden_size looks good

Switching the output shape to {num_rows, orig_hidden_size} and {num_rows * num_experts_on_rank, orig_hidden_size} aligns with the slicing objective.

Also applies to: 524-526


395-413: All runMoe and gemm2 invocations now include orig_hidden_size

• In moeOp.cpp, runMoe(...) calls at lines 386, 400, 545, and 559 pass orig_hidden_size.
• In mixtureOfExpertsPlugin.cpp, runMoe(...) calls at lines 961 and 974 pass mExpertHiddenSize for both padded and unpadded sizes.
• In cutlass_kernels/include/moe_kernels.h, the gemm2 override signature and its call to Self::gemm2(…, hidden_size, orig_hidden_size, inter_size, …) have been updated accordingly.

No further call sites are missing the new parameter.

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (4)

1799-1805: Finalize kernel: padded/original column handling is correct

  • Using padded_cols for strides and orig_cols for iteration bounds prevents OOB while keeping coalesced reads.
  • Bias pointer stride based on padded_cols is consistent.

Also applies to: 1818-1822, 1836-1836, 1855-1862


1989-1992: Launcher: dual-dimension plumb-through is correct

The launcher now supplies both padded_cols and orig_cols to the kernels in both A2A and non-A2A paths. Looks good.

Also applies to: 2021-2023, 2035-2037


2998-3003: BlockScaleFC2: finalize with padded vs original columns wired correctly

Forwarding both hidden_size and orig_hidden_size to the finalize launcher ensures sliced outputs with padded GEMM strides.

Also applies to: 3011-3016


2050-2055: Template instantiations updated for new launcher signature

Macro instantiation reflects the new parameter order (padded vs orig). Good.

@bobboli
Copy link
Collaborator Author

bobboli commented Aug 13, 2025

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15166 [ run ] triggered by Bot

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🔭 Outside diff range comments (3)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h (1)

65-71: Add sanity checks for padded vs. unpadded columns in finalizeMoeRoutingKernelLauncher

We verified that the signature is consistent across the header, CU implementation, and all call sites (no legacy single-cols calls remain). To guard against misconfiguration when fusing slicing into the final kernel, insert explicit checks at the top of the implementation:

• File:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
in the body of template<…> void finalizeMoeRoutingKernelLauncher(...) (around line 1985)

• Add:

// Sanity checks for new padded/unpadded API
TORCH_CHECK(padded_cols >= unpadded_cols,
            "finalizeMoeRoutingKernelLauncher: padded_cols (", padded_cols,
            ") must be >= unpadded_cols (", unpadded_cols, ")");
TORCH_CHECK(unpadded_cols > 0,
            "finalizeMoeRoutingKernelLauncher: unpadded_cols must be > 0");
// Optional: enforce vector‐load alignment, e.g. for float4 loads:
// TORCH_CHECK(unpadded_cols % 4 == 0,
//             "unpadded_cols must be a multiple of 4 for vectorized loads");

This prevents silent errors when callers pass mismatched dimensions.

cpp/tensorrt_llm/thop/moeOp.cpp (2)

374-399: Guard fused-finalize requires non-null router scales (token_final_scales) to avoid kernel faults

When fused finalize is enabled in the CUTLASS path, router scales must be non-null. Enforce this precondition on the host path to avoid undefined behavior.

Apply this diff:

 // setRunnerProfiles(profile_ids);
 auto stream = at::cuda::getCurrentCUDAStream(input.get_device());
 
+if (mUseFusedFinalize) {
+    TORCH_CHECK(token_final_scales.has_value(),
+        "token_final_scales must be provided when fused finalize is enabled.");
+}

544-558: Min-latency path: enforce non-null router scales if fused finalize is active

Same rationale as the regular path; prevent device-side faults caused by missing scales when finalization is fused.

Apply this diff:

 kernels::MoeMinLatencyParams min_latency_params{};
 min_latency_params.num_active_experts_per_node = static_cast<int*>(num_active_experts_per_node.data_ptr());
 ...
 
+if (mUseFusedFinalize) {
+    TORCH_CHECK(token_final_scales.has_value(),
+        "token_final_scales must be provided when fused finalize is enabled.");
+}
♻️ Duplicate comments (5)
cpp/tensorrt_llm/thop/moeOp.cpp (2)

319-323: Add bounds/alignment checks for unpadded_hidden_size before use (runMoe)

Fail fast if the host passes an invalid unpadded size; this prevents device-side faults and misaligned vectorization in finalize.

Apply this diff:

 int64_t hidden_size = fc2_expert_weights.sizes()[1];
 int64_t unpadded_hidden_size_val = unpadded_hidden_size.has_value() ? unpadded_hidden_size.value() : hidden_size;
+TORCH_CHECK(unpadded_hidden_size_val > 0 && unpadded_hidden_size_val <= hidden_size,
+    "unpadded_hidden_size must be in (0, hidden_size]. Got ",
+    unpadded_hidden_size_val, " vs hidden_size ", hidden_size, ".");
+// Kernels vectorize on at least 128 bits per thread. Enforce minimal alignment used by finalize (multiple of 4).
+TORCH_CHECK(unpadded_hidden_size_val % 4 == 0,
+    "unpadded_hidden_size (", unpadded_hidden_size_val, ") must be divisible by 4 to satisfy vectorized finalize.");

485-489: Add bounds/alignment checks for unpadded_hidden_size before use (runMoeMinLantency)

Mirror the validation in the min-latency path to keep behavior consistent.

Apply this diff:

 int64_t hidden_size = fc2_expert_weights.sizes()[1];
 int64_t unpadded_hidden_size_val = unpadded_hidden_size.has_value() ? unpadded_hidden_size.value() : hidden_size;
+TORCH_CHECK(unpadded_hidden_size_val > 0 && unpadded_hidden_size_val <= hidden_size,
+    "unpadded_hidden_size must be in (0, hidden_size]. Got ",
+    unpadded_hidden_size_val, " vs hidden_size ", hidden_size, ".");
+TORCH_CHECK(unpadded_hidden_size_val % 4 == 0,
+    "unpadded_hidden_size (", unpadded_hidden_size_val, ") must be divisible by 4 to satisfy vectorized finalize.");
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (1)

908-944: Bug: mExpertOrigHiddenSize is never initialized in profiler; GEMM2 profiling sees 0

Initialize mExpertOrigHiddenSize alongside mExpertHiddenSize to a sane default (hidden_size) to avoid zero-width finalization during profiling.

Apply this diff:

@@ void init(CutlassMoeFCRunnerInterface& runner, GemmToProfile gemm_to_profile, nvinfer1::DataType dtype,
-        mExpertHiddenSize = hidden_size;
+        mExpertHiddenSize = hidden_size;
+        // Default to padded hidden size unless caller overrides with actual unpadded/original.
+        mExpertOrigHiddenSize = hidden_size;
         mExpertInterSize = inter_size; // Already divided by tp_size

Also applies to: 963-964

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (2)

3277-3279: Potential fused finalize epilogue memory safety issue remains.

Based on the past review comment, there's still a concern about the fused finalize epilogue potentially writing beyond orig_hidden_size when setFinalizeFusionParams receives hidden_size. The comment suggests this needs to be addressed by either:

  1. Extending the API to accept both padded and original column counts
  2. Using a temporary buffer approach

4799-4802: Profiler: gemm2 now passes mExpertOrigHiddenSize; ensure it's initialized.

This depends on mExpertOrigHiddenSize being set in GemmProfilerBackend::init. See my fix suggested in moe_kernels.h.

🧹 Nitpick comments (1)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (1)

647-701: Consider passing MOEParallelismConfig by const reference in virtuals/overrides

Large trivially-copyable structs can still incur copies; passing by const& reduces churn across this hot path. This is optional and can be deferred.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f6d9323 and 1d06628.

📒 Files selected for processing (7)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (9 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (22 hunks)
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h (2 hunks)
  • cpp/tensorrt_llm/thop/moeOp.cpp (10 hunks)
  • cpp/tensorrt_llm/thop/moeUtilOp.cpp (6 hunks)
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/thop/moeUtilOp.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/thop/moeUtilOp.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/thop/moeUtilOp.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/thop/moeUtilOp.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
🧠 Learnings (5)
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/thop/moeUtilOp.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/thop/moeUtilOp.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-08T04:10:19.038Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:19.038Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.

Applied to files:

  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/thop/moeUtilOp.cpp
📚 Learning: 2025-08-08T05:10:38.906Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.906Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-08T05:06:31.596Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.596Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (15)
cpp/tensorrt_llm/thop/moeUtilOp.cpp (3)

241-246: Forwarding order to launcher looks correct; keep padded/unpadded semantics consistent

Passing (hidden_size, unpadded_hidden_size) aligns with the new launcher’s (padded_cols, unpadded_cols) expectation. No changes needed here after adding the early validation above.

If you want additional certainty, run the repository-wide grep (previous comment) to confirm no stale call sites remain.


341-347: All Python wrappers updated with new unpadded_hidden_size parameter

Both fused_moe_deepgemm.py and fused_moe_cute_dsl.py now call torch.ops.trtllm.moe_finalize_scale_op with 17 arguments (including the inserted unpadded_hidden_size in the 11th position), matching the C++ schema. No legacy 12-arg invocations remain.


282-282: Confirm downstream consumers handle unpadded width

Please verify that all downstream users of moe_finalize_scale_op—including the two Python wrappers—are updated to expect the new output shape {num_rows, unpadded_hidden_size} instead of the padded width.

Key callsites to review:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (lines 564–573)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py (lines 227–236)

Check any tests or TorchScript modules that consume final_hidden_states and adjust shape assertions or downstream tensor operations accordingly.

cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (2)

24-24: Include is appropriate for the new int64_t API surface

Good addition to make 64-bit usage explicit and portable.


455-460: No mismatches found: unpadded_hidden_size is consistently declared and passed

All public/virtual/static methods now include unpadded_hidden_size in their signatures, and every callsite passes it immediately after hidden_size. No parameter order or naming inconsistencies were detected across:

  • C++ headers (moe_kernels.h variants)
  • Kernel implementations (moe_kernels.cu, moeUtilOp.cpp, moeOp.cpp)
  • Python/C++ custom-op bindings and tests

No changes required.

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (10)

1803-1804: Pass both padded and original column counts to enable proper memory bounds checking.

The function signature now correctly accepts both padded_cols (for reading from padded GEMM output) and unpadded_cols (for writing to the original output size). This addresses the slicing optimization mentioned in the PR title.


1820-1822: Correct loop bounds calculation using original columns.

The loop now correctly uses num_elems_in_orig_col derived from unpadded_cols for the write loop bounds, preventing out-of-bounds writes. This is the core fix for the slicing optimization.


1855-1855: Ensure bias indexing matches padded column layout.

The bias pointer calculation uses num_elems_in_padded_col which is correct since bias tensors are typically allocated with the same padding as the GEMM output tensors.

Also applies to: 1861-1861


1883-1888: Same padding-aware parameter additions to NoFilling variant.

The NoFilling kernel variant correctly receives the same padded_cols and unpadded_cols parameters, maintaining consistency across both finalization paths.


1940-1943: Guard against out-of-bounds writes in padded dimension iteration.

The conditional check if (elem_index >= num_elems_in_orig_col) continue; correctly prevents writes beyond the original column boundaries when iterating over the padded dimension. This is essential for memory safety.


1962-1962: Verify bias indexing consistency in NoFilling kernel.

Similar to the regular finalize kernel, bias access uses num_elems_in_padded_col which should be consistent with how bias tensors are allocated and accessed throughout the MOE pipeline.

Also applies to: 1969-1969


1989-1991: Update all kernel launcher call sites with new padding parameters.

All call sites to finalizeMoeRoutingKernelLauncher now correctly pass both padded_cols and unpadded_cols parameters. The parameter order and types appear consistent.

Also applies to: 2021-2022, 2035-2036


2998-3000: DeepSeek block scale functions updated with original hidden size parameter.

The BlockScaleFC2 function signature now includes unpadded_hidden_size parameter and correctly passes it through to the finalization kernels. This maintains consistency with the slicing optimization.

Also applies to: 3014-3015


3554-3559: runMoe function signature properly extended with unpadded_hidden_size parameter.

The main MOE entry point now correctly accepts unpadded_hidden_size as a separate parameter after hidden_size, enabling the slicing optimization throughout the MOE pipeline.


3720-3724: All gemm2 calls consistently pass unpadded_hidden_size parameter.

Both the min_latency_mode and regular execution paths correctly thread the unpadded_hidden_size parameter through to the gemm2 function calls, ensuring consistent slicing behavior.

Also applies to: 3827-3830

@bobboli
Copy link
Collaborator Author

bobboli commented Aug 13, 2025

/bot kill

@bobboli
Copy link
Collaborator Author

bobboli commented Aug 13, 2025

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15171 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15166 [ run ] completed with state ABORTED

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15172 [ kill ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15171 [ run ] completed with state ABORTED

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15172 [ kill ] completed with state SUCCESS
Successfully killed previous jobs for commit fde3f32

@bobboli
Copy link
Collaborator Author

bobboli commented Aug 14, 2025

/bot run --disable-fail-fast --add-multi-gpu-test

bobboli and others added 9 commits August 22, 2025 15:08
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Sergey Klevtsov <sklevtsov@nvidia.com>
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
@bobboli bobboli force-pushed the cutlass_fuse_slicing branch from b2a9b93 to 69164a4 Compare August 22, 2025 07:08
@bobboli
Copy link
Collaborator Author

bobboli commented Aug 22, 2025

/bot run --reuse-test

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16147 [ run ] triggered by Bot

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (1)

998-1009: Non-OSS runMoe arity mismatch (extra hidden_size and enable_alltoall arg)

Internal Cutlass runMoe takes only one hidden_size and does not accept enable_alltoall; current call will fail to compile on the non-OSS path.

Apply this diff:

-            mExpertBias2 + mExpertBias2Size * mBufferIndex, mQuantParams[mBufferIndex], mTotalTokens, mHiddenSize,
-            mHiddenSize, mInterSize, mNumExperts, mK, mWorkspace + mWorkspaceSize * mBufferIndex,
-            mFinalOutput + mFinalOutputSize * mBufferIndex,
-            mSourceToExpandedMap + mSourceToExpandedMapSize * mBufferIndex, parallelism_config,
-            /*enable_alltoall=*/false, mUseLora, mLoraParams[mBufferIndex],
+            mExpertBias2 + mExpertBias2Size * mBufferIndex, mQuantParams[mBufferIndex], mTotalTokens,
+            /*hiddenSize=*/mHiddenSize, mInterSize, mNumExperts, mK, mWorkspace + mWorkspaceSize * mBufferIndex,
+            mFinalOutput + mFinalOutputSize * mBufferIndex,
+            mSourceToExpandedMap + mSourceToExpandedMapSize * mBufferIndex, parallelism_config,
+            mUseLora, mLoraParams[mBufferIndex],
♻️ Duplicate comments (9)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (2)

986-997: OSS runMoe: disambiguate padded vs. unpadded hidden sizes

The call passes mHiddenSize twice. Annotate which is padded vs unpadded for readability and future padding support in this fixture.

-                mExpertBias2 + mExpertBias2Size * mBufferIndex, mQuantParams[mBufferIndex], mTotalTokens, mHiddenSize,
-                mHiddenSize, mInterSize, mNumExperts, mK, mWorkspace + mWorkspaceSize * mBufferIndex,
+                mExpertBias2 + mExpertBias2Size * mBufferIndex, mQuantParams[mBufferIndex], mTotalTokens,
+                /*paddedHiddenSize=*/mHiddenSize, /*unpaddedHiddenSize=*/mHiddenSize,
+                mInterSize, mNumExperts, mK, mWorkspace + mWorkspaceSize * mBufferIndex,

708-718: Non-OSS GemmProfilerBackend::init arity mismatch; OSS path should label padded vs. unpadded hidden sizes

  • Non-OSS: internal init does not take two hidden_size args nor enable_alltoall; current call will not compile.
  • OSS: passing mHiddenSize twice is ambiguous; annotate which is padded vs unpadded.

Apply this diff:

 #ifdef USING_OSS_CUTLASS_MOE_GEMM
-        mGemmProfilerBackend.init(mMoERunner, GemmProfilerBackend::GemmToProfile::Undefined, typeToDtypeID<DataType>(),
-            typeToDtypeID<WeightType>(), typeToDtypeID<OutputType>(), mNumExperts, mK, mHiddenSize, mHiddenSize,
-            mInterSize, mGroupSize, mActType, mUseBias, mUseLora, /*min_latency_mode=*/false,
-            /*need_weights=*/false, parallelism_config, /*enable_alltoall=*/false);
+        mGemmProfilerBackend.init(mMoERunner, GemmProfilerBackend::GemmToProfile::Undefined, typeToDtypeID<DataType>(),
+            typeToDtypeID<WeightType>(), typeToDtypeID<OutputType>(), mNumExperts, mK,
+            /*paddedHiddenSize=*/mHiddenSize, /*unpaddedHiddenSize=*/mHiddenSize,
+            mInterSize, mGroupSize, mActType, mUseBias, mUseLora, /*min_latency_mode=*/false,
+            /*need_weights=*/false, parallelism_config, /*enable_alltoall=*/false);
 #else
-        mGemmProfilerBackend.init(mMoERunner, GemmProfilerBackend::GemmToProfile::Undefined, typeToDtypeID<DataType>(),
-            typeToDtypeID<WeightType>(), typeToDtypeID<OutputType>(), mNumExperts, mK, mHiddenSize, mHiddenSize,
-            mInterSize, mGroupSize, mActType, mUseBias, mUseLora, /*min_latency_mode=*/false,
-            /*need_weights=*/false, parallelism_config);
+        mGemmProfilerBackend.init(mMoERunner, GemmProfilerBackend::GemmToProfile::Undefined, typeToDtypeID<DataType>(),
+            typeToDtypeID<WeightType>(), typeToDtypeID<OutputType>(), mNumExperts, mK,
+            /*hiddenSize=*/mHiddenSize,
+            mInterSize, mGroupSize, mActType, mUseBias, mUseLora, /*min_latency_mode=*/false,
+            /*need_weights=*/false, parallelism_config);
 #endif
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)

12-12: Fix ceil_div import to prevent ImportError at runtime

ceil_div is defined under quantization/utils/fp8_utils.py, not in ...utils.

-from ...utils import AuxStreamType, EventType, Fp4QuantizedTensor, ceil_div
+from ...utils import AuxStreamType, EventType, Fp4QuantizedTensor
+from ...quantization.utils.fp8_utils import ceil_div
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (1)

2065-2072: Gate GptOss120b test to OSS path to avoid false failures on non-OSS builds

This scenario relies on OSS epilogue honoring unpadded stride; skip when USING_OSS_CUTLASS_MOE_GEMM is not defined.

     TYPED_TEST(MixtureOfExpertsTest, ParallelismType##GptOss120b)                                                      \
     {                                                                                                                  \
+    /* Only meaningful with the OSS Cutlass MoE path (unpadded stride support) */                                      \
+    #ifndef USING_OSS_CUTLASS_MOE_GEMM                                                                                 \
+        GTEST_SKIP() << "Skipping GptOss120b: requires USING_OSS_CUTLASS_MOE_GEMM";                                    \
+        return;                                                                                                        \
+    #endif                                                                                                             \
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (1)

459-468: API: runMoe now threads unpadded_hidden_size — verify all call sites are updated

The signature change looks consistent (unpadded_hidden_size after hidden_size, plus enable_alltoall). Please re-verify that all C++/Py bindings (plugins, THOPs, unit tests, micro-benchmarks) call this overload on OSS builds.

Run to spot any stale call sites still using the old arity:

#!/bin/bash
set -euo pipefail
echo "Searching runMoe(...) call sites"
rg -nP --type=cpp --type=cu --type=cuh -C3 '\brunMoe\s*\('
cpp/tensorrt_llm/thop/moeUtilOp.cpp (1)

252-259: Validate unpadded_hidden_size early (bounds)

Add fast-fail host-side checks to prevent device-side asserts when unpadded_hidden_size is invalid.

     int64_t num_rows = num_rows_param.guard_int(__FILE__, __LINE__);
     int64_t hidden_size = hidden_size_param.guard_int(__FILE__, __LINE__);
     int64_t unpadded_hidden_size = unpadded_hidden_size_param.guard_int(__FILE__, __LINE__);
+    TORCH_CHECK(
+        unpadded_hidden_size > 0 && unpadded_hidden_size <= hidden_size,
+        "unpadded_hidden_size must be in (0, hidden_size]. Got ",
+        unpadded_hidden_size, " vs hidden_size ", hidden_size, ".");
cpp/tensorrt_llm/thop/moeOp.cpp (2)

330-333: Validate unpadded_hidden_size_val in runMoe

Add host-side bounds checks to catch bad inputs before launching kernels.

-        int64_t unpadded_hidden_size_val
-            = unpadded_hidden_size.has_value() ? unpadded_hidden_size.value() : hidden_size;
+        int64_t unpadded_hidden_size_val =
+            unpadded_hidden_size.has_value() ? unpadded_hidden_size.value() : hidden_size;
+        TORCH_CHECK(
+            unpadded_hidden_size_val > 0 && unpadded_hidden_size_val <= hidden_size,
+            "unpadded_hidden_size must be in (0, hidden_size]. Got ",
+            unpadded_hidden_size_val, " vs hidden_size ", hidden_size, ".");
+        // Optional (if you want to enforce minimal vectorization constraints used by finalize kernels):
+        // TORCH_CHECK(unpadded_hidden_size_val % 4 == 0, "unpadded_hidden_size (",
+        //     unpadded_hidden_size_val, ") must be divisible by 4.");

505-507: Validate unpadded_hidden_size_val in runMoeMinLantency

Mirror the same bounds checks here to keep both paths consistent.

-        int64_t unpadded_hidden_size_val
-            = unpadded_hidden_size.has_value() ? unpadded_hidden_size.value() : hidden_size;
+        int64_t unpadded_hidden_size_val =
+            unpadded_hidden_size.has_value() ? unpadded_hidden_size.value() : hidden_size;
+        TORCH_CHECK(
+            unpadded_hidden_size_val > 0 && unpadded_hidden_size_val <= hidden_size,
+            "unpadded_hidden_size must be in (0, hidden_size]. Got ",
+            unpadded_hidden_size_val, " vs hidden_size ", hidden_size, ".");
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (1)

1807-1809: Replace hard-coded alignment “4” with FINALIZE_ELEM_PER_THREAD.

The finalize kernels still assert 4-element alignment, which fails for 16-bit data types (half/bf16) that load 8 elements per 128-bit lane. You already compute FINALIZE_ELEM_PER_THREAD; use it in the asserts for correctness across types.

Apply:

@@
-    assert(padded_cols % 4 == 0);
-    assert(unpadded_cols % 4 == 0);
+    assert(padded_cols   % FINALIZE_ELEM_PER_THREAD == 0);
+    assert(unpadded_cols % FINALIZE_ELEM_PER_THREAD == 0);

Do the same in finalizeMoeRoutingNoFillingKernel.

Also applies to: 1887-1889

🧹 Nitpick comments (11)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)

345-346: Wrap long comment to satisfy Ruff E501 (line > 120 chars)

Minor style fix to keep CI linting green.

-        is_sf_swizzled = True  # In case of post-quant communication, scaling factors will not be swizzled before communication, and swizzling after communication is merged into MoE.
+        # In case of post-quant communication, scaling factors will not be swizzled before communication,
+        # and swizzling after communication is merged into MoE.
+        is_sf_swizzled = True
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (4)

484-498: Minor type inconsistency: experts_per_token int vs int64_t across interfaces

Here experts_per_token is int64_t, while most other public methods use int. Prefer one type across the surface (int is sufficient and avoids accidental sign/width drift).

Minimal change (only if you choose to standardize to int here — remember to mirror in definitions/overrides):

-        int const num_experts_per_node, int64_t const experts_per_token, float const** alpha_scale_ptr_array,
+        int const num_experts_per_node, int const experts_per_token, float const** alpha_scale_ptr_array,

759-768: Setup path takes unpadded_hidden_size — consider brief param docs

Since both hidden_size and unpadded_hidden_size are carried, a short comment clarifying “hidden_size = padded cols used for GEMM; unpadded_hidden_size = true output cols” would prevent future misuse.


918-956: Profiler API extended: init stores mExpertUnpaddedHiddenSize

The added field and parameter are wired correctly. One suggestion: if the caller ever passes 0 (legacy paths), default to hidden_size inside init to keep the profiler robust.

Example:

-        mExpertUnpaddedHiddenSize = unpadded_hidden_size;
+        mExpertUnpaddedHiddenSize = (unpadded_hidden_size > 0) ? unpadded_hidden_size : hidden_size;

Also applies to: 966-975


1-15: Header copyright year

Coding guidelines request current year; this file still shows 2020–2023. Consider updating the range to include 2025.

cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h (1)

65-72: FinalizeMoeRoutingKernelLauncher: call‐site parameter order verified; suggest adding documentation

  • Verified that the declaration in
    cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
    and the definition in
    cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
    both take (padded_cols, unpadded_cols) immediately after num_rows.
  • Inspected all call sites—including
    cpp/tensorrt_llm/thop/moeUtilOp.cpp (around line 241) and
    multiple invocations in moe_kernels.cu (around lines 3007, 3331, 3339) —and confirmed they each pass the full column count (e.g. hidden_size) first, then the unpadded count, then experts_per_token. No mismatches were found.
  • Nit: Please add a brief doc‐line above the function declaration in
    moe_util_kernels.h explaining the distinction between “padded_cols” (compute buffer width) vs. “unpadded_cols” (actual data width) to guard against accidental swaps.
cpp/tensorrt_llm/thop/moeUtilOp.cpp (1)

237-246: Caller passes (hidden_size, unpadded_hidden_size) to launcher — order looks correct

This matches the new “padded_cols, unpadded_cols” contract. Consider adding one invariant check to assert gemm2_output’s second dim equals hidden_size to fail fast on shape drift.

@@ void runMoEFinalizeScaleOp(...)
-    cutlass_kernels::finalizeMoeRoutingKernelLauncher<OutputType, UnfusedGemmOutputType>(
+    // Sanity: GEMM2 output should be [expanded_rows, hidden_size] (padded)
+    TORCH_CHECK(
+        gemm2_output && hidden_size > 0, "Invalid GEMM2 output or hidden_size.");
+    // Note: dtype-agnostic sizes, just check the logical shape.
+    // Caller ensures gemm2_output is a 2D tensor with shape [expanded_rows, hidden_size].
+    cutlass_kernels::finalizeMoeRoutingKernelLauncher<OutputType, UnfusedGemmOutputType>(
         static_cast<UnfusedGemmOutputType const*>(gemm2_output), final_output, biases, unpermuted_final_scales,
         unpermuted_row_to_permuted_row, permuted_row_to_unpermuted_row, token_selected_experts,
         expert_first_token_offset, num_rows, hidden_size, unpadded_hidden_size, experts_per_token, num_experts_per_node,
         parallelism_config, enable_alltoall, stream);
cpp/tensorrt_llm/thop/moeOp.cpp (2)

269-299: Duplicate bias validation block — remove repetition

This block duplicates the checks immediately above (Lines 269–283). Drop one copy to reduce maintenance burden.

-        if (fc1_expert_biases.has_value() || fc2_expert_biases.has_value())
-        {
-            CHECK_INPUT(fc1_expert_biases.value(), mOutputDtype);
-            CHECK_INPUT(fc2_expert_biases.value(), mOutputDtype);
-            TORCH_CHECK(fc1_expert_biases.value().dim() == 2, "fc1_expert_biases must be 2D.");
-            TORCH_CHECK(fc2_expert_biases.value().dim() == 2, "fc2_expert_biases must be 2D.");
-            TORCH_CHECK(fc1_expert_weights.sizes()[0] == fc1_expert_biases.value().sizes()[0],
-                "fc1_expert_weights and fc1_expert_biases must have the same number of experts.");
-            TORCH_CHECK(fc2_expert_weights.sizes()[0] == fc2_expert_biases.value().sizes()[0],
-                "fc2_expert_weights and fc2_expert_biases must have the same number of experts.");
-            TORCH_CHECK(fc1_expert_biases.value().sizes()[1] == fc1_expert_weights.sizes()[1],
-                "fc1_expert_biases should match fc1_expert_weights output shape.");
-            TORCH_CHECK(fc2_expert_biases.value().sizes()[1] == fc2_expert_weights.sizes()[1],
-                "fc2_expert_biases should match fc2_expert_weights output shape.");
-        }

646-661: Optional: add a one-liner assert before mProfiler->run to catch missing prepare

If do_preparation was never true, mProfileWorkspace may be null. Consider asserting it's set when calling runProfiler to improve error messages.

@@
-        // Profile specific tactic. Assuming at least one preparation phase has been executed already.
+        // Profile specific tactic. Assuming at least one preparation phase has been executed already.
+        TORCH_CHECK(mProfileWorkspace != nullptr, "runProfiler called before prepare().");
         mProfiler->runProfiler(num_rows, profile, mProfileWorkspace, expert_weights_ptr, stream);

Also applies to: 672-681

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (2)

1930-1932: Avoid per-iteration bounds branch in NoFilling finalize.

Looping to num_elems_in_padded_col and continuing when beyond original width adds divergence. Iterate directly to num_elems_in_orig_col as in the other finalize kernel.

@@
-        int64_t const num_elems_in_padded_col = padded_cols / FINALIZE_ELEM_PER_THREAD;
-        int64_t const num_elems_in_orig_col = unpadded_cols / FINALIZE_ELEM_PER_THREAD;
+        int64_t const num_elems_in_padded_col = padded_cols   / FINALIZE_ELEM_PER_THREAD;
+        int64_t const num_elems_in_orig_col   = unpadded_cols / FINALIZE_ELEM_PER_THREAD;
@@
-        for (int elem_index = start_offset; elem_index < num_elems_in_padded_col; elem_index += stride)
+        for (int elem_index = start_offset; elem_index < num_elems_in_orig_col; elem_index += stride)
         {
-            if (elem_index >= num_elems_in_orig_col)
-                continue; // Skip writing beyond original columns

Also applies to: 1941-1945


2046-2048: Unify parameter naming: “actual_cols” → “unpadded_cols”.

Elsewhere you use “unpadded_cols”. The explicit instantiation macro still says “actual_cols”, which hurts grep-ability and violates our consistency guideline.

-        int64_t const actual_cols, int64_t const experts_per_token, int64_t const num_experts_per_node,                \
+        int64_t const unpadded_cols, int64_t const experts_per_token, int64_t const num_experts_per_node,              \
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between b2a9b93 and 69164a4.

📒 Files selected for processing (18)
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (3 hunks)
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp (6 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (12 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.inl (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (27 hunks)
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h (2 hunks)
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (2 hunks)
  • cpp/tensorrt_llm/thop/moeOp.cpp (10 hunks)
  • cpp/tensorrt_llm/thop/moeUtilOp.cpp (6 hunks)
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (8 hunks)
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (2 hunks)
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (7 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (7 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.inl
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/thop/moeUtilOp.cpp
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/thop/moeUtilOp.cpp
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/thop/moeUtilOp.cpp
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/thop/moeUtilOp.cpp
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h
🧠 Learnings (13)
📓 Common learnings
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • cpp/tensorrt_llm/thop/moeUtilOp.cpp
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-08T04:10:19.038Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:19.038Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.

Applied to files:

  • cpp/tensorrt_llm/thop/moeUtilOp.cpp
  • cpp/tensorrt_llm/thop/moeOp.cpp
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • cpp/tensorrt_llm/thop/moeUtilOp.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-20T07:43:36.447Z
Learnt from: ChristinaZ
PR: NVIDIA/TensorRT-LLM#7068
File: cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh:169-172
Timestamp: 2025-08-20T07:43:36.447Z
Learning: In TensorRT-LLM MOE kernels, when processing up to 128 experts across 32 threads, each thread handles at most 4 experts (N < 5 constraint), where N represents candidates per thread rather than total system capacity.

Applied to files:

  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-08T05:06:31.596Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.596Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-21T21:48:35.105Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:399-417
Timestamp: 2025-08-21T21:48:35.105Z
Learning: CUTLASS extensions in TensorRT-LLM (located under cpp/tensorrt_llm/cutlass_extensions/) are designed to integrate with and extend functionality in the external CUTLASS repository. When analyzing these extensions, their consumers and functionality wiring may exist in the CUTLASS codebase rather than within TensorRT-LLM itself.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-21T02:39:11.984Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:11.984Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-08T05:10:38.906Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.906Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-17T15:07:01.420Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#6968
File: cpp/tensorrt_llm/thop/loraOp.cpp:133-141
Timestamp: 2025-08-17T15:07:01.420Z
Learning: In TensorRT-LLM's LoRA implementation, the LoraImpl::run() method handles setStream() internally in _runGemm() (line 51 in lora.cpp), along with setWorkspace(). The stream parameter flows from loraOp.cpp through LoraImpl::run() to _runGemm() where setStream() is called appropriately. Adding setStream() in loraOp.cpp would be redundant and goes against the intended architectural design.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-17T15:07:01.420Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#6968
File: cpp/tensorrt_llm/thop/loraOp.cpp:133-141
Timestamp: 2025-08-17T15:07:01.420Z
Learning: In TensorRT-LLM's LoRA implementation, the LoraImpl::run() method handles setStream() internally in _runGemm(), along with setWorkspace(). Both stream and workspace are passed as arguments to run(), so there's no need to call setStream() explicitly in loraOp.cpp - this avoids redundancy and follows the intended architectural separation.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
🧬 Code graph analysis (8)
cpp/tensorrt_llm/thop/moeUtilOp.cpp (2)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (5)
  • parallelism_config (983-1090)
  • parallelism_config (983-983)
  • parallelism_config (1168-1256)
  • parallelism_config (1168-1168)
  • stream (831-843)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (2)
  • finalizeMoeRoutingKernelLauncher (1986-2039)
  • finalizeMoeRoutingKernelLauncher (1986-1992)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (1)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (7)
  • mTotalTokens (540-540)
  • mHiddenSize (378-378)
  • mInterSize (539-539)
  • mNumExperts (379-379)
  • mK (381-381)
  • mWorkspace (515-515)
  • mFinalOutput (537-537)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (3)
tensorrt_llm/_torch/utils.py (1)
  • Fp4QuantizedTensor (97-104)
tensorrt_llm/quantization/utils/fp8_utils.py (1)
  • ceil_div (10-21)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (1)
  • hidden_size (216-216)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (1)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (13)
  • parallelism_config (983-1090)
  • parallelism_config (983-983)
  • parallelism_config (1168-1256)
  • parallelism_config (1168-1168)
  • stream (831-843)
  • k (1513-1514)
  • k (1518-1529)
  • k (1518-1519)
  • k (1533-1541)
  • k (1533-1534)
  • k (1543-1555)
  • k (1543-1544)
  • k (1557-1558)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (4)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (6)
  • typeToDtypeID (2254-2289)
  • typeToDtypeID (2254-2254)
  • parallelism_config (983-1090)
  • parallelism_config (983-983)
  • parallelism_config (1168-1256)
  • parallelism_config (1168-1168)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (4)
  • mNumExperts (970-970)
  • mK (972-972)
  • mGroupSize (976-976)
  • mUseLora (994-994)
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h (4)
  • mNumExperts (922-922)
  • mK (924-924)
  • mGroupSize (927-927)
  • mUseLora (944-944)
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h (4)
  • mNumExperts (174-174)
  • mGroupSize (179-179)
  • mUseBias (186-186)
  • mUseLora (202-202)
cpp/tensorrt_llm/thop/moeOp.cpp (2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)
  • enable_alltoall (193-200)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (1)
  • hidden_size (216-216)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (3)
cpp/tensorrt_llm/thop/moeOp.cpp (2)
  • num_rows (760-780)
  • num_rows (760-762)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (13)
  • parallelism_config (983-1090)
  • parallelism_config (983-983)
  • parallelism_config (1168-1256)
  • parallelism_config (1168-1168)
  • stream (831-843)
  • k (1513-1514)
  • k (1518-1529)
  • k (1518-1519)
  • k (1533-1541)
  • k (1533-1534)
  • k (1543-1555)
  • k (1543-1544)
  • k (1557-1558)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (2)
  • mExpertUnpaddedHiddenSize (974-974)
  • mK (972-972)
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h (1)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (13)
  • k (1513-1514)
  • k (1518-1529)
  • k (1518-1519)
  • k (1533-1541)
  • k (1533-1534)
  • k (1543-1555)
  • k (1543-1544)
  • k (1557-1558)
  • parallelism_config (983-1090)
  • parallelism_config (983-983)
  • parallelism_config (1168-1256)
  • parallelism_config (1168-1168)
  • stream (831-843)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py

345-345: Line too long (182 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (23)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (1)

985-1009: The script above will print the full runMoe signatures in both the OSS and internal headers so we can confirm the arity differences before updating the call sites. Once we see the exact parameter lists, we can adjust the benchmark and test calls under each macro to match the corresponding header.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (2)

91-93: Good: capture unpadded hidden size before padding

Storing self.unpadded_hidden_size ahead of W4A16 padding is correct and avoids downstream slicing in Python.


410-442: Confirm C++ registration of fused_moe includes new keyword args

I wasn’t able to locate any m.def("fused_moe(...)" or corresponding schema in the C++ extension (e.g. in moeOp.cpp under TORCH_LIBRARY(trtllm, m)). Without those, passing swizzled_input_sf and unpadded_hidden_size will trigger a runtime TypeError.

Please:

  • Verify that fused_moe is registered in C++ (look for m.def("fused_moe() or a TORCH_LIBRARY block in cpp/tensorrt_llm/thop/moeOp.cpp).
  • Ensure its schema signature explicitly lists
    bool swizzled_input_sf
    int64_t unpadded_hidden_size
  • Confirm the FusedMoeRunner::operator() method matches those parameters.

If the schema is missing these kwargs, let me know—I can draft the C++ registration update to include them.

cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (4)

195-196: Add unpadded hidden size test knob

Introducing mUnpaddedHiddenSize (and resetting it in TearDown) is a clean way to validate unpadded output semantics.


1243-1246: OSS runMoe: correctly forwards unpaddedHiddenSize when set

The ternary forwarding ensures backward compatibility when unpadded size is unset.


1479-1480: Validation uses unpadded size when provided

Good change; aligns test checks with kernel’s stride semantics.


1504-1508: Assertion indexes into final tensor using unpadded width

Correct and necessary when hidden is padded in kernel but logical width is smaller.

cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h (3)

25-25: Include is appropriate

Required for fixed-width integer usage in new signatures; good addition.


573-581: AI summary inconsistent with code: runMoe did not add unpadded_hidden_size in internal header

The summary claims runMoe includes unpadded_hidden_size and expanded_num_rows here, but this header’s runMoe still takes (num_rows, hidden_size, inter_size, ...). Please correct either the code or the summary.


801-804: All BlockScaleFC2 call sites have been updated

I’ve verified that the only invocation of BlockScaleFC2 in moe_kernels.cu now includes the new unpadded_hidden_size (and the additional enable_alltoall flag), matching the updated signature. No remaining mismatches were found.

cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (3)

617-626: Override matches interface — good propagation of unpadded_hidden_size

The override order and constness align with the interface, reducing chances of ODR/signature mismatches. LGTM.


646-661: Static gemm2: consistent placement of unpadded_hidden_size

The static helper mirrors the interface ordering; this minimizes caller confusion and avoids mis-threading K/N dims. LGTM.


849-858: Confirmed unpadded_hidden_size usage in BlockScaleFC2

Verified that BlockScaleFC2 in moe_kernels.cu uses hidden_size solely for loads/compute and unpadded_hidden_size for final-write bounds. No changes required.

• In the kernel body (moe_kernels.cu:2993–3002),
shape_n is set from hidden_size for the GEMM compute.
• finalizeMoeRoutingKernelLauncher is called with both parameters—using hidden_size for compute and unpadded_hidden_size for the write bounds (moe_kernels.cu:3013–3018).
• Any buffer zeroing (cudaMemsetAsync) and TMA epilogue fusion both leverage unpadded_hidden_size for final_output sizing.

cpp/tensorrt_llm/thop/moeUtilOp.cpp (2)

282-283: Output tensor shape switched to unpadded — good

Allocating {num_rows, unpadded_hidden_size} aligns with the sliced writeback semantics. LGTM.


332-347: All Python wrappers updated with new unadded_hidden_size argument
A repository-wide search confirms that both fused-MoE modules pass the newly added unadded_hidden_size into torch.ops.trtllm.moe_finalize_scale_op:

  • In fused_moe_cute_dsl.py (line 229), self.unpadded_hidden_size is provided as the 11th argument.
  • In fused_moe_deepgemm.py (line 566), self.unpadded_hidden_size is likewise passed.

No other Python callers invoke this op directly, and all existing tests exercise the high-level fused modules rather than calling the op schema. No further changes are required.

cpp/tensorrt_llm/thop/moeOp.cpp (3)

392-397: Output shape switched to {num_rows, unpadded_hidden_size_val} — good

Matches sliced finalization. Ensure downstream consumers read only unpadded cols.


543-545: Min-latency output shape uses unpadded cols — good

Sliced width matches the new semantics. LGTM.


667-674: Profiler init: unpadded parameter passed — good; keep legacy defaulting in callers

The preparation path already maps 0 → hidden_size when invoking init. That keeps compatibility. LGTM.

Also applies to: 671-681

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (5)

4038-4040: Finalize fusion uses num_rows (tokens) as the row dimension — correct.

Passing num_tokens for the row dimension in setFinalizeFusionParams aligns with the swapped/transpose layout of the epilogue; good catch and consistent with prior guidance.

Also applies to: 4656-4656


3013-3018: End-to-end threading of unpadded_hidden_size into finalize launchers looks correct.

Calls to finalizeMoeRoutingKernelLauncher now pass both padded (hidden_size) and unpadded hidden sizes; this aligns with the slicing objective and prevents OOB writes. LGTM.

Also applies to: 3331-3336, 3340-3343


3279-3281: Zeroing final_output for fused finalize is sized to unpadded width — good.

cudaMemsetAsync uses num_rows * unpadded_hidden_size, matching the epilogue’s write extents. This avoids touching padded columns and reduces unnecessary bandwidth.


2021-2024: NoFilling path plumbs expert_first_token_offset and uses (padded, unpadded) correctly.

The launcher passes expert_first_token_offset and both column counts into finalizeMoeRoutingNoFillingKernel; indexing and guards then operate on unpadded width while using padded strides. Solid.


4821-4822: Initialization of mExpertUnpaddedHiddenSize confirmed in init

  • In cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h, within GemmProfilerBackend::init(...) (lines ~914–950), there is an explicit assignment:
    mExpertUnpaddedHiddenSize = unpadded_hidden_size;
    ensuring the field is initialized before any call to runProfiler.

No further changes required.

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16147 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12146 completed with status: 'FAILURE'

@bobboli
Copy link
Collaborator Author

bobboli commented Aug 23, 2025

/bot run --reuse-test

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16258 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16258 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12222 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@bobboli bobboli enabled auto-merge (squash) August 24, 2025 07:22
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)

276-301: Fake fused_moe should return unpadded width when requested

The fake currently shapes outputs with the padded hidden size only. With the upstream change removing post-op slicing, graph building will expect the unpadded width when unpadded_hidden_size is provided. Update the fake to reflect that to keep compile-time shapes consistent.

Apply this diff:

 @torch.library.register_fake("trtllm::fused_moe")
 def _(
@@
-    tuner_num_tokens: Optional[int] = None,
-    tuner_top_k: Optional[int] = None,
-    unpadded_hidden_size: Optional[int] = None,
+    tuner_num_tokens: Optional[int] = None,
+    tuner_top_k: Optional[int] = None,
+    unpadded_hidden_size: Optional[int] = None,
 ):
     seq_len = input.shape[0]
     if use_int8_woq_per_channel:
@@
-        hidden_size = fc2_expert_weights.shape[2]
+        hidden_size = fc2_expert_weights.shape[2]
     else:
         hidden_size = fc2_expert_weights.shape[1]
 
+    # Determine output hidden dimension: prefer unpadded when provided (>0).
+    out_hidden_size = (hidden_size if (unpadded_hidden_size is None
+                                       or int(unpadded_hidden_size) <= 0)
+                       else int(unpadded_hidden_size))
+
     if min_latency_mode:
         num_experts_on_rank = fc2_expert_weights.shape[0]
-        output_shape = [seq_len * num_experts_on_rank, hidden_size]
+        output_shape = [seq_len * num_experts_on_rank, out_hidden_size]
         experts_to_token_score_shape = [num_experts_on_rank, seq_len]
         active_expert_global_ids_shape = [num_experts_on_rank]
         return [
             input.new_empty(output_shape, dtype=output_dtype),
             input.new_empty([1], dtype=torch.int32),
             input.new_empty(experts_to_token_score_shape, dtype=torch.float32),
             input.new_empty(active_expert_global_ids_shape, dtype=torch.int32),
         ]
     else:
-        return [input.new_empty([seq_len, hidden_size], dtype=output_dtype)]
+        return [input.new_empty([seq_len, out_hidden_size], dtype=output_dtype)]
♻️ Duplicate comments (2)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)

122-171: Add docstring + input validation for unpadded_hidden_size

We should document semantics and guard invalid values (<=0, > padded hidden size) at the fused_moe entrypoint. This mirrors the kernel’s expectations and reduces silent shape surprises.

Apply this diff:

 @torch.library.custom_op("trtllm::fused_moe", mutates_args=())
 def fused_moe(
@@
-    unpadded_hidden_size: Optional[int] = None,
+    unpadded_hidden_size: Optional[int] = None,
 ) -> List[torch.Tensor]:
+    """
+    Fused MoE custom op. If unpadded_hidden_size is provided (> 0), outputs are
+    sized to that hidden dimension; otherwise the padded hidden size is used.
+
+    Args:
+        unpadded_hidden_size: Original (unpadded) hidden size to slice output to.
+          Must satisfy 0 < unpadded_hidden_size <= padded hidden size inferred
+          from fc2_expert_weights.
+    """
@@
-    # Only the non-alltoall case is considered for profiling in the warmup phase.
+    # Only the non-alltoall case is considered for profiling in the warmup phase.
@@
     else:
         assert tuner_num_tokens is None
         assert tuner_top_k is None
         tuner_input = input
         tuner_top_k = token_selected_experts.size(1)
+
+    # Sanity-check unpadded_hidden_size if provided.
+    padded_hidden_size = (fc2_expert_weights.shape[2]
+                          if use_int8_woq_per_channel
+                          else fc2_expert_weights.shape[1])
+    if unpadded_hidden_size is not None:
+        if not (0 < int(unpadded_hidden_size) <= int(padded_hidden_size)):
+            raise ValueError(
+                f"unpadded_hidden_size ({unpadded_hidden_size}) "
+                f"must be in (0, {int(padded_hidden_size)}]"
+            )
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)

11-11: Fix ceil_div import to the correct module

ceil_div is defined under quantization/utils/fp8_utils.py, not in _torch/utils.py. Importing from ...utils will raise ImportError.

Apply this diff:

-from ...utils import AuxStreamType, EventType, Fp4QuantizedTensor, ceil_div
+from ...utils import AuxStreamType, EventType, Fp4QuantizedTensor
+from ...quantization.utils.fp8_utils import ceil_div
🧹 Nitpick comments (2)
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1)

418-439: Finalize-scale fake op: honor unpadded_hidden_size with sane fallback and validation

Right now the fake returns shape (num_rows, int(unpadded_hidden_size)) unconditionally. If callers pass 0/None-equivalent to indicate “no slicing,” this will produce a zero-width tensor. Add a fallback to hidden_size when unpadded_hidden_size <= 0 and validate it never exceeds hidden_size.

Apply this diff:

     @torch.library.register_fake("trtllm::moe_finalize_scale_op")
     def _(
@@
         num_rows: torch.SymInt,
         hidden_size: torch.SymInt,
-        unpadded_hidden_size: torch.SymInt,
+        unpadded_hidden_size: torch.SymInt,
         experts_per_token: int,
@@
     ):
-        num_rows_val = int(num_rows)
-        unpadded_hidden_size_val = int(unpadded_hidden_size)
-        return gemm2_output.new_empty((num_rows_val, unpadded_hidden_size_val),
-                                      dtype=gemm2_output.dtype)
+        num_rows_val = int(num_rows)
+        hidden_size_val = int(hidden_size)
+        target_size = int(unpadded_hidden_size)
+        # Allow 0 to mean "no slicing" (use padded hidden size).
+        if target_size <= 0:
+            target_size = hidden_size_val
+        else:
+            assert target_size <= hidden_size_val, (
+                "unpadded_hidden_size must be <= hidden_size"
+            )
+        return gemm2_output.new_empty((num_rows_val, target_size),
+                                      dtype=gemm2_output.dtype)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)

339-386: is_sf_swizzled logic looks correct; shorten the long inline comment

The flag is set to False whenever post-quant communication occurs (alltoall or allgather), which matches the deferred-swizzle design. One nit: the comment at Line 342 exceeds the 120-char limit flagged by Ruff.

Apply this diff:

-        is_sf_swizzled = True  # In case of post-quant communication, scaling factors will not be swizzled before communication, and swizzling after communication is merged into MoE.
+        # Default: scaling factors are already swizzled.
+        # For post-quant communication, we defer swizzling into the MoE op.
+        is_sf_swizzled = True
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 69164a4 and 335a5c8.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (2 hunks)
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (7 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (7 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
📚 Learning: 2025-08-08T04:10:19.038Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:19.038Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.

Applied to files:

  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (2)
tensorrt_llm/_torch/utils.py (1)
  • Fp4QuantizedTensor (97-104)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (1)
  • hidden_size (216-216)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py

342-342: Line too long (182 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (5)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (3)

53-74: Threading unpadded_hidden_size through MoERunner is sound

Using 0 as a sentinel in self.unpadded_hidden_size keeps the Python <-> C++ interface simple and backwards-compatible. Looks good.


189-190: Forward unpadded_hidden_size into MoERunner construction

Good—this ensures the profile and run paths see the same target hidden size.


240-241: Propagate unpadded_hidden_size into run_moe/run_moe_min_latency

Correct propagation to the execution path. No concerns.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (2)

90-92: Good: capture unpadded_hidden_size before any padding

Storing the original hidden size early ensures the op can request unpadded outputs regardless of backend padding. This aligns with the fused op’s new behavior.


393-425: Fused op call: swizzled_input_sf + unpadded_hidden_size propagation LGTM

Passing swizzled_input_sf=is_sf_swizzled and unpadded_hidden_size=self.unpadded_hidden_size matches the new kernel contracts and removes Python-side slicing. Nicely done.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
@bobboli
Copy link
Collaborator Author

bobboli commented Aug 25, 2025

/bot run

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (1)

2073-2095: Guard GPT-OSS-120B test for non-OSS path or neutralize unpadded override

Non-OSS kernels don’t honor an unpadded output stride; combined with the current validation, this can still cause false negatives in non-OSS builds. You have two viable options:

  • Preferable for fidelity: compile-time skip when OSS path is unavailable (keeps the test’s intent intact).
  • Alternative: only set mUnpaddedHiddenSize when OSS is enabled; otherwise leave it as 0 so the test runs with padded stride.

Below shows the skip approach; if you’d rather keep it running everywhere, I can provide the “neutralize override” variant.

 TYPED_TEST(MixtureOfExpertsTest, ParallelismType##GptOss120b)                                                      \
 {                                                                                                                  \
+    /* Only meaningful with the OSS Cutlass MoE path (unpadded stride support) */                                  \
+#ifndef USING_OSS_CUTLASS_MOE_GEMM                                                                                 \
+    GTEST_SKIP() << "Skipping GptOss120b: requires USING_OSS_CUTLASS_MOE_GEMM";                                    \
+    return;                                                                                                        \
+#endif                                                                                                             \
     this->mIsLongTest = true;                                                                                      \
     this->mUseBias = true;                                                                                         \
     this->mActType = ActivationType::Swiglu;                                                                       \
@@
     float inter_size_fraction = float(inter_size) / hidden_size;                                                   \
-    this->mUnpaddedHiddenSize = 2880;                                                                              \
+    this->mUnpaddedHiddenSize = 2880;                                                                              \
@@
     if (!this->checkSufficientTestMemory(75, hidden_size, 128, 4, true))                                           \

Note: Once the compareFinal out_stride fix lands, this test will not mis-index; the guard keeps the intent of validating unpadded-stride output specifically on the OSS path.

🧹 Nitpick comments (1)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (1)

195-196: Add basic sanity checks for mUnpaddedHiddenSize

Good addition. Add a lightweight precondition to ensure the unpadded dimension is never larger than the padded one and is non-negative. This helps catch misconfigurations early in tests.

Example (place right before invoking runMoe in runMoEPermute):

@@
-        QuantParams quant_params;
+        QuantParams quant_params;
+        if (mUnpaddedHiddenSize > 0)
+        {
+            ASSERT_LE(mUnpaddedHiddenSize, mHiddenSize) << "Unpadded hidden size must be <= hidden size.";
+        }
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 335a5c8 and 512b036.

📒 Files selected for processing (1)
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (8 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
🧠 Learnings (2)
📓 Common learnings
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
📚 Learning: 2025-08-20T07:43:36.447Z
Learnt from: ChristinaZ
PR: NVIDIA/TensorRT-LLM#7068
File: cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh:169-172
Timestamp: 2025-08-20T07:43:36.447Z
Learning: In TensorRT-LLM MOE kernels, when processing up to 128 experts across 32 threads, each thread handles at most 4 experts (N < 5 constraint), where N represents candidates per thread rather than total system capacity.

Applied to files:

  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
🧬 Code graph analysis (1)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (1)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (7)
  • mTotalTokens (540-540)
  • mHiddenSize (378-378)
  • mInterSize (539-539)
  • mNumExperts (379-379)
  • mK (381-381)
  • mWorkspace (515-515)
  • mFinalOutput (537-537)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (5)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (5)

257-258: Reset of mUnpaddedHiddenSize in TearDown is correct

Resetting the test-scoped override avoids bleed-over between tests. LGTM.


2065-2072: DeepSeekV3 param threading looks good

The inter_size_fraction calculation and parallel test invocation are consistent with the earlier memory guard.


2351-2353: OSS profiler distribution dims updated correctly

Using 1024, 1024, 4096 for (hidden, unpadded_hidden, inter) is consistent. LGTM.


2301-2302: I’ve pulled the GemmProfilerBackend definitions from both internal and public Cutlass headers around their struct declarations so we can inspect the init(...) signature (showing hidden vs. unpadded_hidden arguments). Once we see the exact parameter list, we can confirm the tuple ordering and DEFAULT_HIDDEN_SIZE usage.


1241-1247: runMoe parameters match updated signature—no changes required.

  • In cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (around lines 460–465), the pure-virtual runMoe is declared as
    …, QuantParams quant_params, int64_t num_rows, int64_t hidden_size, int64_t unpadded_hidden_size, int64_t inter_size, …
  • In the unit test cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (lines 1241–1247), the call passes
    mTotalTokens (num_rows), mHiddenSize (hidden_size), (mUnpaddedHiddenSize > 0 ? mUnpaddedHiddenSize : mHiddenSize) (unpadded_hidden_size), mInterSize / parallelism_config.tp_size (inter_size)
    exactly in that order.

All three size parameters line up with the header declaration.

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16416 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16416 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12338 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@bobboli bobboli merged commit bf1b958 into NVIDIA:main Aug 25, 2025
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants