KEMBAR78
[https://nvbugs/5392414] [fix] Add customized default routing method by ChristinaZ · Pull Request #6818 · NVIDIA/TensorRT-LLM · GitHub
Skip to content

Conversation

@ChristinaZ
Copy link
Collaborator

@ChristinaZ ChristinaZ commented Aug 12, 2025

Summary by CodeRabbit

  • New Features

    • Faster MOE Top‑K path for small inputs (len ≤128, k ≤8); BF16 support and optional softmax-before-topk routing exposed.
    • New default MOE routing Torch operator and Python flag (force_enable_pytorch_op) to force a PyTorch fallback.
  • Changes

    • Top‑K dispatch now routes eligible small inputs through the MOE path; iterators unified to Thrust and index type standardized.
    • Runtime dispatch and launcher updated to select specialized MOE kernels for small cases.
  • Removals

    • Legacy renorm MOE routing kernel and its old launcher removed.

Defauled routing-related optimization

  1. For the Pytorch backend, added a customized Pytorch Op.
  2. For the TRT backend, add one more customized topK kernel special for the MoE topK calculation. Now it only takes 3 us for the sample from the customer.
image

Test Coverage

Follow the example in nvBug.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@ChristinaZ ChristinaZ requested a review from a team as a code owner August 12, 2025 08:23
@ChristinaZ ChristinaZ requested a review from mikeiovine August 12, 2025 08:23
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Aug 12, 2025

📝 Walkthrough

Walkthrough

Adds a warp-level Top-K reduction utility and new MOE routing kernels/launcher, removes the previous renormMoeRouting implementation, integrates a small-input MOE Top-K path into topkLastDim with runtime dispatch, updates THOP/Torch operator wiring and Python routing gating, and replaces one build source entry.

Changes

Cohort / File(s) Summary
MOE Top-K utilities
cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
New header: warp-level Top-K reduction (TopKRedType value/index packing, TopKIdx, small-N Sort specializations, reduceTopK device API, arch fast-paths).
New custom MOE routing kernels & launcher
cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu, cpp/tensorrt_llm/kernels/customMoeRoutingKernels.h
Added templated customMoeRoutingKernel, invokeRenormMoeRouting launcher with runtime dispatch over MaxNumExperts/MaxTopK and DoSoftmaxBeforeTopK; per-warp/vector softmax helpers, nextPowerOfTwo, explicit instantiations; header template updated to include DoSoftmaxBeforeTopK.
Removed legacy MOE routing
cpp/tensorrt_llm/kernels/renormMoeRoutingKernels.cu
File deleted: removes prior renormMoeRoutingKernel, its Top-K/softmax machinery and launcher; public symbols removed.
Top-K last-dim integration / dispatch
cpp/tensorrt_llm/kernels/topkLastDim.cu
Added small-input MOE path (len ≤ 128, k ≤ 8, largest) via moe_reduce_topk and moe_topk_kernel; switched CUB counting/transform iterators to Thrust; normalized index alias to IdxT; updated standalone radix signatures; added nextPowerOfTwo helper.
THOP / Torch operator wiring & build
cpp/tensorrt_llm/thop/customMoeRoutingOp.cpp, cpp/tensorrt_llm/thop/CMakeLists.txt
Replaced include with custom header; introduced templated core custom_moe_routing_op<DoSoftmaxBeforeTopK> plus wrappers renorm_moe_routing_op/default_moe_routing_op; swapped build source renormMoeRoutingOp.cppcustomMoeRoutingOp.cpp; registered default_moe_routing_op.
Python routing gate & shim op
tensorrt_llm/_torch/modules/fused_moe/routing.py, tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
Added force_enable_pytorch_op and apply_pytorch; apply() chooses PyTorch fallback when forced or for large expert/top_k, otherwise calls TRT op; added fake trtllm::default_moe_routing_op shim returning tensors.
PDL/PDL_PROFILE guard adjustments
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/...
.../RoutingKernel.cuh
Removed inner PDL_PROFILE-based gating around cudaTriggerProgrammaticLaunchCompletion() calls in several SM90+ kernels; calls now compiled based on __CUDA_ARCH__ >= 900 and KernelParams::UsePdl only.

Sequence Diagram(s)

sequenceDiagram
  participant Python as DefaultMoeRoutingMethod.apply
  participant PyTorch as apply_pytorch
  participant TRT as trtllm::default_moe_routing_op

  Python->>Python: evaluate force_enable_pytorch_op or (num_experts>128 or top_k>8)
  alt Use PyTorch path
    Python->>PyTorch: softmax(router_logits) + topk
    PyTorch-->>Python: (values, indices)
  else Use TensorRT op
    Python->>TRT: default_moe_routing_op(router_logits, top_k)
    TRT-->>Python: (values, indices)
  end
Loading
sequenceDiagram
  participant Caller as topkLastDim invoke
  participant MOE as moe_reduce_topk dispatch
  participant Radix as standalone_stable_radix_11bits

  Caller->>Caller: if len<=128 && k<=8 && is_largest
  alt Small-input MOE path
    Caller->>MOE: dispatch specialized moe_topk_kernel by (len ∈ {32,64,96,128}, k ∈ {1,2,4,8})
    MOE-->>Caller: top-k values/indices
  else Radix path
    Caller->>Radix: standalone_stable_radix_11bits(...)
    Radix-->>Caller: top-k values/indices
  end
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Suggested labels

SW Architecture

Suggested reviewers

  • byshiue
  • yizhang-nv
  • rosenrodt
  • nv-guomingz
  • MatthiasKohl

Tip

🔌 Remote MCP (Model Context Protocol) integration is now available!

Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats.

✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai or @coderabbitai title anywhere in the PR title to generate the title automatically.

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@ChristinaZ ChristinaZ force-pushed the feat_opt_default_routing branch from 8f45eaa to 96b3a0c Compare August 12, 2025 08:32
@ChristinaZ ChristinaZ requested review from byshiue and djns99 and removed request for mikeiovine August 12, 2025 08:32
@ChristinaZ ChristinaZ changed the title [https://nvbugspro.nvidia.com/bug/5392414] [fix] Add customized default routing method [https://nvbugs/5392414] [fix] Add customized default routing method Aug 12, 2025
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🔭 Outside diff range comments (1)
cpp/tensorrt_llm/kernels/topkLastDim.cu (1)

1-16: Missing current year in copyright header.

According to the coding guidelines, all TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year (2025). The header shows "1993-2024" but should be "1993-2025".

Apply this diff to update the copyright year:

- * SPDX-FileCopyrightText: Copyright (c) 1993-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
🧹 Nitpick comments (1)
cpp/tensorrt_llm/kernels/topkLastDim.cu (1)

1661-1667: Small-input routing to MOE path.

The routing logic correctly implements the small-input optimization described in the summary:

  • Routes small inputs (inputLength ≤ 128, k ≤ 8) with is_largest=true to the MOE path
  • Uses printf for diagnostics as mentioned in the summary
  • The condition logic appears correct for the optimization criteria

However, consider removing or making the printf conditional for production builds.

Apply this diff to make the debug output conditional:

-        printf("moe_reduce_topk inputLength=%d k=%d \n", inputLength, k);
+#ifdef DEBUG_MOE_TOPK
+        printf("moe_reduce_topk inputLength=%d k=%d \n", inputLength, k);
+#endif
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8f45eaa and 96b3a0c.

📒 Files selected for processing (8)
  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh (1 hunks)
  • cpp/tensorrt_llm/kernels/renormMoeRoutingKernels.cu (6 hunks)
  • cpp/tensorrt_llm/kernels/renormMoeRoutingKernels.h (1 hunks)
  • cpp/tensorrt_llm/kernels/topkLastDim.cu (13 hunks)
  • cpp/tensorrt_llm/thop/CMakeLists.txt (1 hunks)
  • cpp/tensorrt_llm/thop/defaultMoeRoutingOp.cpp (1 hunks)
  • cpp/tensorrt_llm/thop/renormMoeRoutingOp.cpp (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/routing.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • cpp/tensorrt_llm/thop/CMakeLists.txt
  • cpp/tensorrt_llm/thop/renormMoeRoutingOp.cpp
  • cpp/tensorrt_llm/thop/defaultMoeRoutingOp.cpp
  • tensorrt_llm/_torch/modules/fused_moe/routing.py
  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
  • cpp/tensorrt_llm/kernels/renormMoeRoutingKernels.h
  • cpp/tensorrt_llm/kernels/renormMoeRoutingKernels.cu
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • cpp/tensorrt_llm/kernels/topkLastDim.cu
🔇 Additional comments (10)
cpp/tensorrt_llm/kernels/topkLastDim.cu (10)

27-34: Good addition of cooperative groups and thrust headers.

These headers support the new MOE topK functionality. The cooperative groups are needed for warp-level operations, and thrust iterators replace the cub iterators as mentioned in the summary.


211-212: Standardized indexing type usage.

The change from idxT to IdxT in function parameters aligns with the type unification mentioned in the summary, providing consistency across the codebase.


275-276: Consistent type parameter naming.

The vectorized_process function parameters have been updated to use the standardized IdxT type consistently.


1174-1234: MOE TopK kernel implementation looks well-structured.

The new moe_topk namespace introduces a specialized warp-level TopK implementation:

  • Proper cooperative groups usage for warp operations
  • Template parameters allow for compile-time optimization
  • Handles different numeric types with appropriate min value initialization
  • Uses warp-level reduction for efficient TopK computation

The implementation follows CUDA best practices and should provide the performance benefits mentioned in the PR objectives (3 μs latency).


1293-1295: Migration to thrust iterators.

The replacement of cub::CountingInputIterator and cub::TransformInputIterator with thrust::counting_iterator and thrust::transform_iterator aligns with the modernization mentioned in the summary.


1421-1423: Consistent iterator migration in one-block version.

The same thrust iterator migration is applied consistently in the one-block kernel version, maintaining code consistency across different execution paths.


1524-1541: Helper function for power-of-two calculation.

The next_power_of_two function is well-implemented with proper overflow protection and edge case handling. This supports the dynamic kernel selection mentioned in the summary.


1543-1607: Dynamic MOE kernel selection implementation.

The moe_reduce_topk function provides sophisticated dispatch logic:

  • Supports different combinations of max_len (32, 64, 96, 128) and max_topK (1, 2, 4, 8)
  • Uses template specialization for optimal performance
  • Proper grid/block dimension calculations
  • Falls back to a default kernel for unsupported configurations

This implementation matches the dynamic kernel selection described in the summary.


1616-1617: Type alias standardization.

The change from implicit type usage to explicit IdxT alias improves code clarity and maintains consistency with the type unification effort.


25-25: Include file verified
The header cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh exists and provides the expected MOE Top-K utilities (e.g., TopKRedType, Sort, and reduceTopK). No further action needed.

@ChristinaZ
Copy link
Collaborator Author

/bot run --add-multi-gpu-test

@tensorrt-cicd
Copy link
Collaborator

PR_Github #14956 [ run ] triggered by Bot

@ChristinaZ ChristinaZ force-pushed the feat_opt_default_routing branch from 96b3a0c to fcdab2e Compare August 12, 2025 11:49
@ChristinaZ
Copy link
Collaborator Author

/bot kill

@tensorrt-cicd
Copy link
Collaborator

PR_Github #14965 [ kill ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #14956 [ run ] completed with state ABORTED

@tensorrt-cicd
Copy link
Collaborator

PR_Github #14965 [ kill ] completed with state SUCCESS
Successfully killed previous jobs for commit fcdab2e

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 16

🧹 Nitpick comments (5)
cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh (1)

2-16: Update copyright year to 2025

The copyright header shows 2019-2023, but this is a new file added in 2025. Update the copyright year range to reflect the current year.

 /*
- * Copyright (c) 2019-2023, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
  *
cpp/tensorrt_llm/kernels/topkLastDim.cu (4)

25-34: Organize includes according to conventions

System headers should come before project headers, and includes should be organized in a consistent order.

-#include "moeTopKFuncs.cuh"
-#include "topkLastDim.h"
 #include <cooperative_groups.h>
 #include <cooperative_groups/reduce.h>
 #include <cub/cub.cuh>
 #include <cuda/atomic>
 #include <limits>
 #include <thrust/iterator/counting_iterator.h>
 #include <thrust/iterator/transform_iterator.h>
 #include <type_traits>
+
+#include "moeTopKFuncs.cuh"
+#include "topkLastDim.h"

41-41: Add type alias to improve consistency

Add a using statement for IdxT to match the pattern used elsewhere in the codebase.

 using SizeType32 = tensorrt_llm::runtime::SizeType32;
+using IdxT = SizeType32;

1209-1210: Variable names should follow lowerCamelCase convention

According to the coding guidelines, local variables should use lowerCamelCase naming.

-        InputT inputScore[MaxLen / WARP_SIZE];
-        IdxT inputIndex[MaxLen / WARP_SIZE];
+        InputT inputScore[MaxLen / kWARP_SIZE];
+        IdxT inputIndex[MaxLen / kWARP_SIZE];

1524-1541: Simplify next_power_of_two implementation

The function can be simplified using bit manipulation techniques and should handle edge cases better.

-int next_power_of_two(int num)
+constexpr int nextPowerOfTwo(int num)
 {
-    if (num <= 0)
+    if (num <= 1)
     {
-        return 1; // Handle invalid input
+        return 1;
     }
-    int power = 1;
-    while (power < num)
-    {
-        // Check for overflow before shifting
-        if (power > INT_MAX / 2)
-        {
-            return power;
-        }
-        power <<= 1;
-    }
-    return power;
+    // Round up to the next power of 2
+    --num;
+    num |= num >> 1;
+    num |= num >> 2;
+    num |= num >> 4;
+    num |= num >> 8;
+    num |= num >> 16;
+    return num + 1;
 }
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 96b3a0c and fcdab2e.

📒 Files selected for processing (8)
  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh (1 hunks)
  • cpp/tensorrt_llm/kernels/renormMoeRoutingKernels.cu (6 hunks)
  • cpp/tensorrt_llm/kernels/renormMoeRoutingKernels.h (1 hunks)
  • cpp/tensorrt_llm/kernels/topkLastDim.cu (13 hunks)
  • cpp/tensorrt_llm/thop/CMakeLists.txt (1 hunks)
  • cpp/tensorrt_llm/thop/defaultMoeRoutingOp.cpp (1 hunks)
  • cpp/tensorrt_llm/thop/renormMoeRoutingOp.cpp (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/routing.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
  • cpp/tensorrt_llm/thop/renormMoeRoutingOp.cpp
  • cpp/tensorrt_llm/thop/CMakeLists.txt
  • cpp/tensorrt_llm/kernels/renormMoeRoutingKernels.h
  • tensorrt_llm/_torch/modules/fused_moe/routing.py
  • cpp/tensorrt_llm/thop/defaultMoeRoutingOp.cpp
  • cpp/tensorrt_llm/kernels/renormMoeRoutingKernels.cu
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
  • cpp/tensorrt_llm/kernels/topkLastDim.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
  • cpp/tensorrt_llm/kernels/topkLastDim.cu
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
  • cpp/tensorrt_llm/kernels/topkLastDim.cu
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/kernels/topkLastDim.cu
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh (1)

198-198: Remove macro undef after converting to function

After converting the TOPK_SWAP macro to a function template, this undef should be removed.

cpp/tensorrt_llm/kernels/topkLastDim.cu (1)

1174-1234: Close moe_topk namespace with proper comment

The namespace is opened but not closed with a comment as required by coding guidelines.

Add after line 1234:

 } // end for tokenId
 }
+} // namespace moe_topk
⛔ Skipped due to learnings
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-12T10:28:57.287Z
Learning: Applies to **/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh} : In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)

Copy link
Collaborator

@djns99 djns99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, some potential style refactors but otherwise good

@ChristinaZ ChristinaZ force-pushed the feat_opt_default_routing branch from fcdab2e to 42107ef Compare August 15, 2025 03:02
@ChristinaZ ChristinaZ requested a review from a team as a code owner August 15, 2025 03:02
@ChristinaZ ChristinaZ requested a review from yizhang-nv August 15, 2025 03:02
@ChristinaZ
Copy link
Collaborator Author

/bot run --add-multi-gpu-test

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15382 [ run ] triggered by Bot

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (2)
cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh (2)

17-20: Include guard added correctly (matches guidelines).

Header is now wrapped with TRTLLM_MOETOPKFUNCS_CUH_H. Good catch and fix.


115-163: Replace TOPK_SWAP macro with an inline device helper.

Avoid macros per coding guidelines; use an inline device function for readability and type-safety.

Suggested refactor:

-#define TOPK_SWAP(I, J)                                                                                                \
-    {                                                                                                                  \
-        auto pairMin = min(topK[I].compValIdx, topK[J].compValIdx);                                                    \
-        auto pairMax = max(topK[I].compValIdx, topK[J].compValIdx);                                                    \
-        topK[I].compValIdx = pairMax;                                                                                  \
-        topK[J].compValIdx = pairMin;                                                                                  \
-    }
+template <typename RedType>
+__device__ inline void topkSwap(RedType* topK, int i, int j)
+{
+    auto pairMin = min(topK[i].compValIdx, topK[j].compValIdx);
+    auto pairMax = max(topK[i].compValIdx, topK[j].compValIdx);
+    topK[i].compValIdx = pairMax;
+    topK[j].compValIdx = pairMin;
+}

And update call sites:

-        TOPK_SWAP(0, 1);
+        topkSwap(topK, 0, 1);
-        TOPK_SWAP(0, 1);
-        TOPK_SWAP(1, 2);
-        TOPK_SWAP(0, 1);
+        topkSwap(topK, 0, 1);
+        topkSwap(topK, 1, 2);
+        topkSwap(topK, 0, 1);
-        TOPK_SWAP(0, 2);
-        TOPK_SWAP(1, 3);
-        TOPK_SWAP(0, 1);
-        TOPK_SWAP(2, 3);
-        TOPK_SWAP(1, 2);
+        topkSwap(topK, 0, 2);
+        topkSwap(topK, 1, 3);
+        topkSwap(topK, 0, 1);
+        topkSwap(topK, 2, 3);
+        topkSwap(topK, 1, 2);
🧹 Nitpick comments (6)
cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh (2)

83-95: Use ‘asm volatile’ for redux to avoid reordering/elision by the compiler.

The inline PTX should be marked volatile to prevent the compiler from optimizing it away or reordering around memory ops.

Apply:

-            TypeCmp result;
-            asm("redux.sync.max.u32 %0, %1, 0xffffffff;\n" : "=r"(result) : "r"(compValIdx));
+            TypeCmp result;
+            asm volatile("redux.sync.max.u32 %0, %1, 0xffffffff;\n" : "=r"(result) : "r"(compValIdx));
             return result;

169-173: Fix misleading static_assert message.

N is per-lane candidates (MaxLen / warpSize). The error message talks about “<= 128” which doesn’t match the check K < 5 here. Clarify the message to avoid confusion.

-    static_assert(N < 5, "Only support candidates number less than or equal to 128");
+    static_assert(N < 5, "Supports at most 4 per-lane candidates (i.e., MaxLen <= 128 with warpSize=32)");
cpp/tensorrt_llm/kernels/topkLastDim.cu (1)

1558-1601: ‘case 96’ is unreachable with current nextPowerOfTwo logic.

max_len is set to max(32, nextPowerOfTwo(len)), which never yields 96. Either remove the 96 branch or change the rounding policy if 3-warps (96) specialization is intended.

Would you like me to adjust the rounding to nearest-multiple-of-32 (capped at 128) to actually hit the 96 specialization, or remove the dead case?

cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu (3)

35-38: Prefer k-prefixed constants per project convention.

Rename BLOCK_SIZE, WARP_SIZE, WARPS_PER_BLOCK to kBLOCK_SIZE/kWARP_SIZE/kWARPS_PER_BLOCK for consistency with coding guidelines (and with moe_topk in topkLastDim.cu).

If you want, I can generate a follow-up patch touching declarations and all local uses in this file.


122-135: Index buffer types must match reduceTopK signature (int32_t).

reduceTopK currently takes int32_t (&outIdx)[K] and int32_t (&idx)[N]. You use IdxT here; compilation is safe only because IdxT is instantiated as int32_t. If IdxT changes, this will break.

Consider making these arrays int32_t to match the utility’s signature and cast when writing to topkIndices if needed. Otherwise, please confirm IdxT is guaranteed to be int32_t in all instantiations.

Would you like a small patch to switch local arrays to int32_t while preserving the external API?


173-190: Duplicate nextPowerOfTwo utility across files.

The helper is duplicated in topkLastDim.cu. Consider moving it to a shared header (e.g., cudaUtils.h) to avoid divergence and duplicated overflow checks.

I can consolidate this into a single utility and update both call sites.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between fcdab2e and 42107ef.

📒 Files selected for processing (5)
  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu (1 hunks)
  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.h (1 hunks)
  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh (1 hunks)
  • cpp/tensorrt_llm/kernels/renormMoeRoutingKernels.cu (0 hunks)
  • cpp/tensorrt_llm/kernels/topkLastDim.cu (13 hunks)
💤 Files with no reviewable changes (1)
  • cpp/tensorrt_llm/kernels/renormMoeRoutingKernels.cu
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.h
  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu
  • cpp/tensorrt_llm/kernels/topkLastDim.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.h
  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu
  • cpp/tensorrt_llm/kernels/topkLastDim.cu
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.h
  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.h
  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu
  • cpp/tensorrt_llm/kernels/topkLastDim.cu
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu
  • cpp/tensorrt_llm/kernels/topkLastDim.cu
🧠 Learnings (8)
📚 Learning: 2025-08-12T10:28:57.320Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-12T10:28:57.320Z
Learning: Applies to **/*.{h,hpp,hxx,hh,cuh} : Header files must use include guards named TRTLLM_<FILENAME>_H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)

Applied to files:

  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
📚 Learning: 2025-08-08T05:06:31.596Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.596Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.

Applied to files:

  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
📚 Learning: 2025-08-12T10:28:57.320Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-12T10:28:57.320Z
Learning: Applies to **/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh} : In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)

Applied to files:

  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
  • cpp/tensorrt_llm/kernels/topkLastDim.cu
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
📚 Learning: 2025-08-12T10:28:57.320Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-12T10:28:57.320Z
Learning: Applies to **/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh} : Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)

Applied to files:

  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
  • cpp/tensorrt_llm/kernels/topkLastDim.cu
📚 Learning: 2025-08-14T21:04:50.208Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • cpp/tensorrt_llm/kernels/topkLastDim.cu
🧬 Code Graph Analysis (2)
cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu (1)
cpp/tests/unit_tests/kernels/routing/routingTest.h (1)
  • normTopkProb (225-293)
cpp/tensorrt_llm/kernels/topkLastDim.cu (2)
cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu (3)
  • T (42-68)
  • nextPowerOfTwo (173-190)
  • nextPowerOfTwo (173-173)
cpp/include/tensorrt_llm/common/cudaUtils.h (1)
  • getMultiProcessorCount (393-400)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (5)
cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh (1)

203-205: Namespace closing comments are consistent.

Closing comments match the opened namespaces and follow the guideline. LGTM.

cpp/tensorrt_llm/kernels/customMoeRoutingKernels.h (1)

26-28: Template interface change looks consistent with .cu definition.

The added DoSoftmaxBeforeTopK template parameter matches the definition and usage in the CU. No issues spotted.

cpp/tensorrt_llm/kernels/topkLastDim.cu (1)

1661-1671: Routing decision looks good and keeps legacy path for general cases.

Small-input fast path for MOE (len <= 128, k <= 8, largest) and fallback to radix top-k are wired correctly.

cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu (2)

102-115: Consistent warp tile size and index math.

The cooperative_groups tiled partition and warp/lane indexing is correct and matches the per-warp data layout assumptions.


240-255: Launch config and programmatic stream serialization wiring looks solid.

Good use of cudaLaunchKernelEx and env-driven serialization. The post-launch sync check is appreciated.

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15382 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #11595 completed with status: 'FAILURE'

@ChristinaZ ChristinaZ force-pushed the feat_opt_default_routing branch from 42107ef to 9e459cc Compare August 15, 2025 09:14
@ChristinaZ
Copy link
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15432 [ run ] triggered by Bot

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (4)
cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh (2)

17-19: Redundant include guard with #pragma once

The file uses both #pragma once and traditional include guards. While this provides compatibility, the coding guidelines indicate header files must use include guards. The #pragma once is redundant.

-#pragma once
 #ifndef TRTLLM_MOETOPKFUNCS_CUH_H
 #define TRTLLM_MOETOPKFUNCS_CUH_H

115-121: Replace macro with inline function template

According to the coding guidelines, macros should be avoided in favor of constants and inline functions. This macro performs swapping logic that should be converted to a function template.

Convert the macro to an inline function template:

-#define TOPK_SWAP(I, J)                                                                                                \
-    {                                                                                                                  \
-        auto pairMin = min(topK[I].compValIdx, topK[J].compValIdx);                                                    \
-        auto pairMax = max(topK[I].compValIdx, topK[J].compValIdx);                                                    \
-        topK[I].compValIdx = pairMax;                                                                                  \
-        topK[J].compValIdx = pairMin;                                                                                  \
-    }
+template <typename RedType>
+__device__ inline void topkSwap(RedType* topK, int i, int j)
+{
+    auto pairMin = min(topK[i].compValIdx, topK[j].compValIdx);
+    auto pairMax = max(topK[i].compValIdx, topK[j].compValIdx);
+    topK[i].compValIdx = pairMax;
+    topK[j].compValIdx = pairMin;
+}

Then update the usage in the Sort specializations (lines 137, 146-148, 157-161):

-        TOPK_SWAP(0, 1);
+        topkSwap(topK, 0, 1);
cpp/tensorrt_llm/kernels/topkLastDim.cu (2)

1195-1204: Potential issue with minScore initialization for half/bfloat16 types

The current implementation uses std::is_floating_point_v which returns false for half and __nv_bfloat16 types, causing them to use std::numeric_limits<InputT>::lowest() which is undefined for these types.

Add explicit handling for CUDA extended float types:

     InputT minScore;
-    if constexpr (std::is_floating_point_v<InputT>)
+    if constexpr (std::is_same_v<InputT, half> || std::is_same_v<InputT, __nv_bfloat16>)
     {
-        minScore = InputT{-INFINITY};
+        minScore = static_cast<InputT>(-CUDART_INF_F);
+    }
+    else if constexpr (std::is_floating_point_v<InputT>)
+    {
+        minScore = -std::numeric_limits<InputT>::infinity();
     }
     else
     {
         minScore = std::numeric_limits<InputT>::lowest();
     }

1227-1231: Type mismatch in output casting

The code casts warpTopKScore[laneIdx] to float when writing to the output, but the output type should be OutputT according to the template parameter.

-            out[outputOffset + laneIdx] = static_cast<float>(warpTopKScore[laneIdx]);
+            out[outputOffset + laneIdx] = static_cast<OutputT>(warpTopKScore[laneIdx]);
🧹 Nitpick comments (11)
cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh (1)

201-201: Remove #undef for non-existent macro after conversion

Once the TOPK_SWAP macro is converted to a function template, this #undef becomes unnecessary.

-#undef TOPK_SWAP
cpp/tensorrt_llm/thop/customMoeRoutingOp.cpp (1)

98-108: Duplicate TORCH_LIBRARY_FRAGMENT registration

There are two separate TORCH_LIBRARY_FRAGMENT(trtllm, m) blocks (lines 86-91 and 98-103). These should be combined into a single fragment for cleaner code organization.

Combine the two fragments:

 TORCH_LIBRARY_FRAGMENT(trtllm, m)
 {
     m.def(
         "renorm_moe_routing_op(Tensor router_logits, SymInt topk"
         ") -> (Tensor, Tensor)");
-}
-
-TORCH_LIBRARY_IMPL(trtllm, CUDA, m)
-{
-    m.impl("renorm_moe_routing_op", &torch_ext::renorm_moe_routing_op);
-}
-
-TORCH_LIBRARY_FRAGMENT(trtllm, m)
-{
     m.def(
         "default_moe_routing_op(Tensor router_logits, SymInt topk"
         ") -> (Tensor, Tensor)");
 }
 
 TORCH_LIBRARY_IMPL(trtllm, CUDA, m)
 {
     m.impl("renorm_moe_routing_op", &torch_ext::renorm_moe_routing_op);
-}
-
-TORCH_LIBRARY_IMPL(trtllm, CUDA, m)
-{
     m.impl("default_moe_routing_op", &torch_ext::default_moe_routing_op);
 }
cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu (9)

57-58: Avoid implicit double math in device code: prefer expf (or __expf) for float.

Calling exp with float promotes to double on device. Use expf for correctness and performance (or __expf if fast-math is acceptable).

Apply this diff:

-        newScore = static_cast<float>(exp(newScore));
+        newScore = expf(newScore);

-        scores[i] = static_cast<DataType>(exp(scores[i] - maxScore));
+        scores[i] = static_cast<DataType>(expf(scores[i] - maxScore));

Also applies to: 87-88


102-106: Add const and restrict qualifiers to improve aliasing assumptions and memory throughput.

routerLogits is read-only; mark it const. Adding __restrict__ on pointers can help the compiler generate better code.

Apply this diff:

-__global__ void renormMoeRoutingKernel(InputT* routerLogits, OutputT* topkValues, IdxT* topkIndices,
+__global__ void renormMoeRoutingKernel(InputT const* __restrict__ routerLogits,
+    OutputT* __restrict__ topkValues, IdxT* __restrict__ topkIndices,
     int32_t const numTokens, int32_t const numExperts, int32_t const topK, bool const normTopkProb)

212-215: Mirror const/aliasing qualifiers on the host launcher’s signature.

Keep the API consistent with the kernel signature and allow callers to pass const buffers.

Apply this diff:

-template <typename InputT, typename OutputT, typename IdxT, bool DoSoftmaxBeforeTopK>
-void invokeRenormMoeRouting(InputT* routerLogits, OutputT* topkValues, IdxT* topkIndices, int64_t const numTokens,
+template <typename InputT, typename OutputT, typename IdxT, bool DoSoftmaxBeforeTopK>
+void invokeRenormMoeRouting(InputT const* routerLogits, OutputT* topkValues, IdxT* topkIndices,
+    int64_t const numTokens,
     int64_t const numExperts, int64_t const topK, cudaStream_t const stream)

Note: This change also requires updating the corresponding declaration and call sites.


220-223: Unreachable CASE(96): switch dispatch rounds to power-of-two; use 32-multiple rounding to hit 96.

maxNumExperts is computed with nextPowerOfTwo, so 96 becomes 128 and CASE(96) is never selected. If you intend to specialize 96, round up to the next multiple of 32 (warp size) instead of power-of-two.

Apply this diff:

-    uint32_t maxNumExperts = nextPowerOfTwo(numExperts) < 32 ? 32 : nextPowerOfTwo(numExperts);
+    // Round up to the next multiple of 32, clamp to [32, 128] so CASE(96) is reachable.
+    uint32_t maxNumExperts = static_cast<uint32_t>(
+        std::max<int64_t>(32, std::min<int64_t>(128, ((numExperts + 31) / 32) * 32)));

Alternatively, if 96 is not required, remove CASE(96) to avoid confusion.

Also applies to: 228-233


235-238: Simplify redundant null-check.

The if around the check is unnecessary; assert directly for clarity.

Apply this diff:

-    if (kernelInstance == nullptr)
-    {
-        TLLM_CHECK_WITH_INFO(kernelInstance != nullptr, "Can not find corresponding kernel instance.");
-    }
+    TLLM_CHECK_WITH_INFO(kernelInstance != nullptr, "Cannot find corresponding kernel instance.");

173-190: nextPowerOfTwo: consider a wider type or bit-twiddling impl.

The function takes int while callers pass int64_t. For large inputs this truncates. Not critical here (dispatch capped at 128), but for generality consider a uint32_t/uint64_t bit-hack that avoids looping and overflow checks.

Example replacement:

static inline uint32_t nextPowerOfTwo(uint32_t n) {
    if (n == 0) return 1u;
    n--;
    n |= n >> 1; n |= n >> 2; n |= n >> 4;
    n |= n >> 8; n |= n >> 16;
    return n + 1;
}

35-37: Constant naming style.

Per coding guidelines, constants prefer kPREFIXED_UPPER_SNAKE. Consider kBLOCK_SIZE, kWARP_SIZE, kWARPS_PER_BLOCK.

No functional change; up to you if you want to align naming now or defer.


1-15: Copyright year.

Guidelines require the current year in headers. Please update 2019–2023 to include 2025.

Apply this diff:

- * Copyright (c) 2019-2023, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2019-2025, NVIDIA CORPORATION.  All rights reserved.

102-171: Pointer reads from routerLogits are potentially uncoalesced; consider vectorized loads.

Optional: For larger expert counts, consider reading routerLogits as float2/float4 (or __half2) where alignment permits, to improve memory throughput. The per-lane stride is currently 1; vectorization could help if you also adjust the local storage to vector lanes.

Happy to sketch a vectorized load path if you want to explore this optimization.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 42107ef and 9e459cc.

📒 Files selected for processing (9)
  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu (1 hunks)
  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.h (1 hunks)
  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh (1 hunks)
  • cpp/tensorrt_llm/kernels/renormMoeRoutingKernels.cu (0 hunks)
  • cpp/tensorrt_llm/kernels/topkLastDim.cu (13 hunks)
  • cpp/tensorrt_llm/thop/CMakeLists.txt (1 hunks)
  • cpp/tensorrt_llm/thop/customMoeRoutingOp.cpp (5 hunks)
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/routing.py (1 hunks)
💤 Files with no reviewable changes (1)
  • cpp/tensorrt_llm/kernels/renormMoeRoutingKernels.cu
🚧 Files skipped from review as they are similar to previous changes (2)
  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.h
  • tensorrt_llm/_torch/modules/fused_moe/routing.py
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • cpp/tensorrt_llm/thop/customMoeRoutingOp.cpp
  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu
  • cpp/tensorrt_llm/kernels/topkLastDim.cu
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/thop/customMoeRoutingOp.cpp
  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu
  • cpp/tensorrt_llm/kernels/topkLastDim.cu
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/thop/customMoeRoutingOp.cpp
  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu
  • cpp/tensorrt_llm/kernels/topkLastDim.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/thop/customMoeRoutingOp.cpp
  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu
  • cpp/tensorrt_llm/kernels/topkLastDim.cu
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
🧠 Learnings (8)
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • cpp/tensorrt_llm/thop/customMoeRoutingOp.cpp
  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
📚 Learning: 2025-08-12T10:28:57.320Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-12T10:28:57.320Z
Learning: Applies to **/*.{h,hpp,hxx,hh,cuh} : Header files must use include guards named TRTLLM_<FILENAME>_H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)

Applied to files:

  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
📚 Learning: 2025-08-08T05:06:31.596Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.596Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.

Applied to files:

  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu
📚 Learning: 2025-08-12T10:28:57.320Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-12T10:28:57.320Z
Learning: Applies to **/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh} : In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)

Applied to files:

  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
  • cpp/tensorrt_llm/kernels/topkLastDim.cu
📚 Learning: 2025-08-12T10:28:57.320Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-12T10:28:57.320Z
Learning: Applies to **/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh} : Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)

Applied to files:

  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
  • cpp/tensorrt_llm/kernels/topkLastDim.cu
📚 Learning: 2025-08-14T21:04:50.208Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • cpp/tensorrt_llm/kernels/topkLastDim.cu
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (13)
cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh (2)

165-171: Well-structured Top-K reduction implementation

The reduceTopK function is well-designed with clear template parameters for compile-time optimization, appropriate static assertions for parameter validation, and efficient warp-level parallelism. The approach of using cooperative groups for reduction is appropriate for this use case.


203-204: Namespace closing comments follow guidelines

The namespace closing comments correctly follow the coding guidelines format.

cpp/tensorrt_llm/thop/CMakeLists.txt (1)

86-86: Build configuration correctly updated

The replacement of renormMoeRoutingOp.cpp with customMoeRoutingOp.cpp in the CMakeLists.txt file is consistent with the refactoring described in the PR summary.

tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1)

535-541: New operator registration follows existing patterns

The new default_moe_routing_op registration correctly follows the same pattern as the existing renorm_moe_routing_op, returning consistent tensor types (int32 for indices, float32 for values) with the expected shape.

cpp/tensorrt_llm/thop/customMoeRoutingOp.cpp (2)

28-73: Well-designed template architecture for MOE routing

The templated custom_moe_routing_op function provides good flexibility for different routing behaviors while maintaining type safety with explicit template instantiations for all supported data types.


75-83: Clean wrapper functions for API compatibility

The wrapper functions renorm_moe_routing_op and default_moe_routing_op provide a clean interface that preserves backward compatibility while delegating to the templated implementation.

cpp/tensorrt_llm/kernels/topkLastDim.cu (5)

31-32: Good addition of necessary includes for MOE integration

The new includes for moeTopKFuncs.cuh and cooperative_groups are appropriate for supporting the MOE Top-K functionality. The thrust iterators are correctly added to replace the Cub equivalents.


1174-1233: Well-structured MOE Top-K kernel implementation

The moe_topk_kernel is well-designed with:

  • Efficient warp-level parallelism
  • Clear template parameters for compile-time optimization
  • Proper use of cooperative groups for reduction
  • Appropriate handling of edge cases with minScore initialization

1524-1541: Simple and effective power-of-two calculation

The nextPowerOfTwo function is correctly implemented with overflow protection. Good defensive programming.


1543-1607: Well-organized MOE routing dispatch logic

The moe_reduce_topk function provides a clean dispatch mechanism with appropriate template specializations for different input sizes and Top-K values. The nested switch statements, while verbose, ensure optimal kernel selection at runtime.


1661-1671: Smart routing between MOE and stable radix implementations

The conditional routing logic (lines 1661-1671) intelligently selects the specialized MOE kernel for small inputs (≤128 elements, k≤8) while falling back to the stable radix sort for larger inputs. This optimization should provide significant performance benefits for common MOE use cases.

cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu (2)

217-223: Guard zero-token launches.

If numTokens is 0, numBlocks becomes 0 and kernel launch is undefined. If upstream guarantees numTokens > 0, ignore this. Otherwise, early-out to avoid invalid launch.

Do callers guarantee numTokens > 0? If not, add:

if (numTokens <= 0) {
    return;
}

I can scan the repository for call sites and add an early-return consistently if you want.


145-156: normTopkProb is hard-coded to false — is renormalization intentionally disabled?

Even when DoSoftmaxBeforeTopK is true, normTopkProb is forced to false at the launcher. If the API is for “renorm” routing, consider exposing a flag to enable renorm over top-k mass.

Should we add a launcher argument for normTopkProb (default false) and thread it through?
I can provide a small patch if renormalization needs to be controllable at runtime.

Also applies to: 220-223

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15432 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #11628 completed with status: 'FAILURE'

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15576 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15568 [ run ] completed with state ABORTED

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (6)
cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh (2)

115-121: Macro TOPK_SWAP: convert to inline device function

Per guidelines, avoid macros. Replace TOPK_SWAP with an inline device function.

-#define TOPK_SWAP(I, J)                                                                                                \
-    {                                                                                                                  \
-        auto pairMin = min(topK[I].compValIdx, topK[J].compValIdx);                                                    \
-        auto pairMax = max(topK[I].compValIdx, topK[J].compValIdx);                                                    \
-        topK[I].compValIdx = pairMax;                                                                                  \
-        topK[J].compValIdx = pairMin;                                                                                  \
-    }
+template <typename RedType>
+__device__ inline void topkSwap(RedType* topK, int i, int j)
+{
+    auto pairMin = min(topK[i].compValIdx, topK[j].compValIdx);
+    auto pairMax = max(topK[i].compValIdx, topK[j].compValIdx);
+    topK[i].compValIdx = pairMax;
+    topK[j].compValIdx = pairMin;
+}

Outside this hunk, update call sites:

  • Sort<2>::run: replace TOPK_SWAP(0, 1) with topkSwap(topK, 0, 1)
  • Sort<3>::run: replace 3 occurrences
  • Sort<4>::run: replace 5 occurrences

83-95: Wrong reducer: cg::greater returns bool; use cg::maximum for argmax-style reduction

cg::reduce expects a reducer returning the same type as the inputs. cg::greater returns bool and leads to incorrect results.

Apply this diff:

-        if constexpr (!kTLLM_GEN_HAS_FAST_REDUX || sizeof(TypeCmp) == 8)
+        if constexpr (!kTLLM_GEN_HAS_FAST_REDUX || sizeof(TypeCmp) == 8)
         {
-            return cg::reduce(warp, compValIdx, cg::greater<TypeCmp>{});
+            return cg::reduce(warp, compValIdx, cg::maximum<TypeCmp>{});
         }
         else
         {
             TypeCmp result;
             asm("redux.sync.max.u32 %0, %1, 0xffffffff;\n" : "=r"(result) : "r"(compValIdx));
             return result;
         }
cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu (4)

35-38: Constants should use kPREFIXED_UPPER_SNAKE

Rename to follow guidelines and update references (BLOCK_SIZE→kBLOCK_SIZE, WARP_SIZE→kWARP_SIZE, WARPS_PER_BLOCK→kWARPS_PER_BLOCK).

-static constexpr int BLOCK_SIZE = 1024;
-static constexpr int WARP_SIZE = 32;
-static constexpr int WARPS_PER_BLOCK = BLOCK_SIZE / WARP_SIZE;
+static constexpr int kBLOCK_SIZE = 1024;
+static constexpr int kWARP_SIZE = 32;
+static constexpr int kWARPS_PER_BLOCK = kBLOCK_SIZE / kWARP_SIZE;

Outside this hunk, update usages:

  • thread tiling: cg::tiled_partition<kWARP_SIZE>
  • index math: tIdx / kWARP_SIZE, tIdx % kWARP_SIZE
  • grid math: gridDim.x * kWARPS_PER_BLOCK
  • block dim: dim3(..., kBLOCK_SIZE)

41-50: Wrong reducer in softmax: use cg::maximum instead of cg::greater

cg::greater returns bool and breaks reduction. Use cg::maximum().

-    maxScore = cg::reduce(warp, maxScore, cg::greater<T>());
+    maxScore = cg::reduce(warp, maxScore, cg::maximum<T>());

70-99: Wrong reducer in vector softmax: use cg::maximum instead of cg::greater

Same issue in the array version.

-    maxScore = cg::reduce(warp, maxScore, cg::greater<DataType>());
+    maxScore = cg::reduce(warp, maxScore, cg::maximum<DataType>());

116-116: Unsafe -INFINITY initialization for half/bfloat16 in non-softmax path

When DoSoftmaxBeforeTopK=false and InputT is half/bfloat16, BaseType{-INFINITY} is fragile. Initialize via CUDA float INF and cast.

-    BaseType minScore = BaseType{-INFINITY};
+    BaseType minScore;
+    if constexpr (std::is_same_v<BaseType, half> || std::is_same_v<BaseType, __nv_bfloat16>)
+    {
+        minScore = static_cast<BaseType>(-CUDART_INF_F);
+    }
+    else
+    {
+        minScore = BaseType{-INFINITY};
+    }
🧹 Nitpick comments (7)
cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh (1)

165-173: Static-assert message is misleading

The code enforces N < 5, but the message reads “<= 128”. Correct the message to avoid confusion.

-    static_assert(N < 5, "Only support candidates number less than or equal to 128");
+    static_assert(N < 5, "Only support candidates number <= 4");
cpp/tensorrt_llm/kernels/topkLastDim.cu (3)

1201-1245: Warp-level MOE TopK kernel: solid structure; one minor nit on bounds

Kernel correctly vectorizes across MaxLen/kWARP_SIZE lanes and uses reduce_topk. Note: MaxLen switch below includes 96 but dispatcher uses nextPowerOfTwo, so 96 variants are currently unreachable. Consider rounding len up to the next multiple of 32 (not power-of-two) if you want to hit 96-specialized kernels.


1555-1618: Dispatcher never selects 96-case; also, unused ‘greater’ param

  • The dispatcher uses nextPowerOfTwo for max_len, making the case 96 unreachable.
  • The greater parameter is unused in moe_reduce_topk. Since invokeTopkLastDim guards is_largest==true for this path, either drop the parameter or pass is_largest for clarity.

Optional improvements:

  • Round up to multiples of 32 and clamp to 128 to enable the 96-case:
    max_len = min(128u, ((len + 31u) & ~31u));
  • Remove or repurpose ‘greater’.

1672-1682: Pass correct intent for ‘greater’ param (or remove it)

The call passes !is_largest although the branch only executes when is_largest is true. If you keep the parameter, pass is_largest to reflect intent:

-        moe_reduce_topk(in, batchSize, inputLength, k, out_val_, out_idx_, !is_largest, stream);
+        moe_reduce_topk(in, batchSize, inputLength, k, out_val_, out_idx_, is_largest, stream);

Given ‘greater’ is currently unused in moe_reduce_topk, consider removing it entirely to avoid confusion.

cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu (3)

109-115: Follow-up after constant rename: update WARP/BLOCK usages

If you adopt k-prefixed constants, also update:

  • cg::thread_block_tile<WARP_SIZE> → cg::thread_block_tile<kWARP_SIZE>
  • array dimensions: MaxNumExperts / kWARP_SIZE
  • element index math: i * kWARP_SIZE + laneIdx
  • block size in launch: dim3(..., kBLOCK_SIZE)

Minimizes future drift and keeps style consistent.

Also applies to: 122-141


167-184: Duplicate nextPowerOfTwo helper

Same helper exists in topkLastDim.cu. Prefer a shared utility header to avoid duplication.


211-216: Dispatcher includes unreachable 96-experts case

maxNumExperts uses nextPowerOfTwo, so CASE(96) is never selected. Either:

  • switch to rounding up to nearest multiple of 32 to enable 96, or
  • drop CASE(96) to reduce compile-time footprint.

Not functionally harmful, but dead code is a maintenance burden.

Also applies to: 219-227

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 25322f6 and 778c576.

📒 Files selected for processing (10)
  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu (1 hunks)
  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.h (2 hunks)
  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh (1 hunks)
  • cpp/tensorrt_llm/kernels/renormMoeRoutingKernels.cu (0 hunks)
  • cpp/tensorrt_llm/kernels/topkLastDim.cu (13 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu (1 hunks)
  • cpp/tensorrt_llm/thop/CMakeLists.txt (1 hunks)
  • cpp/tensorrt_llm/thop/customMoeRoutingOp.cpp (5 hunks)
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/routing.py (1 hunks)
💤 Files with no reviewable changes (1)
  • cpp/tensorrt_llm/kernels/renormMoeRoutingKernels.cu
🚧 Files skipped from review as they are similar to previous changes (5)
  • cpp/tensorrt_llm/thop/CMakeLists.txt
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • tensorrt_llm/_torch/modules/fused_moe/routing.py
  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.h
  • cpp/tensorrt_llm/thop/customMoeRoutingOp.cpp
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu
  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu
  • cpp/tensorrt_llm/kernels/topkLastDim.cu
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu
  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu
  • cpp/tensorrt_llm/kernels/topkLastDim.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu
  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu
  • cpp/tensorrt_llm/kernels/topkLastDim.cu
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu
  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu
  • cpp/tensorrt_llm/kernels/topkLastDim.cu
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
🧠 Learnings (9)
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu
  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
📚 Learning: 2025-08-12T10:28:57.320Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-12T10:28:57.320Z
Learning: Applies to **/*.{h,hpp,hxx,hh,cuh} : Header files must use include guards named TRTLLM_<FILENAME>_H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)

Applied to files:

  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
📚 Learning: 2025-08-08T05:06:31.596Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.596Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.

Applied to files:

  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
📚 Learning: 2025-08-12T10:28:57.320Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-12T10:28:57.320Z
Learning: Applies to **/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh} : In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)

Applied to files:

  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
  • cpp/tensorrt_llm/kernels/topkLastDim.cu
📚 Learning: 2025-08-12T10:28:57.320Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-12T10:28:57.320Z
Learning: Applies to **/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh} : Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)

Applied to files:

  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
  • cpp/tensorrt_llm/kernels/topkLastDim.cu
📚 Learning: 2025-08-12T10:28:57.320Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-12T10:28:57.320Z
Learning: Applies to **/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py} : Prepend NVIDIA copyright header (current year) to all source files

Applied to files:

  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu
📚 Learning: 2025-08-14T21:04:50.208Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • cpp/tensorrt_llm/kernels/topkLastDim.cu
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (4)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu (1)

201-209: PDL_PROFILE macro rename is consistent

No instances of the old FDL_PROFILE macro remain in the codebase. Both RoutingRenormalize.cu and RoutingLlama4.cu now correctly gate programmatic launches with PDL_PROFILE, and a global search confirms no residual FDL_PROFILE usage. No further changes are required.

cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh (1)

17-20: Guards and namespace endings look good

Include guard presence and named namespace closing comments align with repository guidelines.

Also applies to: 203-205

cpp/tensorrt_llm/kernels/topkLastDim.cu (2)

1175-1181: Good: constants follow kPREFIXED_UPPER_SNAKE

kBLOCK_SIZE/kWARP_SIZE/kWARPS_PER_BLOCK naming complies with guidelines and improves readability.


1182-1199: Good: -INF handling for half/bfloat16

Specializations avoid fragile brace-initialization and use CUDA constants for fp16/bf16. This prevents compile/runtime issues when forming sentinel minima.

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15576 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #11729 completed with status: 'ABORTED'

@ChristinaZ
Copy link
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15600 [ run ] triggered by Bot

@ChristinaZ ChristinaZ force-pushed the feat_opt_default_routing branch from 778c576 to 055facc Compare August 18, 2025 10:24
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (7)
cpp/tensorrt_llm/thop/customMoeRoutingOp.cpp (3)

48-51: Use data_ptr() for read-only router_logits; avoid mutable_data_ptr on inputs

router_logits is an input and should not use mutable_data_ptr(). Use typed data_ptr for clarity and const-correct intent. Outputs can keep mutable_data_ptr().

Apply:

-        tk::invokeRenormMoeRouting<float, float, int32_t, DoSoftmaxBeforeTopK>(
-            reinterpret_cast<float*>(router_logits.mutable_data_ptr()),
-            reinterpret_cast<float*>(topk_values.mutable_data_ptr()),
-            reinterpret_cast<int32_t*>(topk_indices.mutable_data_ptr()), num_tokens, num_experts, topk, stream);
+        tk::invokeRenormMoeRouting<float, float, int32_t, DoSoftmaxBeforeTopK>(
+            router_logits.data_ptr<float>(),
+            topk_values.mutable_data_ptr<float>(),
+            topk_indices.mutable_data_ptr<int32_t>(), num_tokens, num_experts, topk, stream);

55-59: BF16 pointer cast: cast from at::BFloat16, not raw mutable_data_ptr

Use the BF16-typed data_ptr and then reinterpret_cast to __nv_bfloat16*; keep outputs as mutable_data_ptr.

-        tk::invokeRenormMoeRouting<__nv_bfloat16, float, int32_t, DoSoftmaxBeforeTopK>(
-            reinterpret_cast<__nv_bfloat16*>(router_logits.mutable_data_ptr()),
-            reinterpret_cast<float*>(topk_values.mutable_data_ptr()),
-            reinterpret_cast<int32_t*>(topk_indices.mutable_data_ptr()), num_tokens, num_experts, topk, stream);
+        tk::invokeRenormMoeRouting<__nv_bfloat16, float, int32_t, DoSoftmaxBeforeTopK>(
+            reinterpret_cast<__nv_bfloat16*>(router_logits.data_ptr<at::BFloat16>()),
+            topk_values.mutable_data_ptr<float>(),
+            topk_indices.mutable_data_ptr<int32_t>(), num_tokens, num_experts, topk, stream);

62-66: Half pointer cast: use typed data_ptrat::Half() for router_logits

Same rationale as BF16/FP32.

-        tk::invokeRenormMoeRouting<half, float, int32_t, DoSoftmaxBeforeTopK>(
-            reinterpret_cast<half*>(router_logits.mutable_data_ptr()),
-            reinterpret_cast<float*>(topk_values.mutable_data_ptr()),
-            reinterpret_cast<int32_t*>(topk_indices.mutable_data_ptr()), num_tokens, num_experts, topk, stream);
+        tk::invokeRenormMoeRouting<half, float, int32_t, DoSoftmaxBeforeTopK>(
+            reinterpret_cast<half*>(router_logits.data_ptr<at::Half>()),
+            topk_values.mutable_data_ptr<float>(),
+            topk_indices.mutable_data_ptr<int32_t>(), num_tokens, num_experts, topk, stream);
cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu (4)

35-38: Rename constants to follow kPREFIXED_UPPER_SNAKE convention

Per coding guidelines, constants should be k-prefixed. Update names and references.

-static constexpr int BLOCK_SIZE = 1024;
-static constexpr int WARP_SIZE = 32;
-static constexpr int WARPS_PER_BLOCK = BLOCK_SIZE / WARP_SIZE;
+static constexpr int kBLOCK_SIZE = 1024;
+static constexpr int kWARP_SIZE = 32;
+static constexpr int kWARPS_PER_BLOCK = kBLOCK_SIZE / kWARP_SIZE;

And update all occurrences (tIdx/warpIdx/warpNum, blockDim, gridDim computations).


41-50: Fix cg::reduce reducer: use cg::maximum (cg::greater returns bool)

cg::reduce requires an associative reducer returning T. cg::greater is a comparator (returns bool). Use cg::maximum.

-    maxScore = cg::reduce(warp, maxScore, cg::greater<T>());
+    maxScore = cg::reduce(warp, maxScore, cg::maximum<T>());

70-83: Fix cg::reduce reducer in vector softmax as well

Same issue as above for DataType.

-    maxScore = cg::reduce(warp, maxScore, cg::greater<DataType>());
+    maxScore = cg::reduce(warp, maxScore, cg::maximum<DataType>());

117-117: Initialize -inf safely for half/bfloat16

Brace-initializing from -INFINITY is fragile for CUDA extended types. Cast from float INFINITY for fp16/bf16.

-    BaseType minScore = BaseType{-INFINITY};
+    BaseType minScore;
+    if constexpr (std::is_same_v<BaseType, half> || std::is_same_v<BaseType, __nv_bfloat16>)
+    {
+        minScore = static_cast<BaseType>(-CUDART_INF_F);
+    }
+    else
+    {
+        minScore = BaseType{-INFINITY};
+    }
🧹 Nitpick comments (3)
cpp/tensorrt_llm/thop/customMoeRoutingOp.cpp (1)

68-70: Replace throw with TORCH_CHECK to match extension error handling

Throwing std::invalid_argument from a Torch extension is not idiomatic. Prefer TORCH_CHECK for reporting invalid dtypes.

-        // Handle other data types
-        throw std::invalid_argument("Invalid dtype, only supports float32, float16 and bfloat16");
-        break;
+        TORCH_CHECK(false, "Invalid dtype: only supports float32, float16 and bfloat16");
cpp/tensorrt_llm/kernels/topkLastDim.cu (1)

1554-1566: Prefer std::numeric_limits over raw INT_MAX in nextPowerOfTwo

Since you added , this compiles, but using std::numeric_limits::max() is clearer and portable.

-        if (power > INT_MAX / 2)
+        if (power > std::numeric_limits<int>::max() / 2)
cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu (1)

129-136: Load path: align with InputT const correctness if possible

routerLogits are read-only. If feasible, make kernel parameter InputT const* and propagate const to invokeRenormMoeRouting. This will allow thop to avoid mutable_data_ptr on inputs. If ABI constraints prevent this now, consider as a follow-up.

I can generate the const-propagation diff across .h/.cu and op call sites if you want to take it in this PR.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 778c576 and 055facc.

📒 Files selected for processing (12)
  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu (1 hunks)
  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.h (2 hunks)
  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh (1 hunks)
  • cpp/tensorrt_llm/kernels/renormMoeRoutingKernels.cu (0 hunks)
  • cpp/tensorrt_llm/kernels/topkLastDim.cu (13 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh (2 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu (0 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu (0 hunks)
  • cpp/tensorrt_llm/thop/CMakeLists.txt (1 hunks)
  • cpp/tensorrt_llm/thop/customMoeRoutingOp.cpp (5 hunks)
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/routing.py (1 hunks)
💤 Files with no reviewable changes (3)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu
  • cpp/tensorrt_llm/kernels/renormMoeRoutingKernels.cu
🚧 Files skipped from review as they are similar to previous changes (5)
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • cpp/tensorrt_llm/thop/CMakeLists.txt
  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.h
  • tensorrt_llm/_torch/modules/fused_moe/routing.py
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh
  • cpp/tensorrt_llm/kernels/topkLastDim.cu
  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu
  • cpp/tensorrt_llm/thop/customMoeRoutingOp.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh
  • cpp/tensorrt_llm/kernels/topkLastDim.cu
  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu
  • cpp/tensorrt_llm/thop/customMoeRoutingOp.cpp
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh
  • cpp/tensorrt_llm/kernels/topkLastDim.cu
  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu
  • cpp/tensorrt_llm/thop/customMoeRoutingOp.cpp
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/kernels/topkLastDim.cu
  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu
  • cpp/tensorrt_llm/thop/customMoeRoutingOp.cpp
🧠 Learnings (7)
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh
  • cpp/tensorrt_llm/thop/customMoeRoutingOp.cpp
📚 Learning: 2025-08-14T15:36:37.576Z
Learnt from: MatthiasKohl
PR: NVIDIA/TensorRT-LLM#6904
File: cpp/tensorrt_llm/kernels/mlaKernels.cu:436-439
Timestamp: 2025-08-14T15:36:37.576Z
Learning: CUDA kernels prioritize performance and should avoid runtime bounds checking or conditional operations that cause branching/warp divergence. Input validation should be done at the host level before kernel launch, not per-thread in the kernel.

Applied to files:

  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh
📚 Learning: 2025-08-12T10:28:57.320Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-12T10:28:57.320Z
Learning: Applies to **/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh} : Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)

Applied to files:

  • cpp/tensorrt_llm/kernels/topkLastDim.cu
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • cpp/tensorrt_llm/kernels/topkLastDim.cu
📚 Learning: 2025-08-14T21:04:50.208Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • cpp/tensorrt_llm/kernels/topkLastDim.cu
📚 Learning: 2025-08-12T10:28:57.320Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-12T10:28:57.320Z
Learning: Applies to **/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py} : Prepend NVIDIA copyright header (current year) to all source files

Applied to files:

  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • cpp/tensorrt_llm/thop/customMoeRoutingOp.cpp
🔇 Additional comments (6)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh (2)

381-386: PDL trigger is now correctly gated by arch; LGTM

Switching from PDL_PROFILE to CUDA_ARCH >= 900 while keeping the runtime UsePdl guard is correct and aligns with other files in this PR.


759-767: Secondary-kernel trigger visibility caveat — verify consumer sync

The trigger call is guarded by arch and UsePdl, but as the comment says, it doesn’t guarantee visibility of prior writes. Ensure the consumer performs cudaGridDependencySynchronize (or equivalent) before reading the produced data.

Would you like me to grep the call sites in this PR to confirm we do a dependency sync before consumption?

cpp/tensorrt_llm/thop/customMoeRoutingOp.cpp (2)

75-83: Wrapper split looks good

The templated core with wrappers for renorm and default routing keeps the API surface clean and enables compile-time specialization of DoSoftmaxBeforeTopK.


98-108: Operator registration: OK

Defining a new fragment for default_moe_routing_op and wiring CUDA impls matches how renorm_moe_routing_op is handled. Names are consistent with Python-side adapters.

cpp/tensorrt_llm/kernels/topkLastDim.cu (1)

1672-1682: Runtime dispatch between MOE and radix path: LGTM

The small-input fast path is constrained to inputLength <= 128, k <= 8, is_largest == true; otherwise the stable radix path is used and sorted=true is preserved. Good tradeoff.

cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu (1)

234-249: cudaLaunchKernelEx usage: LGTM

Programmatic stream serialization attribute is correctly wired to env flag. Good use of cudaLaunchKernelEx for function-pointer-based dispatch.

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15600 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11744 completed with status: 'FAILURE'

…topk for trt backend

Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
@ChristinaZ ChristinaZ force-pushed the feat_opt_default_routing branch from 055facc to 7fd0d1d Compare August 18, 2025 15:24
@ChristinaZ
Copy link
Collaborator Author

/bot run

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (6)
cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu (2)

49-50: Fix reduction op: use cg::maximum for max reductions (cg::greater returns bool)

cg::reduce’s operator must return the same type as the inputs. cg::greater returns bool, producing incorrect results.

Apply:

-    maxScore = cg::reduce(warp, maxScore, cg::greater<T>());
+    maxScore = cg::reduce(warp, maxScore, cg::maximum<T>());

and

-    maxScore = cg::reduce(warp, maxScore, cg::greater<DataType>());
+    maxScore = cg::reduce(warp, maxScore, cg::maximum<DataType>());

Also applies to: 82-83


117-117: Robust -INF initialization for half/bfloat16

Brace-initializing BaseType{-INFINITY} is unsafe for half/bfloat16 and can miscompile. Cast CUDART_INF_F.

Apply:

-    BaseType minScore = BaseType{-INFINITY};
+    BaseType minScore;
+    if constexpr (std::is_same_v<BaseType, half> || std::is_same_v<BaseType, __nv_bfloat16>)
+    {
+        minScore = static_cast<BaseType>(-CUDART_INF_F);
+    }
+    else
+    {
+        minScore = BaseType{-INFINITY};
+    }
cpp/tensorrt_llm/thop/customMoeRoutingOp.cpp (1)

48-66: Use typed data_ptr<> for read-only inputs; avoid mutable_data_ptr on router_logits

router_logits is an input and should use data_ptr<>. mutable_data_ptr() is only for outputs. Also prefer typed data_ptr for outputs where possible.

Apply:

-        tk::invokeRenormMoeRouting<float, float, int32_t, DoSoftmaxBeforeTopK>(
-            reinterpret_cast<float*>(router_logits.mutable_data_ptr()),
-            reinterpret_cast<float*>(topk_values.mutable_data_ptr()),
-            reinterpret_cast<int32_t*>(topk_indices.mutable_data_ptr()), num_tokens, num_experts, topk, stream);
+        tk::invokeRenormMoeRouting<float, float, int32_t, DoSoftmaxBeforeTopK>(
+            router_logits.data_ptr<float>(),
+            topk_values.data_ptr<float>(),
+            topk_indices.data_ptr<int32_t>(), num_tokens, num_experts, topk, stream);
@@
-        tk::invokeRenormMoeRouting<__nv_bfloat16, float, int32_t, DoSoftmaxBeforeTopK>(
-            reinterpret_cast<__nv_bfloat16*>(router_logits.mutable_data_ptr()),
-            reinterpret_cast<float*>(topk_values.mutable_data_ptr()),
-            reinterpret_cast<int32_t*>(topk_indices.mutable_data_ptr()), num_tokens, num_experts, topk, stream);
+        tk::invokeRenormMoeRouting<__nv_bfloat16, float, int32_t, DoSoftmaxBeforeTopK>(
+            reinterpret_cast<__nv_bfloat16*>(router_logits.data_ptr<at::BFloat16>()),
+            topk_values.data_ptr<float>(),
+            topk_indices.data_ptr<int32_t>(), num_tokens, num_experts, topk, stream);
@@
-        tk::invokeRenormMoeRouting<half, float, int32_t, DoSoftmaxBeforeTopK>(
-            reinterpret_cast<half*>(router_logits.mutable_data_ptr()),
-            reinterpret_cast<float*>(topk_values.mutable_data_ptr()),
-            reinterpret_cast<int32_t*>(topk_indices.mutable_data_ptr()), num_tokens, num_experts, topk, stream);
+        tk::invokeRenormMoeRouting<half, float, int32_t, DoSoftmaxBeforeTopK>(
+            reinterpret_cast<half*>(router_logits.data_ptr<at::Half>()),
+            topk_values.data_ptr<float>(),
+            topk_indices.data_ptr<int32_t>(), num_tokens, num_experts, topk, stream);
cpp/tensorrt_llm/kernels/topkLastDim.cu (3)

31-36: Include for INT_MAX used by nextPowerOfTwo

Missing header will break builds on some toolchains.

Apply:

 #include <cuda/std/limits>
 #include <limits>
+#include <climits>
 #include <thrust/iterator/counting_iterator.h>
 #include <thrust/iterator/transform_iterator.h>
 #include <type_traits>

1188-1198: Use CUDART_INF_F and cast; FP16/BF16 INF macros are non-standard

CUDART_INF_FP16/CUDART_INF_BF16 are not portable. Cast float INF.

Apply:

 template <>
 __device__ half negativeInfinity<half>()
 {
-    return -CUDART_INF_FP16;
+    return static_cast<half>(-CUDART_INF_F);
 }
 
 template <>
 __device__ __nv_bfloat16 negativeInfinity<__nv_bfloat16>()
 {
-    return -CUDART_INF_BF16;
+    return static_cast<__nv_bfloat16>(-CUDART_INF_F);
 }

1614-1618: Cannot launch global via function pointer with <<<>>>; use cudaLaunchKernel

Triple-chevron syntax with a function pointer is illegal. Switch to cudaLaunchKernel and pass an argv array.

Apply:

-    kernel_instance<<<moe_topk_grid_dim, moe_topk_block_dim, 0, stream>>>(in, out, out_idx, batch_size, len, k);
+    void* args[] = {
+        const_cast<void*>(reinterpret_cast<void const*>(&in)),
+        reinterpret_cast<void*>(&out),
+        reinterpret_cast<void*>(&out_idx),
+        reinterpret_cast<void*>(&batch_size),
+        reinterpret_cast<void*>(&len),
+        reinterpret_cast<void*>(&k),
+    };
+    auto st = cudaLaunchKernel(reinterpret_cast<void const*>(kernel_instance),
+                               moe_topk_grid_dim, moe_topk_block_dim, args, 0, stream);
+    TLLM_CHECK_WITH_INFO(st == cudaSuccess, "cudaLaunchKernel(moe_topk_kernel) failed");
🧹 Nitpick comments (5)
cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu (3)

35-37: Adopt kPREFIXED_UPPER_SNAKE for constants and update all references

Per repo guidelines, constants should be k-prefixed. Also update all usages below (tIdx, warpIdx, laneIdx, warpNum, blockDim).

Apply:

-static constexpr int BLOCK_SIZE = 1024;
-static constexpr int WARP_SIZE = 32;
-static constexpr int WARPS_PER_BLOCK = BLOCK_SIZE / WARP_SIZE;
+static constexpr int kBLOCK_SIZE = 1024;
+static constexpr int kWARP_SIZE = 32;
+static constexpr int kWARPS_PER_BLOCK = kBLOCK_SIZE / kWARP_SIZE;

And update references:

-    uint32_t const tIdx = BLOCK_SIZE * blockRank + threadIdx.x;
-    uint32_t const warpIdx = tIdx / WARP_SIZE;
-    uint32_t const laneIdx = tIdx % WARP_SIZE;
-    uint32_t const warpNum = gridDim.x * WARPS_PER_BLOCK;
+    uint32_t const tIdx = kBLOCK_SIZE * blockRank + threadIdx.x;
+    uint32_t const warpIdx = tIdx / kWARP_SIZE;
+    uint32_t const laneIdx = tIdx % kWARP_SIZE;
+    uint32_t const warpNum = gridDim.x * kWARPS_PER_BLOCK;
@@
-    auto warp = cg::tiled_partition<WARP_SIZE>(block);
+    auto warp = cg::tiled_partition<kWARP_SIZE>(block);
@@
-    dim3 renormMoeRoutingBlockDim(BLOCK_SIZE);
+    dim3 renormMoeRoutingBlockDim(kBLOCK_SIZE);
@@
-    const uint32_t numBlocks = std::min(static_cast<uint32_t>((numTokens - 1) / WARPS_PER_BLOCK + 1), maxNumBlocks);
+    const uint32_t numBlocks = std::min(static_cast<uint32_t>((numTokens - 1) / kWARPS_PER_BLOCK + 1), maxNumBlocks);

110-116: Update tiled partition template parameter to match renamed kWARP_SIZE

If you rename the constants, also fix the template parameter; otherwise it won’t compile.

Apply:

-    auto warp = cg::tiled_partition<WARP_SIZE>(block);
+    auto warp = cg::tiled_partition<kWARP_SIZE>(block);

220-227: Unreachable CASE(96) with nextPowerOfTwo; either remove or round to multiple of 32

maxNumExperts is computed via nextPowerOfTwo(…) with a minimum of 32; it will never be 96. Either remove CASE(96) or compute maxNumExperts as roundUpToMultipleOf(numExperts, 32) to actually select 96.

Option A (simpler): Remove the 96 case.

Option B (better specialization): replace maxNumExperts computation:

-    uint32_t maxNumExperts = nextPowerOfTwo(numExperts) < 32 ? 32 : nextPowerOfTwo(numExperts);
+    uint32_t maxNumExperts = static_cast<uint32_t>(((numExperts + 31) / 32) * 32);
cpp/tensorrt_llm/kernels/topkLastDim.cu (2)

1554-1613: Optional: remove unused ‘greater’ parameter in moe_reduce_topk or wire it into reduceTopK

The ‘greater’ flag is passed but never used; either drop it or plumb it through to select min/max variants.


1535-1552: nextPowerOfTwo uses INT_MAX; with header added it’s fine. Consider constexpr and unsigned arithmetic

Not blocking, but making it constexpr on device/host and using unsigned avoids UB on negatives.

Example:

-int nextPowerOfTwo(int num)
+__host__ __device__ constexpr unsigned nextPowerOfTwo(unsigned num)
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 055facc and 7fd0d1d.

📒 Files selected for processing (12)
  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu (1 hunks)
  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.h (2 hunks)
  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh (1 hunks)
  • cpp/tensorrt_llm/kernels/renormMoeRoutingKernels.cu (0 hunks)
  • cpp/tensorrt_llm/kernels/topkLastDim.cu (13 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh (2 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu (0 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu (0 hunks)
  • cpp/tensorrt_llm/thop/CMakeLists.txt (1 hunks)
  • cpp/tensorrt_llm/thop/customMoeRoutingOp.cpp (5 hunks)
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/routing.py (1 hunks)
💤 Files with no reviewable changes (3)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu
  • cpp/tensorrt_llm/kernels/renormMoeRoutingKernels.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu
🚧 Files skipped from review as they are similar to previous changes (6)
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.h
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh
  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
  • cpp/tensorrt_llm/thop/CMakeLists.txt
  • tensorrt_llm/_torch/modules/fused_moe/routing.py
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/thop/customMoeRoutingOp.cpp
  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu
  • cpp/tensorrt_llm/kernels/topkLastDim.cu
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/thop/customMoeRoutingOp.cpp
  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu
  • cpp/tensorrt_llm/kernels/topkLastDim.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/thop/customMoeRoutingOp.cpp
  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu
  • cpp/tensorrt_llm/kernels/topkLastDim.cu
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/thop/customMoeRoutingOp.cpp
  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu
  • cpp/tensorrt_llm/kernels/topkLastDim.cu
🧠 Learnings (6)
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • cpp/tensorrt_llm/thop/customMoeRoutingOp.cpp
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • cpp/tensorrt_llm/thop/customMoeRoutingOp.cpp
📚 Learning: 2025-08-12T10:28:57.320Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-12T10:28:57.320Z
Learning: Applies to **/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py} : Prepend NVIDIA copyright header (current year) to all source files

Applied to files:

  • cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu
📚 Learning: 2025-08-12T10:28:57.320Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-12T10:28:57.320Z
Learning: Applies to **/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh} : Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)

Applied to files:

  • cpp/tensorrt_llm/kernels/topkLastDim.cu
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • cpp/tensorrt_llm/kernels/topkLastDim.cu
📚 Learning: 2025-08-14T21:04:50.208Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • cpp/tensorrt_llm/kernels/topkLastDim.cu
🧬 Code Graph Analysis (3)
cpp/tensorrt_llm/thop/customMoeRoutingOp.cpp (1)
cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu (2)
  • invokeRenormMoeRouting (208-249)
  • invokeRenormMoeRouting (208-209)
cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu (1)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu (3)
  • void (81-226)
  • void (296-299)
  • void (305-366)
cpp/tensorrt_llm/kernels/topkLastDim.cu (4)
cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu (3)
  • void (71-99)
  • void (105-166)
  • T (42-68)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu (3)
  • void (69-290)
  • void (350-353)
  • void (360-410)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu (3)
  • void (81-226)
  • void (296-299)
  • void (305-366)
cpp/include/tensorrt_llm/common/cudaUtils.h (1)
  • getMultiProcessorCount (393-400)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (4)
cpp/tensorrt_llm/thop/customMoeRoutingOp.cpp (3)

28-38: Add input shape checks before extracting sizes

Call TORCH_CHECK on dimensionality before indexing sizes; you already do it, so this is good.


72-73: Confirm output tensor order (indices, values) matches schema and Python callers

The function returns {topk_indices, topk_values}. Ensure this order matches the Torch schema and downstream usage to avoid subtle bugs.

If needed, I can scan for callsites to validate expected order.


98-108: Nice: registers both renorm and default routing ops cleanly

Operator registration looks consistent and properly split by fragment/impl.

cpp/tensorrt_llm/kernels/topkLastDim.cu (1)

1176-1245: MOE small-input TopK kernel looks solid; cooperative_groups usage is correct

Warp tiling, bounds checks, and reduceTopK integration read clean. Once INF init fix lands, this path should be robust.

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15628 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15628 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11765 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@ChristinaZ
Copy link
Collaborator Author

/bot --post-merge

@github-actions
Copy link

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

Copy link
Collaborator

@QiJune QiJune left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@QiJune
Copy link
Collaborator

QiJune commented Aug 20, 2025

@ChristinaZ Please cherry pick this fix to release/1.0 branch as well.

@ChristinaZ
Copy link
Collaborator Author

@ChristinaZ Please cherry pick this fix to release/1.0 branch as well.

OK. Working on it.

@litaotju
Copy link
Collaborator

What performance tests had been done for this kernel optimization?

How it affects the DS R1 performance on Hopper and Blackwell? Thx

Also I didn't see any model file change after ading this new kernel, do we need that.

@ChristinaZ
Copy link
Collaborator Author

What performance tests had been done for this kernel optimization?

This PR mainly addresses a performance issue in the TRT backend. A key performance check involves using the moe_regression.py script from the NVBug. As observed in the NVBug, the topK operation was previously very slow because the topK plugin is optimized for large candidate numbers, not for cases like MoE. With this PR, the topK calculation time has been reduced to less than 4 microseconds.

How it affects the DS R1 performance on Hopper and Blackwell? Thx.

No, because the DS R1 uses their own customized routing method (https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/models/modeling_deepseekv3.py#L283). It doesn't use the default routing method we mentioned in this PR. So this PR doesn't influence the performance of DS R1

Also I didn't see any model file change after ading this new kernel, do we need that.

For now we don't need to change the model file. DeepSeek, Llama4, and Qwen3 don't use the default routing method. While some previous models might need it. As this is the default method and our key customer needs it, we still optimized it.

@litaotju litaotju merged commit c7269ea into NVIDIA:main Aug 21, 2025
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants