[None] [feat] Enable run_post_quant_allgather for MoE TRTLLM backend by ChristinaZ · Pull Request #6794 · NVIDIA/TensorRT-LLM · GitHub

Conversation

@ChristinaZ
Collaborator

@ChristinaZ ChristinaZ commented Aug 11, 2025

Add topk_ids and topk_weights as inputs for the MoE trtllm-gen backend

Add the two inputs topk_ids and topk_weights first.
Then, enable post-quantization allgather for the MoE trtllm-gen backend.

  1. Routing kernels
    I modified the kernels for the DeepSeek, Llama 4, and Renormalize routing methods in RoutingDeepSeek.cu, RoutingRenormalize.cu, and RoutingLlama4.cu.
  • Related C++ unit tests are also added in files such as routingTest.cpp.
  • Besides that, I updated the default value of maxExpertIdx in RoutingLlama4.cu so that NaN is handled correctly.
  • Added more detailed type checks like this:
        if (static_cast<RoutingMethodType>(routing_method_type) == RoutingMethodType::DeepSeekV3)
        {
            TORCH_CHECK(routing_logits.value().scalar_type() == at::ScalarType::Float, "routing_logits must be float");
        }
        else
        {
            TORCH_CHECK(
                routing_logits.value().scalar_type() == at::ScalarType::BFloat16, "routing_logits must be bfloat16");
        }
  • Renamed several variables to accommodate the new inputs (for example, mPtrExpertIdx is now mPtrTopKPacked).

Hi @MatthiasKohl and @nekorobov, could you please review this part?

  2. Add two parameters (topk_ids and topk_weights) to all the MoE trtllm-gen ops. For example:
    https://gitlab-master.nvidia.com/ftp/tekit/-/merge_requests/9611/diffs?file=1e58f3c66b27c6a2f10bc87b0a3381f3d846762c#1e58f3c66b27c6a2f10bc87b0a3381f3d846762c_673_731

Related unit tests are in tests/unittest/_torch/thop/test_moe.py.
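
As a rough illustration of what the new inputs look like (the shapes, dtypes, and routing math below are assumptions for this sketch only; the actual op signatures and per-method routing live in the files above):

import torch

num_tokens, num_experts, top_k = 128, 256, 8
routing_logits = torch.randn(num_tokens, num_experts, dtype=torch.bfloat16)

# Pre-compute the routing result outside the kernel.
topk_weights, topk_ids = torch.topk(routing_logits.float().softmax(dim=-1), k=top_k, dim=-1)
topk_ids = topk_ids.to(torch.int32)             # [num_tokens, top_k] expert indices
topk_weights = topk_weights.to(torch.bfloat16)  # [num_tokens, top_k] expert weights

# With this change, the MoE trtllm-gen ops accept topk_ids/topk_weights as
# optional inputs, in which case routing_logits may be omitted.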

  1. @DomBrown Hi Dom, this is the same PR you helped review before.

  2. Hi Daniel, since we are going to reuse the routing part for the post-quant allgather of the MoE trtllm-gen backend, I might need BF16 output for the trtllm-gen backend, so I added an output_dtype parameter to the routing method for CUTLASS. Could you take a look (for example, routing.py)?

  3. @rosenrodt Hi Anthony, I added the post-quant allgather logic. Please help review fused_moe_trtllm_gen.py; a rough sketch of the flow is included below.
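
For reviewers' convenience, here is a minimal sketch of the post-quant allgather flow described above. It is not the actual implementation: quantize_fn, allgather_fn, and moe_kernel are placeholders for the backend-specific quantization, the DP allgather op, and the trtllm-gen MoE kernel.

def forward_with_post_quant_allgather(hidden_states, router_logits,
                                      routing_method, quantize_fn,
                                      allgather_fn, moe_kernel):
    # 1. Route locally. With this PR the routing method can emit bf16 weights
    #    (output_dtype=torch.bfloat16), so no extra cast is needed later.
    token_selected_experts, token_final_scales = routing_method.apply(router_logits)

    # 2. Quantize the local activations first (fp8 / nvfp4 / mxfp4, per backend).
    x_quant, x_scales = quantize_fn(hidden_states)

    # 3. All-gather the quantized activations plus the per-token top-k results
    #    across the DP ranks, instead of gathering full-precision activations.
    x_quant, x_scales, token_selected_experts, token_final_scales = allgather_fn(
        [x_quant, x_scales, token_selected_experts, token_final_scales])

    # 4. Run the fused MoE kernel with the pre-computed top-k ids/weights,
    #    which skips the in-kernel routing-from-logits path.
    return moe_kernel(x_quant, x_scales,
                      topk_ids=token_selected_experts,
                      topk_weights=token_final_scales)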

Test Coverage

cd cpp/build
make -j$(nproc) google-tests
./tests/unit_tests/kernels/routingKernelsTest

pytest -v -s tests/unittest/_torch/thop/test_moe.py

pytest -s -o log_cli=true "tests/integration/defs/accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4_4gpus[moe_backend=TRTLLM-mtp_nextn=2-tp4-fp8kv=True-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False]"

Summary by CodeRabbit

  • New Features

    • Optional Top‑K routing inputs (ids + weights) and dual Top‑K output formats; routing no longer requires per‑token logits.
  • Improvements

    • Kernelized histogram init and unified Top‑K handling across FP4/FP8/BF16 with device-aware allocations and preserved legacy score path.
    • Routing output dtype now configurable (bf16 or fp32); autotuning/custom ops propagate Top‑K data.
  • Tests

    • Expanded unit tests to exercise Top‑K‑as‑input flows across backends.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
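
For example, to run only one test stage and keep going on failures (stage name taken from the examples above):

/bot run --stage-list "A10-PyTorch-1" --disable-fail-fast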

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@coderabbitai
Contributor

coderabbitai bot commented Aug 11, 2025

📝 Walkthrough

Walkthrough

Adds an optional Top‑K input/output path across BlockScale MoE: introduces mPtrTopKIds/mPtrTopKWeights/mPtrTopKPacked, a new routingInitExpertCounts kernel, and branches host/device flows to use Top‑K inputs when provided; propagates these changes through runners, THOP wrappers, fused MoE modules, custom routing ops/kernels, and tests.

Changes

Cohort / File(s) Summary
Routing kernels: DeepSeek / Llama4 / Renormalize
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu, .../RoutingLlama4.cu, .../RoutingRenormalize.cu
Add dual Top‑K paths (mPtrTopKIds/mPtrTopKWeights/mPtrTopKPacked), gate per‑token flows on TopK vs scores, adapt top‑group/top‑expert selection and coop path to read either TopKIds or TopKPacked, and replace cudaMemset init with routingInitExpertCounts.
Kernel interfaces & helpers
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h, .../RoutingKernel.cuh, .../RoutingKernelTopK.cuh
Public KernelParams and DataBase gain Top‑K pointers; references to mPtrExpertIdx/Weights renamed to TopK variants; add __global__ routingInitExpertCounts and Top‑K aware histogram/offset kernels; small TopK comparator/packing tweak.
Runner API / dispatch
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h, .../runner.cu
Runner::run signature updated to accept int32_t* expertIds; routingData wires TopKPacked/TopKWeights/TopKIds and sets mPtrScores nullable when TopKIds provided; host dispatch branches to init kernel + Top‑K path or legacy scores path.
THOP C++ runners (FP4/FP8/MxE)
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp, .../fp8BlockScaleMoe.cpp, .../fp8PerTensorScaleMoe.cpp, .../mxFp4BlockScaleMoe.cpp
Make routing_logits optional; add optional topk_weights/topk_ids, validate per active path, choose routing device/stream from active input, alias expert_weights_ptr to topk_weights when used, allocate routing workspace on routing device, and forward Top‑K inputs to Runner.
Python wrappers / custom ops
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
Make routing_logits optional in input dataclasses and runner wrappers; add optional topk_weights/topk_ids; synthesize logits for autotune when missing and ensure top‑K inputs are forwarded to kernel runners.
Custom routing op / kernel renames & dtype
cpp/tensorrt_llm/kernels/customMoeRoutingKernels.cu, .../customMoeRoutingKernels.h, cpp/tensorrt_llm/thop/customMoeRoutingOp.cpp, tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
Rename invokeRenormMoeRouting → invokeCustomMoeRouting; extend routing ops to accept optional output_dtype, implement runtime branching/instantiation for fp32/bf16/half outputs, and update Torch registrations and Python fake ops to accept output_dtype.
Fused MoE & routing methods
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py, .../routing.py
Add output_dtype parameter to routing methods and wire it through softmax/sigmoid/TRTL calls; add DP post‑quant allgather path producing token_selected_experts/token_final_scales, quantize per‑backend, and forward top‑K results into MOE runner calls.
Model callsites
tensorrt_llm/_torch/models/modeling_deepseekv3.py, .../modeling_gpt_oss.py, .../modeling_qwen3_moe.py
Remove certain DP allgather branches for TRTLLM fused MoE, pass backend‑aware output_dtype into routing method constructors, and adjust imports/conditions accordingly.
C++ kernel tests
cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp, .../routingLlama4Test.cpp, .../routingRenormalizeTest.cpp, .../routingTest.cpp, .../routingTest.h
Tests switched to TopK‑centric buffers (TopKPacked/TopKIds/TopKWeights), add useTopKAsInput flag, add test variants exercising Top‑K‑as‑input and update allocation, host/device copy and verification flows.
Python unit tests
tests/unittest/_torch/thop/test_moe.py
Add use_topk_as_input parameterization, compute and forward topk_ids/topk_weights in tests when enabled, and extend test invocations across FP8/FP4/Mx variants.
Misc: test bindings & small tweaks
cpp/tests/..., tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
Update fake ops to accept output_dtype and return second tensor with configurable dtype; small comparator refactor in TopK helper.

Sequence Diagram(s)

sequenceDiagram
  actor PyCaller as Python
  participant Wrapper as Torch wrapper
  participant Fused as FusedMoE
  participant Runner as C++ Runner
  participant Kernels as CUDA kernels

  PyCaller->>Wrapper: call runner(routing_logits?, topk_weights?, topk_ids?)
  Wrapper->>Fused: prepare inputs (synthesize logits for autotune if None) + topk inputs
  alt DP post-quant path (fused)
    Fused->>Fused: post-quant allgather -> token_selected_experts, token_final_scales
    Fused->>Runner: run(expertWeightsPtr, expertIds=token_selected_experts, ...)
  else Normal path
    Fused->>Runner: run(expertWeightsPtr, expertIds=topk_ids or nullptr, ...)
  end
  alt TopKIds provided
    Runner->>Kernels: launch routingInitExpertCounts
    Runner->>Kernels: launch TopK-driven routing kernels (read TopKIds/TopKWeights/TopKPacked)
  else Scores path (no TopKIds)
    Runner->>Kernels: launch routingMainKernel (compute scores -> TopKPacked/Weights)
  end
  Kernels-->>Runner: return TopK outputs (packed/ids/weights)
  Runner-->>Fused: routing results
  Fused-->>Wrapper: final outputs
  Wrapper-->>PyCaller: return tensors

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Suggested labels

SW Architecture, weekly_release_blocker

Suggested reviewers

  • rosenrodt
  • nekorobov
  • byshiue
  • yizhang-nv
  • litaotju


@ChristinaZ ChristinaZ changed the title [None][feat] Enable run_post_quant_allgather for MoE TRTLLM backend [None][feat] Enable run_post_quant_allgather for MoE TRTLLM backend Aug 11, 2025
@ChristinaZ ChristinaZ changed the title [None][feat] Enable run_post_quant_allgather for MoE TRTLLM backend [None] [feat] Enable run_post_quant_allgather for MoE TRTLLM backend Aug 11, 2025
@ChristinaZ ChristinaZ changed the title [None] [feat] Enable run_post_quant_allgather for MoE TRTLLM backend [None] [feat] Enable run_post_quant_allgather for MoE TRTLLM backend Aug 11, 2025
@ChristinaZ
Collaborator Author

/bot run --add-multi-gpu-test

@tensorrt-cicd
Collaborator

PR_Github #14804 [ run ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

🔭 Outside diff range comments (3)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)

280-281: Comment inconsistency with code logic.

The comment mentions "Let CutlassFusedMoE and TRTLLMGenFusedMoE handle allgather internally", but TRTLLMGenFusedMoE has been removed from the imports and is no longer checked in the condition at line 267.

Update the comment to reflect the current implementation:

-        # Let CutlassFusedMoE and TRTLLMGenFusedMoE handle allgather internally
+        # Let CutlassFusedMoE handle allgather internally
cpp/tests/unit_tests/kernels/routing/routingTest.cpp (1)

299-321: Add direct verification for TopK IDs when provided as input

When useTopKAsInput is true, also assert the device TopK IDs equal the host-generated IDs to catch any unintended modifications.

@@
-    auto const expertWeightsHost = mBufferManager->copyFrom(*mPtrTopKWeightsDevice, MemoryType::kCPU);
+    auto const expertWeightsHost = mBufferManager->copyFrom(*mPtrTopKWeightsDevice, MemoryType::kCPU);
+    if (param.useTopKAsInput)
+    {
+        auto const topKIdsHostCopy = mBufferManager->copyFrom(*mPtrTopKIdsDevice, MemoryType::kCPU);
+        assertEqual(bufferCast<int32_t>(*mPtrTopKIdsHost), bufferCast<int32_t>(*topKIdsHostCopy),
+            param.numTokens * param.topK, "topk ids");
+    }
@@
     if (param.getExpWeights)
     {
         EXPECT_EQ(isClose(bufferCast<T>(*mPtrTopKWeightsHost), expertWeightsPtr, param.numTokens * param.topK,
                       "expert weights"),
             true);
     }
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (1)

2-2: Update copyright year to include 2025

According to the coding guidelines, the copyright header should include the current year.

- * Copyright (c) 2022-2024, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2022-2025, NVIDIA CORPORATION.  All rights reserved.
🧹 Nitpick comments (11)
cpp/tensorrt_llm/thop/renormMoeRoutingOp.cpp (2)

36-36: Consider documenting the topk <= 8 restriction more prominently.

The comment mentions this is a temporary restriction that should be removed later. Consider tracking this technical debt with a GitHub issue and reference it in the TODO comment for better visibility.

-    TORCH_CHECK(topk <= 8, "topk should be smaller than or equal to 8 for now"); //@todo: remove this restriction later
+    TORCH_CHECK(topk <= 8, "topk should be smaller than or equal to 8 for now"); // TODO(#ISSUE_NUMBER): remove this restriction later

52-100: Consider refactoring the repetitive dtype branching logic.

The switch-case contains repetitive patterns that could be simplified using template instantiation or a helper function to reduce code duplication and improve maintainability.

tensorrt_llm/_torch/modules/fused_moe/routing.py (2)

264-265: Redundant float() conversion before sigmoid.

The .float() conversion on line 265 appears redundant when output_dtype is already torch.float32. Consider conditionally applying the conversion only when needed.

-        return topk_indices.to(torch.int32), torch.sigmoid(
-            topk_values.float()).to(self.output_dtype)
+        sigmoid_input = topk_values if topk_values.dtype == torch.float32 else topk_values.float()
+        return topk_indices.to(torch.int32), torch.sigmoid(sigmoid_input).to(self.output_dtype)

395-397: Redundant storage of parameters in child class.

The RenormalizeNaiveMoeRoutingMethod stores top_k and output_dtype locally even though the parent class already stores them. This creates redundancy.

 def __init__(self, top_k: int, output_dtype: torch.dtype = torch.float32):
     super().__init__(top_k, output_dtype)
-    self.top_k = top_k
-    self.output_dtype = output_dtype
cpp/tests/unit_tests/kernels/routing/routingTest.cpp (1)

104-107: Consider alignment guarantees for PackedType buffers

mPtrTopKPackedHost/Device are allocated as INT8 with size in bytes and then reinterpreted as PackedType*. While this is likely fine on GPU, host-side aliasing/alignment could be fragile. If feasible, prefer allocations that guarantee alignment to alignof(PackedType), or document BufferManager’s alignment guarantees for pinned CPU buffers.

cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp (1)

38-143: Reduce duplicated host TopK computation across tests (optional)

computeTopKExperts implementations across Renormalize/Llama4/DeepSeek are largely duplicated with small differences. Consider lifting shared logic into the base (e.g., a templated helper) to reduce maintenance burden.

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh (1)

792-817: Initialize-expert-counts kernel: const-qualify loop bounds and indices

Minor cleanup: variables not modified after init should be const to follow code guidelines.

-    int32_t expertCountsNum = 2 * params.mNumExperts;
-    int32_t globalThreadIdx = blockIdx.x * NumThreadsHist + threadIdx.x;
-    int32_t globalThreadStride = gridDim.x * NumThreadsHist;
+    int32_t const expertCountsNum = 2 * params.mNumExperts;
+    int32_t const globalThreadIdx = blockIdx.x * NumThreadsHist + threadIdx.x;
+    int32_t const globalThreadStride = gridDim.x * NumThreadsHist;
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (1)

98-116: Optional: provide a backward-compatible run overload

If you need to preserve source compatibility for downstream code, consider adding an overload that forwards expertIds=nullptr to the new signature.

I can draft the overload in runner.cu to forward to the new signature if helpful.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (1)

249-251: Improve error message specificity.

The error message should list the supported quantization modes for better debugging.

-                raise ValueError(
-                    f"unsupported quantization mode with run_post_quant_allgather: {self.quant_config.quant_mode}"
-                )
+                raise ValueError(
+                    f"Unsupported quantization mode '{self.quant_config.quant_mode}' with run_post_quant_allgather. "
+                    f"Supported modes: fp8_qdq, nvfp4, w4a8_mxfp4_fp8, w4a8_mxfp4_mxfp8, w4a16_mxfp4"
+                )
tests/unittest/_torch/thop/test_moe.py (1)

1546-1550: Consider relaxing test constraints for broader coverage.

The current constraints limit use_topk_as_input testing to a very specific configuration. Consider testing with additional configurations to ensure broader compatibility.

The TopK-as-input functionality should ideally work with various dtype_activation values and top_k settings. Consider adding at least one more configuration variant to improve test coverage.

tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (1)

289-292: Consider simplifying tensor list construction

The current pattern of appending then replacing could be simplified for better readability:

-    input_tensors = input_tensors_for_tuner + [topk_weights, topk_ids]
-    input_tensors[
-        0] = routing_logits  # replace dummy routing logits with actual routing logits
+    # Build final input list with actual routing_logits
+    input_tensors = [routing_logits] + input_tensors_for_tuner[1:] + [topk_weights, topk_ids]

This makes the intent clearer without the need for index replacement.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5145e9d and d2096a7.

📒 Files selected for processing (25)
  • cpp/tensorrt_llm/kernels/renormMoeRoutingKernels.cu (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu (4 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh (5 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h (9 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu (11 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu (7 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu (5 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (2 hunks)
  • cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (8 hunks)
  • cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp (6 hunks)
  • cpp/tensorrt_llm/thop/fp8PerTensorScaleMoe.cpp (6 hunks)
  • cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp (12 hunks)
  • cpp/tensorrt_llm/thop/renormMoeRoutingOp.cpp (3 hunks)
  • cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp (9 hunks)
  • cpp/tests/unit_tests/kernels/routing/routingLlama4Test.cpp (5 hunks)
  • cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp (7 hunks)
  • cpp/tests/unit_tests/kernels/routing/routingTest.cpp (8 hunks)
  • cpp/tests/unit_tests/kernels/routing/routingTest.h (6 hunks)
  • tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (27 hunks)
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_gpt_oss.py (4 hunks)
  • tensorrt_llm/_torch/models/modeling_qwen3_moe.py (3 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (7 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/routing.py (5 hunks)
  • tests/unittest/_torch/thop/test_moe.py (14 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • cpp/tensorrt_llm/kernels/renormMoeRoutingKernels.cu
  • cpp/tests/unit_tests/kernels/routing/routingTest.cpp
  • cpp/tests/unit_tests/kernels/routing/routingLlama4Test.cpp
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h
  • cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp
  • tensorrt_llm/_torch/modules/fused_moe/routing.py
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h
  • tensorrt_llm/_torch/models/modeling_gpt_oss.py
  • tensorrt_llm/_torch/models/modeling_qwen3_moe.py
  • cpp/tensorrt_llm/thop/fp8PerTensorScaleMoe.cpp
  • cpp/tensorrt_llm/thop/renormMoeRoutingOp.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu
  • tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu
  • cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
  • cpp/tests/unit_tests/kernels/routing/routingTest.h
  • cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp
  • cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu
  • tests/unittest/_torch/thop/test_moe.py
  • cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu
**/*.{cpp,h,hpp,cc,cxx}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...

Files:

  • cpp/tests/unit_tests/kernels/routing/routingTest.cpp
  • cpp/tests/unit_tests/kernels/routing/routingLlama4Test.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h
  • cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h
  • cpp/tensorrt_llm/thop/fp8PerTensorScaleMoe.cpp
  • cpp/tensorrt_llm/thop/renormMoeRoutingOp.cpp
  • cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
  • cpp/tests/unit_tests/kernels/routing/routingTest.h
  • cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp
  • cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp
  • cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
  • tensorrt_llm/_torch/modules/fused_moe/routing.py
  • tensorrt_llm/_torch/models/modeling_gpt_oss.py
  • tensorrt_llm/_torch/models/modeling_qwen3_moe.py
  • tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
  • tests/unittest/_torch/thop/test_moe.py
**/*.{h,hpp}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.

Files:

  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h
  • cpp/tests/unit_tests/kernels/routing/routingTest.h
🧠 Learnings (2)
📚 Learning: 2025-08-08T22:03:40.685Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.685Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • cpp/tensorrt_llm/thop/fp8PerTensorScaleMoe.cpp
  • cpp/tensorrt_llm/thop/renormMoeRoutingOp.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu
  • cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
  • cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu
  • cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu
📚 Learning: 2025-08-08T04:10:18.987Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:18.987Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.

Applied to files:

  • cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp
🧬 Code Graph Analysis (18)
cpp/tests/unit_tests/kernels/routing/routingTest.cpp (4)
cpp/tests/unit_tests/kernels/routing/routingTest.h (2)
  • useTopKAsInput (225-484)
  • topK (210-210)
cpp/tests/unit_tests/kernels/routing/routingLlama4Test.cpp (6)
  • param (39-91)
  • param (39-39)
  • param (93-99)
  • param (93-93)
  • param (110-116)
  • param (110-111)
cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp (6)
  • param (39-144)
  • param (39-39)
  • param (146-152)
  • param (146-146)
  • param (175-181)
  • param (175-176)
cpp/tensorrt_llm/kernels/renormMoeRoutingKernels.cu (6)
  • value (65-74)
  • value (65-65)
  • topK (134-134)
  • topK (140-140)
  • topK (149-149)
  • topK (160-160)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (4)
tensorrt_llm/_torch/distributed/ops.py (1)
  • MoEAllReduce (547-635)
tensorrt_llm/_torch/modules/embedding.py (1)
  • Embedding (164-242)
tensorrt_llm/_torch/modules/fused_moe/routing.py (1)
  • DeepSeekV3MoeRoutingMethod (204-214)
tensorrt_llm/_torch/modules/fused_moe/create_moe.py (1)
  • create_moe (60-211)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (6)
tensorrt_llm/_torch/distributed/ops.py (1)
  • allgather (141-225)
tensorrt_llm/functional.py (10)
  • allgather (4142-4226)
  • shape (270-274)
  • shape (277-282)
  • shape (2056-2096)
  • dtype (255-259)
  • dtype (262-267)
  • view (408-412)
  • view (1745-1790)
  • flatten (414-418)
  • flatten (1793-1828)
tensorrt_llm/_torch/utils.py (3)
  • Fp4QuantizedTensor (92-99)
  • shape (98-99)
  • _ (185-191)
tensorrt_llm/_torch/models/modeling_qwen3_moe.py (1)
  • routing_method (65-78)
tensorrt_llm/_torch/modules/fused_moe/routing.py (8)
  • apply (160-170)
  • apply (191-197)
  • apply (238-245)
  • apply (259-265)
  • apply (300-336)
  • apply (353-355)
  • apply (367-389)
  • apply (399-403)
tensorrt_llm/_torch/modules/fused_moe/interface.py (4)
  • has_fp8_qdq (118-121)
  • has_w4a8_mxfp4_fp8 (136-139)
  • has_nvfp4 (130-133)
  • has_w4a8_mxfp4_mxfp8 (142-145)
cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp (2)
cpp/tests/unit_tests/kernels/routing/routingLlama4Test.cpp (6)
  • param (39-91)
  • param (39-39)
  • param (93-99)
  • param (93-93)
  • param (110-116)
  • param (110-111)
cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp (8)
  • param (47-152)
  • param (47-47)
  • param (154-165)
  • param (154-154)
  • param (167-176)
  • param (167-167)
  • param (192-198)
  • param (192-193)
tensorrt_llm/_torch/modules/fused_moe/routing.py (3)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (1)
  • top_k (226-226)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (2)
  • apply (337-342)
  • apply (402-404)
tensorrt_llm/functional.py (2)
  • Tensor (107-602)
  • topk (7308-7404)
tensorrt_llm/_torch/models/modeling_gpt_oss.py (2)
tensorrt_llm/_torch/modules/fused_moe/create_moe.py (1)
  • create_moe (60-211)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (1)
  • top_k (226-226)
tensorrt_llm/_torch/models/modeling_qwen3_moe.py (5)
tensorrt_llm/_torch/distributed/ops.py (1)
  • MoEAllReduce (547-635)
tensorrt_llm/runtime/generation.py (1)
  • ModelConfig (608-654)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
  • fused_moe (120-232)
tensorrt_llm/_torch/modules/fused_moe/routing.py (10)
  • BaseMoeRoutingMethod (158-181)
  • RenormalizeMoeRoutingMethod (217-249)
  • RenormalizeNaiveMoeRoutingMethod (392-407)
  • RoutingMethodType (143-155)
  • routing_method_type (180-181)
  • routing_method_type (200-201)
  • routing_method_type (213-214)
  • routing_method_type (248-249)
  • routing_method_type (268-269)
  • routing_method_type (406-407)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (2)
  • RoutingMethodType (39-117)
  • top_k (226-226)
cpp/tensorrt_llm/thop/fp8PerTensorScaleMoe.cpp (5)
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (1)
  • routing_logits (338-348)
cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp (1)
  • routing_logits (302-309)
cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp (2)
  • routing_logits (437-447)
  • routing_logits (496-509)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (7)
  • num_experts (219-219)
  • top_k (226-226)
  • n_group (227-227)
  • topk_group (229-229)
  • intermediate_size (231-231)
  • local_expert_offset (232-232)
  • local_num_experts (233-233)
tensorrt_llm/_torch/modules/fused_moe/routing.py (6)
  • routing_method_type (180-181)
  • routing_method_type (200-201)
  • routing_method_type (213-214)
  • routing_method_type (248-249)
  • routing_method_type (268-269)
  • routing_method_type (406-407)
cpp/tensorrt_llm/thop/renormMoeRoutingOp.cpp (1)
cpp/tensorrt_llm/kernels/renormMoeRoutingKernels.cu (2)
  • invokeRenormMoeRouting (322-363)
  • invokeRenormMoeRouting (322-323)
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (5)
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (1)
  • routing_logits (338-348)
cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp (1)
  • routing_logits (302-309)
cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp (2)
  • routing_logits (437-447)
  • routing_logits (496-509)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (8)
  • do_finalize (251-252)
  • num_experts (219-219)
  • top_k (226-226)
  • n_group (227-227)
  • topk_group (229-229)
  • intermediate_size (231-231)
  • local_expert_offset (232-232)
  • local_num_experts (233-233)
tensorrt_llm/_torch/autotuner.py (1)
  • choose_one (319-453)
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (3)
cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp (1)
  • routing_logits (302-309)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (5)
  • num_experts (219-219)
  • top_k (226-226)
  • getMaxPermutedPaddedCount (68-74)
  • intermediate_size (231-231)
  • getMaxNumCtasInBatchDim (76-96)
cpp/tensorrt_llm/kernels/quantization.h (1)
  • computeSwizzledLayoutSFSize (53-58)
cpp/tests/unit_tests/kernels/routing/routingTest.h (4)
cpp/tests/unit_tests/kernels/routing/routingLlama4Test.cpp (6)
  • param (39-91)
  • param (39-39)
  • param (93-99)
  • param (93-93)
  • param (110-116)
  • param (110-111)
cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp (6)
  • param (39-144)
  • param (39-39)
  • param (146-152)
  • param (146-146)
  • param (175-181)
  • param (175-176)
cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp (4)
  • param (47-152)
  • param (47-47)
  • param (154-165)
  • param (154-154)
cpp/tensorrt_llm/kernels/renormMoeRoutingKernels.cu (1)
  • T (213-239)
cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp (5)
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (4)
  • routing_logits (338-348)
  • routing_runner (203-203)
  • nodiscard (338-368)
  • nodiscard (370-374)
cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp (7)
  • routing_logits (437-447)
  • routing_logits (496-509)
  • routing_runner (233-233)
  • nodiscard (430-434)
  • nodiscard (437-465)
  • nodiscard (490-494)
  • nodiscard (496-527)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (9)
  • num_experts (219-219)
  • top_k (226-226)
  • n_group (227-227)
  • topk_group (229-229)
  • intermediate_size (231-231)
  • local_expert_offset (232-232)
  • local_num_experts (233-233)
  • getMaxPermutedPaddedCount (68-74)
  • getMaxNumCtasInBatchDim (76-96)
tensorrt_llm/_torch/modules/fused_moe/routing.py (6)
  • routing_method_type (180-181)
  • routing_method_type (200-201)
  • routing_method_type (213-214)
  • routing_method_type (248-249)
  • routing_method_type (268-269)
  • routing_method_type (406-407)
cpp/tensorrt_llm/thop/fp8PerTensorScaleMoe.cpp (1)
  • routing_runner (195-195)
cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp (2)
cpp/tests/unit_tests/kernels/routing/routingLlama4Test.cpp (6)
  • param (39-91)
  • param (39-39)
  • param (93-99)
  • param (93-93)
  • param (110-116)
  • param (110-111)
cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp (6)
  • param (39-144)
  • param (39-39)
  • param (146-152)
  • param (146-146)
  • param (175-181)
  • param (175-176)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu (1)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu (1)
  • routingTopKExperts (32-35)
tests/unittest/_torch/thop/test_moe.py (5)
tensorrt_llm/_torch/autotuner.py (1)
  • autotune (195-206)
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (1)
  • fp8_block_scale_moe_runner (511-574)
cpp/tests/unit_tests/kernels/routing/routingLlama4Test.cpp (6)
  • param (39-91)
  • param (39-39)
  • param (93-99)
  • param (93-93)
  • param (110-116)
  • param (110-111)
cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp (6)
  • param (39-144)
  • param (39-39)
  • param (146-152)
  • param (146-146)
  • param (175-181)
  • param (175-176)
cpp/tensorrt_llm/thop/fp8PerTensorScaleMoe.cpp (2)
  • fp8_per_tensor_scale_moe_runner (28-283)
  • fp8_per_tensor_scale_moe_runner (28-37)
cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp (6)
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (1)
  • routing_logits (338-348)
cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp (1)
  • routing_logits (302-309)
tensorrt_llm/_torch/modules/fused_moe/routing.py (6)
  • routing_method_type (180-181)
  • routing_method_type (200-201)
  • routing_method_type (213-214)
  • routing_method_type (248-249)
  • routing_method_type (268-269)
  • routing_method_type (406-407)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (1)
  • hidden_states (2160-2160)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (5)
  • num_experts (219-219)
  • top_k (226-226)
  • getMaxPermutedPaddedCount (68-74)
  • intermediate_size (231-231)
  • getMaxNumCtasInBatchDim (76-96)
cpp/tensorrt_llm/kernels/quantization.h (1)
  • computeSwizzledLayoutSFSize (53-58)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu (3)
cpp/tensorrt_llm/kernels/renormMoeRoutingKernels.cu (2)
  • reduceTopK (171-172)
  • reduce (88-100)
cpp/kernels/xqa/mha.cu (1)
  • warp (307-307)
cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu (1)
  • __syncthreads (127-144)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/models/modeling_deepseekv3.py

56-56: Redefinition of unused ModelConfig from line 55

(F811)


56-56: Redefinition of unused QuantConfig from line 47

(F811)

🪛 Gitleaks (8.27.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py

222-222: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (42)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu (2)

216-219: Verify the conditional logic for TopKWeights population.

The condition params.mPtrTopKWeights != nullptr && params.mPtrTopKIds == nullptr might create an unexpected state where weights are populated but IDs are not. Please verify this is the intended behavior for backward compatibility.


456-461: LGTM! Clear error messages for dual-path routing.

The updated validation checks and error messages properly communicate the requirements for the new dual-path routing support.

cpp/tests/unit_tests/kernels/routing/routingTest.h (1)

223-225: LGTM! Test parameter properly added for dual-path testing.

The useTopKAsInput parameter is correctly integrated into the test framework to support testing both routing paths.

cpp/tensorrt_llm/thop/fp8PerTensorScaleMoe.cpp (2)

193-194: Good use of ternary operator for dual-path support.

The expert_weights_ptr setup correctly handles both the case where topk_weights is provided as input and when it needs to be computed.


63-67: Branch-specific shape validations are correct
The token count check for routing_logits only applies in the routing-logits path, and in the topk_ids/topk_weights path we already validate that both topk_ids and topk_weights match hidden_states on the token dimension (and their second dimension matches top_k). No additional cross-validation is needed.

Optional: you may add a brief inline comment above each if/else if to document that each branch enforces its own shape requirements.

cpp/tensorrt_llm/kernels/renormMoeRoutingKernels.cu (1)

370-377: LGTM! Template instantiations properly extend type support.

The new instantiations correctly add support for BF16 output types, which aligns with the PR's goal of supporting BF16 output for TRTLLM routing.

tensorrt_llm/_torch/models/modeling_gpt_oss.py (1)

153-156: LGTM! Proper output_dtype configuration for TRTLLM backend.

The conditional selection of BF16 for TRTLLM backend and FP32 for others is correct and aligns with the PR's routing enhancements.

tensorrt_llm/_torch/models/modeling_qwen3_moe.py (1)

67-75: LGTM! Consistent output_dtype handling across routing methods.

The output_dtype configuration for both RenormalizeNaiveMoeRoutingMethod and RenormalizeMoeRoutingMethod correctly uses BF16 for TRTLLM backend, maintaining consistency with other MoE implementations.

cpp/tests/unit_tests/kernels/routing/routingTest.cpp (3)

75-89: TopK host/device buffers allocation looks correct

Allocation for TopK weights is unconditional and TopK IDs are gated by useTopKAsInput. This matches the intended dual-path behavior.


124-169: Using TopKPacked for permutation is consistent with new data path

Switching from expert index arrays to TopKPacked for computing token->expert routing is correct and consistent across the test suite.

Also applies to: 171-185


360-367: Ordering of host preparation and H2D copies is sensible

Calling callHostFunction before launching the device kernels and then copying TopK inputs when useTopKAsInput is enabled is correct.

cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp (2)

127-141: TopK packed + conditional TopKIds/TopKWeights host writes are correct

The conversion to PackedType and conditional writes for TopKIds/TopKWeights align with the new dual-input path.


186-205: Expanded test coverage for TopK-as-input scenarios is good

The added variants with useTopKExpertsAsInput true/false and adjusted numTokens effectively exercise both paths.

Also applies to: 208-227, 230-249, 252-271

cpp/tests/unit_tests/kernels/routing/routingLlama4Test.cpp (4)

73-89: TopK packed + conditional TopKIds/TopKWeights host writes are correct

Store to TopKPacked and optionally expose TopKIds/TopKWeights. This matches the intended input/output contract for kernels.


101-108: Passing device TopKPacked into routingData is correct

setParams now routes mPtrTopKPacked, aligning with kernels that can consume packed TopK when TopKIds inputs aren’t provided.


121-153: Good addition of TopK-as-input test variants

Warp/Cluster/Device variants with useTopKExpertsAsInput=true expand coverage for the new input path.

Also applies to: 155-187


58-65: Follow up on comparator tie-breaker TODO

The comparator tie-breaker for equal scores is marked TODO. Ensure consistent behavior with device-side selection to avoid nondeterministic test failures.

Would you like me to extract the device-side comparator semantics and mirror them here to guarantee determinism?

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh (3)

235-244: Dual source load for TopK indices/weights is correct

routingPermutation’s LoadExpertIdxFromGlobal branch properly supports TopKIds/TopKWeights or TopKPacked fallback.


259-262: Conditional writeback of weights respects TopKIds-as-input path

Only materializing weights when mPtrTopKIds == nullptr avoids clobbering input-provided weights. Good.


646-653: Offsets kernel correctly derives expert index from TopKIds or TopKPacked

This path does not depend on TopKWeights presence, avoiding the histogram kernel issue. Good.

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (1)

202-205: MoERunnerArgs: New optional TopK inputs are clearly documented

The additional topk_weights/topk_ids inputs align with the new data path and are nullable by default.

cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp (3)

76-118: LGTM! Well-structured grouped vs non-grouped top-K selection.

The implementation correctly handles both grouped and non-grouped scenarios with clear separation of logic. The finalTopkExperts array properly consolidates results from both paths.


135-150: Output buffer writing logic is correct.

The implementation properly handles all three output modes while maintaining backward compatibility with the packed format.


214-221: Good test coverage with the new TopK-as-input test case.

The new test case ClusterLevelParallelizationWithTopKAsInput properly exercises the TopK input path with appropriate parameters.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (2)

254-267: Allgather implementation is correct.

The scaling factor reshaping and flattening logic properly handles the distributed gathering operation.


298-335: NVFP4 path correctly handles both quantization modes.

The conditional quantization and new TopK parameters are properly integrated.

tests/unittest/_torch/thop/test_moe.py (2)

842-854: Good test parameterization for TopK input modes.

The parameterization with descriptive IDs ("use_score_as_input", "use_topk_as_input") makes the test intent clear.


914-927: TopK parameter extraction is correct.

The conditional logic properly extracts and converts TopK IDs and weights with appropriate data types.

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h (2)

61-77: Well-documented TopK data structure changes.

The new TopK members are clearly documented with dimensions and purpose, making the API intent clear.


121-146: Consistent parameter structure updates.

The KernelParamsBase changes properly mirror the DataBase structure with appropriate type casting.

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu (3)

59-65: Clean API signature update.

The new expertIds parameter is logically positioned and the signature reorganization improves clarity.


76-90: Correct routing data setup for DeepSeekV3.

The conditional scores assignment and TopK data wiring properly support both routing paths.


116-174: Consistent routing setup across all methods.

The Llama4 and Renormalize paths correctly implement the same TopK data wiring pattern as DeepSeekV3.

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu (3)

377-379: Good fix for cluster synchronization

Making the cluster barrier unconditional ensures proper synchronization across all threads, regardless of token validity. This prevents potential race conditions.


470-479: Comprehensive input validation

The updated validation logic properly handles the new TopK input paths and provides clear error messages for debugging.


537-541: Good architectural improvement with dedicated initialization kernel

Using routingInitExpertCounts kernel instead of cudaMemsetAsync provides better performance and consistency for histogram initialization.

tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (1)

30-44: Excellent consistency across all MoE variants

The implementation maintains perfect consistency across all 5 MoE variants (FP4, FP8, MxE4m3MxE2m1, E4m3MxE2m1, Bf16MxE2m1), which improves maintainability and reduces the chance of variant-specific bugs.

Also applies to: 343-354, 605-620, 873-890, 1123-1137

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu (2)

150-152: Correct cluster synchronization

The unconditional cluster barrier ensures all threads synchronize properly before proceeding with the permutation logic, preventing race conditions.


225-225: Appropriate output format change

Writing to mPtrTopKPacked using the PackedScoreIdx format aligns with the new TopK routing architecture and provides a consistent output format.

cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp (2)

55-91: Comprehensive input validation with clear error messages

The dual-path validation logic properly handles both routing_logits and TopK inputs with appropriate dtype and shape checks. The error messages are informative for debugging.


165-165: Excellent device management consistency

The implementation correctly derives the device from the active input and consistently uses it for all allocations and stream selection, preventing cross-device issues.

Also applies to: 234-235

cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (1)

47-82: Well-structured dual routing implementation

The FP4 implementation properly handles both routing_logits and TopK input paths with consistent device management and comprehensive validation, mirroring the robust patterns in the MX variant.

Also applies to: 154-154, 197-197, 204-205

@tensorrt-cicd
Collaborator

PR_Github #14804 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11176 completed with status: 'FAILURE'

Collaborator

@DomBrown DomBrown left a comment


Approving the MoE autotune parts as we discussed on Gitlab. Others should review the other bits as you suggested :)

Collaborator

@djns99 djns99 left a comment


The routing changes make sense to me. Is this something we should be supporting in the CUTLASS backend, or is this just a quirk of trtllm-gen?

@ChristinaZ
Collaborator Author

/bot run --add-multi-gpu-test

@tensorrt-cicd
Collaborator

PR_Github #19151 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19151 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14372 completed with status: 'FAILURE'

@ChristinaZ
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #19196 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19196 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14409 completed with status: 'FAILURE'

@ChristinaZ
Collaborator Author

/bot run

…le post quant allgather

Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
@tensorrt-cicd
Collaborator

PR_Github #19396 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19396 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14571 completed with status: 'FAILURE'

@ChristinaZ
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #19399 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19399 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14574 completed with status: 'FAILURE'

@ChristinaZ
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #19425 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19425 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14594 completed with status: 'FAILURE'

@ChristinaZ
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #19458 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19458 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #14627 completed with status: 'FAILURE'

@ChristinaZ
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #19557 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19557 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14703 completed with status: 'FAILURE'

@ChristinaZ
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #19597 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19597 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14738 completed with status: 'SUCCESS'

@ChristinaZ ChristinaZ merged commit be576a3 into NVIDIA:main Sep 23, 2025
5 checks passed
