[None][feat] Revise the calculation related to TileN in routing of MOE TRTLLM backend #8148

ChristinaZ · 2025-10-05T11:21:32Z

Summary by CodeRabbit

New Features
- Added support for configurable tile-based token routing, enabling non–power-of-two tiling alongside existing power-of-two paths.
- Expanded routing flexibility across expert permutations, offsets, and index sizing with automatic path selection.
Bug Fixes
- Replaced hard runtime failures with safer fallbacks in configuration detection, improving stability.
- Removed overly strict padding constraints, allowing broader valid configurations.
Tests
- Updated unit tests to include tile token dimension and compute capability parameters.
- Expanded coverage to validate both power-of-two and tile-based routing paths.

Description

Previously, the TRTLLM backend used mPaddingLog2 to accelerate the related calculations through functions such as divUpLog2(), mulLog2(), and divUpMulLog2(). However, tileN may now take a value such as 192, which is not a power of 2, so I replaced them with functions such as mulTileN(), divUpTileN(), and divUpMulTileN().
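
To make the difference concrete, here is a minimal host-side sketch of the two helper families. The function names come from this PR, but the signatures and bodies below are assumptions written for illustration, not the code in RoutingKernel.cuh (which may be templated and marked `__host__ __device__`).

```cpp
#include <cstdint>

// Power-of-two path: the tile size is 2^log2, so everything reduces to shifts.
constexpr int32_t mulLog2(int32_t a, int32_t log2) { return a << log2; }
constexpr int32_t divUpLog2(int32_t a, int32_t log2) { return (a + (1 << log2) - 1) >> log2; }
constexpr int32_t divUpMulLog2(int32_t a, int32_t log2) { return mulLog2(divUpLog2(a, log2), log2); }

// General path: works for any positive tile size (e.g. 192), using an integer
// multiply and divide instead of shifts.
constexpr int32_t mulTileN(int32_t a, int32_t tileN) { return a * tileN; }
constexpr int32_t divUpTileN(int32_t a, int32_t tileN) { return (a + tileN - 1) / tileN; }
constexpr int32_t divUpMulTileN(int32_t a, int32_t tileN) { return divUpTileN(a, tileN) * tileN; }

// Both families agree when the tile size is a power of two (log2 = 3, tileN = 8).
static_assert(divUpLog2(20, 3) == divUpTileN(20, 8), "pow2 and tile paths must agree");
// Only the general path can express tileN = 192: 200 tokens pad up to 2 tiles.
static_assert(divUpMulTileN(200, 192) == 384, "200 tokens round up to 384");
```

The integer division in the general path is the likely source of the extra cost compared with a shift, which is why keeping the log2 variants for power-of-two tile sizes is attractive.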

On performance: I compared the two versions, and in general this modification slightly increases the running time. For example, with mPaddingLog2=3 (mTileN=8), the kernel routingRenormalize::routingIndicesClusterKernel shows a 3% performance regression.

So I think it is better to add one more template parameter so that the power-of-two case can still use the previous variable mPaddingLog2.
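
A rough sketch of what such a template parameter could look like follows. `KernelParamsSketch` and `numCtaFor` are hypothetical names invented for this example. Only `isPow2`, `mPaddingLog2`, and `mTileTokensDim` appear in the PR, and the real structs carry many more fields. The sketch assumes C++17 for `if constexpr`.

```cpp
#include <cstdint>

// Hypothetical, reduced stand-in for the routing kernel parameters.
template <bool isPow2_>
struct KernelParamsSketch
{
    static constexpr bool isPow2 = isPow2_;
    int32_t mPaddingLog2 = -1;  // only meaningful when isPow2 is true
    int32_t mTileTokensDim = 1; // tile size in tokens, valid on both paths
};

// Number of tiles (CTAs) needed to cover `count` tokens.
// The branch is resolved at compile time, so each instantiation
// contains only the shift version or only the divide version.
template <typename KernelParams>
constexpr int32_t numCtaFor(KernelParams const& params, int32_t count)
{
    if constexpr (KernelParams::isPow2)
    {
        return (count + (1 << params.mPaddingLog2) - 1) >> params.mPaddingLog2;
    }
    else
    {
        return (count + params.mTileTokensDim - 1) / params.mTileTokensDim;
    }
}
```

Because the choice is a template argument rather than a runtime flag, the power-of-two instantiation should compile to the same shift-based code as before the change.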

Test Coverage

./tests/unit_tests/kernels/routingKernelsTest
pytest -v -s tests/unittest/_torch/thop/parallel/test_moe.py

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

ChristinaZ · 2025-10-05T12:19:04Z

/bot run

tensorrt-cicd · 2025-10-05T12:24:33Z

PR_Github #20647 [ run ] triggered by Bot

tensorrt-cicd · 2025-10-05T12:42:46Z

PR_Github #20647 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #15593 completed with status: 'FAILURE'

ChristinaZ · 2025-10-07T02:54:46Z

/bot run

tensorrt-cicd · 2025-10-07T03:09:31Z

PR_Github #20700 [ run ] triggered by Bot

tensorrt-cicd · 2025-10-07T05:38:01Z

PR_Github #20700 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15638 completed with status: 'FAILURE'

ChristinaZ · 2025-10-07T13:12:58Z

/bot run

tensorrt-cicd · 2025-10-07T13:18:57Z

PR_Github #20732 [ run ] triggered by Bot

tensorrt-cicd · 2025-10-07T14:57:45Z

PR_Github #20732 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15666 completed with status: 'FAILURE'

ChristinaZ · 2025-10-08T08:05:36Z

/bot run

tensorrt-cicd · 2025-10-08T08:10:51Z

PR_Github #20778 [ run ] triggered by Bot

tensorrt-cicd · 2025-10-08T11:02:01Z

PR_Github #20778 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15704 completed with status: 'SUCCESS'

coderabbitai · 2025-10-09T02:18:43Z

📝 Walkthrough

Adds tile-based tiling alongside power-of-two padding across routing kernels. Introduces mTileTokensDim and isPow2 template parameter. Implements tile arithmetic helpers and switches computations (CTA counts, limits, offsets, sizes) between pow2 and tile paths. Updates launch macros, runner propagation, and unit tests to use tileTokensDim and revised parameterization.

Changes

| Cohort / File(s) | Summary of changes |
|---|---|
| Launch macros and dispatch `cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h` | Adds LAUNCH_TILEN macro keyed on mPaddingLog2; replaces LAUNCH_PDL with LAUNCH_TILEN in routing dispatch branches, affecting dtype/expW and extra-flag paths. |
| Kernel parameterization and data model `cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h` | Adds mTileTokensDim to DataBase/KernelParamsBase; introduces template bool isPow2_ across KernelParams; exposes static constexpr isPow2; defaults mPaddingLog2 to -1; setBaseParams now propagates mTileTokensDim. |
| Tile arithmetic helpers (device headers) `cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh` | Adds mulTileN, divUpTileN, divUpMulTileN; switches numCta, mnLimit, offsets, permutedIdxSize to pow2 vs tile branches via constexpr; preserves existing pow2 behavior. |
| Routing kernels: DeepSeek `cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu` | Adds isPow2-controlled paths for numCta, mnLimit, offsets, permutedIdxSize using TileN variants; removes strict padding check; aligns exclusive-sum/index wiring with dual tiling. |
| Routing kernels: Llama4 `cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu` | Branches computations on isPow2 for numCta, mnLimit, offsets, permutedIdxSize; updates finalExpertOffset calculations; removes padding-log2 < 8 check; retains overall flow. |
| Routing kernels: Renormalize `cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu` | Applies isPow2-based branching for counts/limits and permutedIdxSize; updates ExclusiveSum inputs via TileN variants; removes padding check. |
| Runner and config propagation `cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu` | computeLog2 now returns -1 on non-pow2; propagates routingData.mTileTokensDim for DeepSeekV3, Llama4, Renormalize paths. |
| Test infra: helpers and params `cpp/tests/unit_tests/kernels/routing/routingTest.h` | Adds host/device mulTileN/divUpTileN/divUpMulTileN; extends RoutingKernelTestParam with tileTokensDim and requiredComputeCapability (defaulted); propagates mTileTokensDim in setCommonParams. |
| Test logic updates to tile path `cpp/tests/unit_tests/kernels/routing/routingTest.cpp` | Replaces paddingLog2-based math with tileTokensDim-based (sizes, prefix sums, CTA counts, limits). |
| Unit tests: param wiring `cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp`, `cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp` | Adds tileTokensDim argument (8) to RoutingKernelTestParam calls across tests. |
| Unit tests: Llama4 params `cpp/tests/unit_tests/kernels/routing/routingLlama4Test.cpp` | Adds tileTokensDim (8) and requiredComputeCapability to RoutingKernelTestParam calls. |
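
The runner change summarized above ("computeLog2 now returns -1 on non-pow2") can be illustrated with a small sketch. `computeLog2Sketch` is a hypothetical name and the body is an assumption; the actual function in runner.cu may differ in signature and in how it treats edge cases such as zero.

```cpp
#include <cstdint>

// Returns log2(value) when value is a positive power of two, and -1 otherwise.
// The -1 sentinel replaces a hard failure, so callers can fall back to the
// general tile path instead of aborting.
constexpr int32_t computeLog2Sketch(int32_t value)
{
    // A positive power of two has exactly one bit set.
    if (value <= 0 || (value & (value - 1)) != 0)
    {
        return -1;
    }
    int32_t log2 = 0;
    while ((1 << log2) < value)
    {
        ++log2;
    }
    return log2;
}
```

With this convention, a tile size of 8 maps to 3 and a tile size of 192 maps to -1, which matches the new default of -1 for mPaddingLog2.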

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Host as Host (runner.cu)
  participant KP as KernelParamsBase/Kernels
  participant Math as Tile/Pow2 helpers
  participant Kern as Routing Kernels

  Host->>KP: setBaseParams(data)\n(mTileTokensDim, mPaddingLog2)
  Note over KP: KP::isPow2 (template constexpr)

  KP->>Kern: launch routing kernels
  alt KP::isPow2 == true
    Kern->>Math: divUpLog2/mulLog2 for\nnumCta, mnLimit, offsets, sizes
  else KP::isPow2 == false
    Kern->>Math: divUpTileN/mulTileN for\nnumCta, mnLimit, offsets, sizes
  end

  Kern-->>Host: results (permutedIdx, sizes)
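
The host-side step in the diagram, turning a runtime value into a compile-time `isPow2` choice, might look roughly like the following. All names here (`RoutingDataSketch`, `paddedTokenCount`, `launchPaddedTokenCount`) are hypothetical, and the `>= 0` test is an assumption about how a LAUNCH_TILEN-style dispatch keys on mPaddingLog2. The real code uses a macro and launches CUDA kernels rather than calling a host function.

```cpp
#include <cstdint>

// Hypothetical host-side view of the routing configuration.
struct RoutingDataSketch
{
    int32_t mPaddingLog2;   // -1 when the tile size is not a power of two
    int32_t mTileTokensDim; // tile size in tokens, e.g. 8 or 192
};

// Stand-in for a kernel that is instantiated once per isPow2 value.
// It rounds `count` up to a whole number of tiles.
template <bool isPow2>
int32_t paddedTokenCount(RoutingDataSketch const& data, int32_t count)
{
    if constexpr (isPow2)
    {
        int32_t const mask = (1 << data.mPaddingLog2) - 1;
        return (count + mask) & ~mask;
    }
    else
    {
        return (count + data.mTileTokensDim - 1) / data.mTileTokensDim * data.mTileTokensDim;
    }
}

// Runtime-to-compile-time dispatch: pick the instantiation from mPaddingLog2.
inline int32_t launchPaddedTokenCount(RoutingDataSketch const& data, int32_t count)
{
    return data.mPaddingLog2 >= 0 ? paddedTokenCount<true>(data, count)
                                  : paddedTokenCount<false>(data, count);
}
```

The dispatch happens once per launch on the host, so the per-element arithmetic inside the kernel never has to test which path it is on.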

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description Check | ⚠️ Warning | The pull request description has a clear “Description,” “Test Coverage,” and checklist, but it still contains the template instructions and the @coderabbitai summary placeholder without an actual summary, so it does not fully follow the required template structure. | Please replace the @coderabbitai summary placeholder with a concise summary of the changes, remove the template instruction block at the top, and ensure the description begins with the filled‐in summary followed by the required sections. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 18.18% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title Check | ✅ Passed | The title clearly summarizes the main change of replacing padding‐based calculations with tileN‐aware helpers in the MOE TRTLLM backend, follows the repository’s “[None][type] Summary” template, and is concise and specific enough for a reviewer to understand the primary update without extraneous details. |

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9298f1b and 6dbbe92.

📒 Files selected for processing (12)

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h (2 hunks)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu (2 hunks)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh (7 hunks)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h (7 hunks)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu (3 hunks)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu (2 hunks)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu (4 hunks)
cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp (8 hunks)
cpp/tests/unit_tests/kernels/routing/routingLlama4Test.cpp (6 hunks)
cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp (10 hunks)
cpp/tests/unit_tests/kernels/routing/routingTest.cpp (5 hunks)
cpp/tests/unit_tests/kernels/routing/routingTest.h (4 hunks)

🧰 Additional context used

📓 Path-based instructions (7)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}