fix: [https://nvbugspro.nvidia.com/bug/5349343] Fix mPtrExpertCounts allocation in MoE TRT-LLM backend (nvfp4) #5519
Conversation
Force-pushed from b0f5c2e to 702b664

/bot run

PR_Github #10036 [ run ] triggered by Bot
PR_Github #10036 [ run ] completed with state

Force-pushed from 702b664 to 991b9ed

/bot run

PR_Github #10046 [ run ] triggered by Bot
PR_Github #10046 [ run ] completed with state

Force-pushed from 991b9ed to 44c6b42

…ipping routing tests for unsupported GPU architectures
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>

Force-pushed from 44c6b42 to b67efcb

/bot run

PR_Github #10120 [ run ] triggered by Bot
PR_Github #10120 [ run ] completed with state

/bot run

PR_Github #10133 [ run ] triggered by Bot
PR_Github #10133 [ run ] completed with state
Fix mPtrExpertCounts allocation in MoE TRT-LLM backend (nvfp4)
Description
@zhhuang-nv found that, for DeepSeek Lite, the MoE TRT-LLM backend with nvfp4 reports an error.
Through our analysis, we identified the root cause: the number of bytes cleared by the memset,

`cudaMemsetAsync(data.mPtrExpertCounts, 0, static_cast<size_t>(2 * NumThreads) * sizeof(int32_t), ...)`

exceeds the size actually allocated for the underlying tensor,

`at::Tensor expert_count_histogram = at::detail::empty_cuda({((num_experts * 2 + 255) / 256) * 256}, ...)`

So I fixed this bug by revising the allocation of `expert_count_histogram` so that it is always large enough to cover the `2 * NumThreads` `int32_t` values cleared by the memset.
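For concreteness, the host-side arithmetic below reproduces the mismatch. The values `NumThreads = 256` and `num_experts = 64` (DeepSeek-V2-Lite's routed expert count) are illustrative assumptions for this sketch, not values taken from the diff:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    constexpr int NumThreads = 256;  // assumed kernel block size, for illustration
    constexpr int num_experts = 64;  // assumed expert count (DeepSeek-V2-Lite)

    // Allocation: num_experts * 2 int32_t values, rounded up to a multiple of 256.
    size_t allocated = ((num_experts * 2 + 255) / 256) * 256 * sizeof(int32_t);

    // Memset: always clears 2 * NumThreads int32_t values, independent of num_experts.
    size_t cleared = static_cast<size_t>(2 * NumThreads) * sizeof(int32_t);

    std::printf("allocated = %zu bytes, cleared = %zu bytes\n", allocated, cleared);
    // Prints: allocated = 1024 bytes, cleared = 2048 bytes,
    // so the memset writes 1024 bytes past the end of the tensor.
    return 0;
}
```

Under these assumptions the tensor holds 256 `int32_t` values while the memset clears 512, so the memset overruns the buffer for any `num_experts <= 128`. Sizing the allocation as `max(num_experts * 2, 2 * NumThreads)` values (rounded up to a multiple of 256 as before) is one way to keep the two sides consistent.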