fix: [https://nvbugspro.nvidia.com/bug/5349343] Fix mPtrExpertCounts allocation in MoE TRT-LLM backend (nvfp4) #5519
Conversation
Force-pushed from b0f5c2e to 702b664

/bot run

PR_Github #10036 [ run ] triggered by Bot
PR_Github #10036 [ run ] completed with state

Force-pushed from 702b664 to 991b9ed

/bot run

PR_Github #10046 [ run ] triggered by Bot
PR_Github #10046 [ run ] completed with state

Force-pushed from 991b9ed to 44c6b42

…ipping routing tests for unsupported GPU architectures
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>

Force-pushed from 44c6b42 to b67efcb

/bot run

PR_Github #10120 [ run ] triggered by Bot
PR_Github #10120 [ run ] completed with state

/bot run

PR_Github #10133 [ run ] triggered by Bot
PR_Github #10133 [ run ] completed with state
Fix mPtrExpertCounts allocation in MoE TRT-LLM backend (nvfp4)
Description
@zhhuang-nv found that, for DeepSeek Lite, the MoE TRT-LLM backend with nvfp4 reports an error.
Through our analysis, we identified the root cause: the number of bytes cleared by the memset,

`cudaMemsetAsync(data.mPtrExpertCounts, 0, static_cast<size_t>(2 * NumThreads) * sizeof(int32_t), ...)`

exceeds the size actually allocated for the underlying tensor,

`at::Tensor expert_count_histogram = at::detail::empty_cuda({((num_experts * 2 + 255) / 256) * 256}, ...)`

So I fixed this bug by revising the allocation of `expert_count_histogram` so that it is always large enough to cover the `2 * NumThreads` `int32_t` values cleared by the memset.
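For concreteness, the host-side arithmetic below reproduces the mismatch. The values `NumThreads = 256` and `num_experts = 64` (DeepSeek-V2-Lite's routed expert count) are illustrative assumptions for this sketch, not values taken from the diff:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    constexpr int NumThreads = 256;  // assumed kernel block size, for illustration
    constexpr int num_experts = 64;  // assumed expert count (DeepSeek-V2-Lite)

    // Allocation: num_experts * 2 int32_t values, rounded up to a multiple of 256.
    size_t allocated = ((num_experts * 2 + 255) / 256) * 256 * sizeof(int32_t);

    // Memset: always clears 2 * NumThreads int32_t values, independent of num_experts.
    size_t cleared = static_cast<size_t>(2 * NumThreads) * sizeof(int32_t);

    std::printf("allocated = %zu bytes, cleared = %zu bytes\n", allocated, cleared);
    // Prints: allocated = 1024 bytes, cleared = 2048 bytes,
    // so the memset writes 1024 bytes past the end of the tensor.
    return 0;
}
```

Under these assumptions the tensor holds 256 `int32_t` values while the memset clears 512, so the memset overruns the buffer for any `num_experts <= 128`. Sizing the allocation as `max(num_experts * 2, 2 * NumThreads)` values (rounded up to a multiple of 256 as before) is one way to keep the two sides consistent.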