[5321981] fix: Fix the Llama3.1 405B hanging issue. by hyukn · Pull Request #5698 · NVIDIA/TensorRT-LLM · GitHub

Conversation

@hyukn
Copy link
Collaborator

@hyukn hyukn commented Jul 3, 2025

The output shapes of the fusedLayerNorm plugin for nvFP4 are mismatched. The resulting out-of-range memory writes pollute the barrier buffer of the one-shot allreduce kernel, which causes the hang.
Because other data buffers are corrupted by the same out-of-range writes, this may also be the root cause of the accuracy issue that @zihaok recently reported.
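To illustrate the failure mode, here is a minimal sketch (not the plugin's actual memory layout; buffer sizes, the flag protocol, and the adjacency of data and barrier buffers are all illustrative assumptions): when a kernel writes more elements than its declared output shape, the overflow can land in an adjacent barrier buffer, so a rank spin-waiting on its barrier flags never sees the expected value and the one-shot allreduce hangs.

```python
import numpy as np

# One workspace allocation: 16 data slots followed by 4 barrier flags
# (illustrative layout, not TensorRT-LLM's actual workspace layout).
workspace = np.zeros(16 + 4, dtype=np.int32)
data = workspace[:16]     # view: the plugin's output region
barrier = workspace[16:]  # view: per-rank arrival flags for the allreduce

# All ranks have "arrived": each flag is set to 1.
barrier[:] = 1

# Bug: the kernel's actual write length exceeds its declared output shape,
# so it writes 18 elements into a 16-element output region.
actual_write_len = 18
workspace[:actual_write_len] = 7  # clobbers barrier[0] and barrier[1]

# The allreduce's spin-wait condition (all flags == 1) can now never hold.
would_hang = not all(flag == 1 for flag in barrier)
print(would_hang)  # → True
```

With the output shape corrected, the write stays inside the data region, the barrier flags remain intact, and the spin-wait completes.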

@hyukn hyukn requested review from liji-nv and zihaok July 3, 2025 06:25
@hyukn hyukn requested a review from a team as a code owner July 3, 2025 06:25
@hyukn hyukn force-pushed the fix/5321981 branch 2 times, most recently from 6de8fa2 to d694659 Compare July 3, 2025 06:39
@hyukn
Copy link
Collaborator Author

hyukn commented Jul 3, 2025

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd
Copy link
Collaborator

PR_Github #10770 [ run ] triggered by Bot

@hyukn hyukn requested a review from litaotju July 3, 2025 06:58
@hyukn hyukn changed the title [5321981] fix: Fix the Llama-405B hanging issue. [5321981] fix: Fix the Llama 3.1-405B hanging issue. Jul 3, 2025
@hyukn hyukn changed the title [5321981] fix: Fix the Llama 3.1-405B hanging issue. [5321981] fix: Fix the Llama3.1 405B hanging issue. Jul 3, 2025
@tensorrt-cicd
Copy link
Collaborator

PR_Github #10770 [ run ] completed with state SUCCESS
/LLM/release-0.21/L0_MergeRequest_PR pipeline #143 completed with status: 'FAILURE'

@hyukn
Copy link
Collaborator Author

hyukn commented Jul 3, 2025

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd
Copy link
Collaborator

PR_Github #10827 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #10827 [ run ] completed with state SUCCESS
/LLM/release-0.21/L0_MergeRequest_PR pipeline #148 completed with status: 'SUCCESS'
Pipeline passed with automatically retried tests. Check the rerun report for details.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
@hyukn
Copy link
Collaborator Author

hyukn commented Jul 4, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #10887 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #10887 [ run ] completed with state SUCCESS
/LLM/release-0.21/L0_MergeRequest_PR pipeline #157 completed with status: 'SUCCESS'

@hyukn hyukn merged commit b0354ef into NVIDIA:release/0.21 Jul 4, 2025
3 checks passed
dc3671 pushed a commit to dc3671/TensorRT-LLM that referenced this pull request Jul 10, 2025
Correct the output shape of the fusedLayerNormPlugin.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
dc3671 pushed a commit to dc3671/TensorRT-LLM that referenced this pull request Jul 10, 2025
Correct the output shape of the fusedLayerNormPlugin.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
nvzhihanj pushed a commit to nvzhihanj/TensorRT-LLM that referenced this pull request Jul 10, 2025
Correct the output shape of the fusedLayerNormPlugin.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
hyukn added a commit that referenced this pull request Jul 10, 2025
…#5698) (#5925)

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Co-authored-by: Yukun He <23156053+hyukn@users.noreply.github.com>
nvzhihanj added a commit that referenced this pull request Jul 11, 2025
…#5698) (#5925)

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Co-authored-by: Yukun He <23156053+hyukn@users.noreply.github.com>
zhou-yuxin pushed a commit to zhou-yuxin/TensorRT-LLM that referenced this pull request Jul 15, 2025
…NVIDIA#5698) (NVIDIA#5925)

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Co-authored-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: Yuxin <yuxinz@nvidia.com>
