Fix broken Llama4 accuracy in MoE part by nvpohanh · Pull Request #40609 · huggingface/transformers

Conversation

@nvpohanh
Contributor

nvpohanh commented Sep 2, 2025

Llama4 accuracy was broken by a bug in
#39501, which forgot to transpose the router_scores before applying them to routed_in, causing Llama4 to generate garbage output.

This PR fixes the issue by adding the transpose() back and adding comments that explain why it is needed.
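
For context, the sketch below shows the shape mismatch the transpose addresses. The shapes and tensor names are illustrative assumptions, not the actual Llama4 modeling code.

import torch

# Toy sizes, chosen only for illustration.
num_tokens, hidden_dim, num_experts = 4, 8, 2

hidden_states = torch.randn(num_tokens, hidden_dim)
# Assume the router yields one score per (token, expert) pair.
router_scores = torch.rand(num_tokens, num_experts)

# Tokens are replicated once per expert, grouped expert-by-expert:
# row (e * num_tokens + t) holds token t routed to expert e.
routed_in = hidden_states.repeat(num_experts, 1)

# Buggy scaling: flattening without the transpose walks the scores
# token-major, so each replicated token is scaled by another token's weight.
buggy = routed_in * router_scores.reshape(-1, 1)

# Fixed scaling: transpose to (num_experts, num_tokens) first so the
# flattened scores line up with the expert-major replication above.
fixed = routed_in * router_scores.transpose(0, 1).reshape(-1, 1)

print(torch.allclose(buggy, fixed))  # False in general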

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ArthurZucker please review since #39501 was made by you. Thanks!

Accuracy tests

Test script:

from transformers import pipeline
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

# Load the model in bfloat16 and shard it across the available GPUs.
pipe = pipeline(
    "text-generation",
    model=model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

output = pipe("Roses are red,", max_new_tokens=200)

print(output)

Before the fix on H200:

[{'generated_text': 'Roses are red, 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 
8 8 8 8 8 8 8 8 8 8 8 8 8 8 8'}]

After the fix on H200:

[{'generated_text': "Roses are red, violets are blue, and here are some Valentine's Day-themed books to read with your boo! \n Celebrate Black History Month by reading books by Black authors and about Black culture!  These books are for kids and teens! \n These picture books celebrate Black History Month and are perfect for reading with your kids!  These books highlight Black history, cu... \n These books are all winners or nominees for major literary awards, including the Pulitzer Prize, the National Book Award, and t... \n 100 Book Challenge 2024: Read a Book Set in Another Country \n Check out these books that are set in different countries around the world!  Read a book set in another country and see what you... \n Explore the diverse world through these books that are set in different countries and cultures!  Read a book set in another coun... \n 100 Book Challenge 2024: Read a Book Written by an Author of Color \n Check out these books written by authors of color!"}]

nvpohanh added a commit to nvpohanh/TensorRT-LLM that referenced this pull request Sep 2, 2025
… failures

The test_modeling_llama_min_latency.py::test_llama_allclose_to_hf tests
are failing with the latest HF transformers due to a bug in their code.

A PR has been submitted to fix it in the upstream repo:
huggingface/transformers#40609

Until we upgrade to an HF transformers version containing the fix, we
will monkey-patch HF transformers to make these tests pass again.

Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
@Rocketknight1
Member

cc @ArthurZucker!

@vasqu vasqu added the for patch Tag issues / labels that should be included in the next patch label Sep 3, 2025
@FThompsonAWS

FThompsonAWS commented Sep 3, 2025

Confirmed with the given repro that this issue also affects Llama4 CPU execution on transformers v4.54, v4.55, and v4.56. The output is accurate on v4.53.

@nvpohanh
Contributor Author

nvpohanh commented Sep 4, 2025

@ArthurZucker Could you review this? Thanks!

Llama4 accuracy was broken by a bug in
huggingface#39501, which forgot to
transpose the router_scores before applying them to routed_in, causing
Llama4 to generate garbage output.

This PR fixes the issue by adding the transpose() back and adding
comments that explain why it is needed.

Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
@nvpohanh nvpohanh force-pushed the dev/nvpohanh/llama4-moe-fix branch from 004dd11 to 4bebe5a on September 4, 2025 01:32
@Cyrilvallez
Member

Cyrilvallez left a comment

Indeed, this got lost! Thanks a lot for the fix!!

@github-actions
Contributor

github-actions bot commented Sep 4, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: llama4

@Cyrilvallez Cyrilvallez merged commit 519c252 into huggingface:main Sep 4, 2025
22 checks passed
Cyrilvallez added a commit that referenced this pull request Sep 4, 2025
* Fix broken Llama4 accuracy in MoE part

Llama4 accuracy was broken by a bug in
#39501, which forgot to
transpose the router_scores before applying them to routed_in, causing
Llama4 to generate garbage output.

This PR fixes the issue by adding the transpose() back and adding
comments that explain why it is needed.

Signed-off-by: Po-Han Huang <pohanh@nvidia.com>

* remove comment

---------

Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>