Fix broken Llama4 accuracy in MoE part by nvpohanh · Pull Request #40609 · huggingface/transformers

Conversation

@nvpohanh
Contributor

nvpohanh commented Sep 2, 2025

Llama4 accuracy was broken by a bug in
#39501, which forgot to transpose the router_scores before applying them to routed_in, causing Llama4 to generate garbage output.

This PR fixes the issue by adding the transpose() back and adding comments that explain why it is needed.
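
For context, the sketch below shows the shape mismatch the transpose addresses. The shapes and tensor names are illustrative assumptions, not the actual Llama4 modeling code.

import torch

# Toy sizes, chosen only for illustration.
num_tokens, hidden_dim, num_experts = 4, 8, 2

hidden_states = torch.randn(num_tokens, hidden_dim)
# Assume the router yields one score per (token, expert) pair.
router_scores = torch.rand(num_tokens, num_experts)

# Tokens are replicated once per expert, grouped expert-by-expert:
# row (e * num_tokens + t) holds token t routed to expert e.
routed_in = hidden_states.repeat(num_experts, 1)

# Buggy scaling: flattening without the transpose walks the scores
# token-major, so each replicated token is scaled by another token's weight.
buggy = routed_in * router_scores.reshape(-1, 1)

# Fixed scaling: transpose to (num_experts, num_tokens) first so the
# flattened scores line up with the expert-major replication above.
fixed = routed_in * router_scores.transpose(0, 1).reshape(-1, 1)

print(torch.allclose(buggy, fixed))  # False in general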

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ArthurZucker please review since #39501 was made by you. Thanks!

Accuracy tests

Test script:

from transformers import pipeline
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

# Load the model in bfloat16 and shard it across the available GPUs.
pipe = pipeline(
    "text-generation",
    model=model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

output = pipe("Roses are red,", max_new_tokens=200)

print(output)

Before the fix on H200:

[{'generated_text': 'Roses are red, 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 
8 8 8 8 8 8 8 8 8 8 8 8 8 8 8'}]

After the fix on H200:

[{'generated_text': "Roses are red, violets are blue, and here are some Valentine's Day-themed books to read with your boo! \n Celebrate Black History Month by reading books by Black authors and about Black culture!  These books are for kids and teens! \n These picture books celebrate Black History Month and are perfect for reading with your kids!  These books highlight Black history, cu... \n These books are all winners or nominees for major literary awards, including the Pulitzer Prize, the National Book Award, and t... \n 100 Book Challenge 2024: Read a Book Set in Another Country \n Check out these books that are set in different countries around the world!  Read a book set in another country and see what you... \n Explore the diverse world through these books that are set in different countries and cultures!  Read a book set in another coun... \n 100 Book Challenge 2024: Read a Book Written by an Author of Color \n Check out these books written by authors of color!"}]

nvpohanh added a commit to nvpohanh/TensorRT-LLM that referenced this pull request Sep 2, 2025
… failures

The test_modeling_llama_min_latency.py::test_llama_allclose_to_hf tests
are failing with the latest HF transformers due to a bug in their code.

A PR has been submitted to fix it in the upstream repo:
huggingface/transformers#40609

Until we upgrade to an HF transformers version containing the fix, we
will monkey-patch HF transformers to make these tests pass again.

Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
@Rocketknight1
Member

cc @ArthurZucker!

@vasqu vasqu added the for patch Tag issues / labels that should be included in the next patch label Sep 3, 2025
@FThompsonAWS

FThompsonAWS commented Sep 3, 2025

Confirmed with the given repro that this issue also affects Llama4 CPU execution on transformers v4.54, v4.55, and v4.56. The output is accurate on v4.53.

@nvpohanh
Contributor Author

nvpohanh commented Sep 4, 2025

@ArthurZucker Could you review this? Thanks!

Llama4 accuracy was broken by a bug in
huggingface#39501, which forgot to
transpose the router_scores before applying them to routed_in, causing
Llama4 to generate garbage output.

This PR fixes the issue by adding the transpose() back and adding
comments that explain why it is needed.

Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
@nvpohanh nvpohanh force-pushed the dev/nvpohanh/llama4-moe-fix branch from 004dd11 to 4bebe5a on September 4, 2025 01:32
@Cyrilvallez
Member

Cyrilvallez left a comment

Indeed, this got lost! Thanks a lot for the fix!!

@github-actions
Contributor

github-actions bot commented Sep 4, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: llama4

@Cyrilvallez Cyrilvallez merged commit 519c252 into huggingface:main Sep 4, 2025
22 checks passed
Cyrilvallez added a commit that referenced this pull request Sep 4, 2025
* Fix broken Llama4 accuracy in MoE part

Llama4 accuracy was broken by a bug in
#39501, which forgot to
transpose the router_scores before applying them to routed_in, causing
Llama4 to generate garbage output.

This PR fixes the issue by adding the transpose() back and adding
comments that explain why it is needed.

Signed-off-by: Po-Han Huang <pohanh@nvidia.com>

* remove comment

---------

Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>